Can Felixa Read Minds?
Imagine you are playing a game and you miss the shot. Because of you, the team loses. For a moment, nobody says anything.
Then someone drops:
“gj”
Good job.
On its own, the message looks harmless. There is no profanity, no threat, no obvious insult. A traditional moderation system might even classify it as positive. But the moment you read it, you know how your teammates feel, what they think of you, and that a string of insults is coming.
Harmful interactions often emerge not from the literal meaning of words, but from the social meaning behind them. A system that only analyzes text at face value can miss the earliest signs of conflict.
Toxicity in online gaming rarely emerges without warning. It tends to follow a measurable progression marked by early behavioural shifts, such as a sudden drop in politeness or the end of cooperative praise. Understanding these early signals requires looking beyond individual messages and examining how players interpret one another’s actions, expectations, and intentions during an interaction. In this article, we explore how Felixa applies Machine Theory of Mind to model the social context behind player interactions, helping detect pre-toxic behavioural patterns before explicit toxicity appears.
The Evolutionary Superpower: Human Theory of Mind
Theory of Mind is the cognitive ability that allows us to reason about the mental states of other people. The concept was formally developed in psychology and cognitive science in the late 1970s, most notably through the work of David Premack and Guy Woodruff, who asked whether non-human primates could understand the intentions and beliefs of others. Researchers became interested in Theory of Mind because successful social interaction depends on more than observing behaviour; it requires understanding what others know, believe, want, or expect. Studies of child development later showed that this ability emerges gradually during early childhood as children learn to distinguish their own knowledge from the perspectives of other people.
Because we constantly make these inferences during social interactions, Theory of Mind helps explain why the same words can carry different meanings depending on who says them, when they are said, and the situation in which they occur. In a suspense film, the audience may know that danger is behind the door while the character on screen does not. The tension comes from tracking two perspectives at once: what we know and what the character believes.
This same cognitive ability is constantly used in online gaming. Players interpret whether a teammate’s silence means concentration or frustration, whether a quick “gj” is genuine praise or sarcasm, whether repeated pings are helpful reminders or signs of irritation, and whether a missed objective was an honest mistake or intentional neglect. In each case, players are not responding to behaviour alone; they are inferring what other players know, intend, expect, or feel based on the surrounding context. Felixa applies this principle through Machine Theory of Mind, a computational approach that models how social meaning emerges during player interactions.
The Digital Blindspot of Pre-Toxic Conflict in Gaming
Online gaming environments strip away many of the cues that normally support Theory of Mind. Facial expressions, vocal tone, and body language are often absent, forcing players to interpret intent through incomplete and ambiguous signals. As neuroscientist Dean Mobbs notes, communication in a disembodied environment with reduced accountability can produce severe disinhibition, allowing aggressive impulses to bypass normal social filters. This missing context creates a digital blindspot where misunderstandings, frustration, and social tension can accumulate before any explicit abuse appears. Research suggests that these moments are not random. In a landmark 2015 study, Vlad Niculae and colleagues identified linguistic harbingers of betrayal in an online strategy game, showing that conflict is often preceded by measurable shifts in behaviour. Players preparing to initiate conflict became unusually positive, grew less polite, and withdrew from future-oriented planning. Other studies have found similar patterns, including declines in cooperative praise before the onset of harassment.
The Leap to Silicon and the LLM Debate
For a long time, computers were effectively unable to perceive these dynamics. A traditional program could process the words exchanged between players, but it could not distinguish between what was objectively present in the data and what individual players believed, intended, misunderstood, or expected. This limitation mirrors the core challenge that Theory of Mind addresses: recognizing that different people can interpret the same situation in different ways based on their own beliefs, knowledge, and expectations.
Machine Theory of Mind emerged as an attempt to address this gap. In 2018, Neil Rabinowitz and colleagues at DeepMind introduced ToMnet, a neural architecture that learned to model the hidden goals and beliefs of agents purely by observing their behaviour. The system passed an artificial version of the Sally-Anne false belief test, demonstrating that machines could begin to infer perspectives rather than simply record actions. Rather than treating behaviours as isolated events, Machine Theory of Mind treats them as signals whose meaning depends on the viewpoints of the individuals involved.
Recent advances in Large Language Models have intensified interest in this idea. Research by Michal Kosinski found that GPT-4 could solve many classic Theory of Mind tasks, while studies by James Strachan and colleagues showed strong performance on identifying indirect requests and implied meanings. At first glance, these results suggested that AI systems might be developing human-like social reasoning.
However, the scientific picture remains contested. Sap and colleagues (2022) showed that large language models often struggle with broader social reasoning tasks, while Ullman (2023) demonstrated that small changes to familiar tests can cause performance to collapse. The debate is therefore less about whether AI can produce Theory of Mind-like responses and more about what those responses actually represent.
On the one hand, models such as GPT-4, Claude 3, and Gemini have demonstrated strong performance on many Theory of Mind-style benchmarks, often correctly inferring beliefs, intentions, or misunderstandings from short scenarios. Kosinski (2024), for example, reported that GPT-4 solved about three quarters of a battery of false-belief tasks, a level comparable to that of six-year-old children in earlier studies, and later work found that leading language models can solve more complex social reasoning problems under controlled conditions. On the other hand, performance often drops when tasks are reworded, embedded in more realistic settings, or require tracking beliefs over longer interactions.
Researchers therefore disagree on what current results demonstrate. Strong benchmark performance may reflect sophisticated pattern recognition, learned social regularities from large-scale training data, emerging reasoning abilities, or some combination of these mechanisms. A major focus of current research is determining how robust these abilities are, how well they generalize beyond benchmark tests, and what additional mechanisms are needed for reliable social reasoning in real-world environments.
Understanding Social Meaning Through the Felixa Framework
Felixa applies Machine Theory of Mind by modelling the perspective-taking process that Theory of Mind describes in humans. Rather than focusing only on what players say or do, it examines how those actions are likely to be interpreted by others within a specific context. To do this, Felixa analyzes gameplay actions, communication patterns, timing, interaction history, and behavioural changes over time. By identifying situations in which players may interpret the same event differently, the system can detect conditions that often precede conflict.
Its operation can be understood as a five-stage conceptual process:
Observable Action – A player sends a message, issues a ping, performs an in-game action, or otherwise communicates with teammates.
Contextual Analysis – The action is evaluated in relation to the current game state, recent events, team dynamics, and the player's established behavioural patterns.
Social Meaning Estimation – The system estimates how other players are likely to interpret the action within that context, recognising that the same behaviour can carry different meanings in different situations.
Behavioural Change Detection – Felixa identifies shifts away from previous cooperative behaviour, including signs of frustration, disengagement, blame, hostility, or targeted interactions.
Pre-Toxic Escalation Assessment – The interaction is flagged when multiple indicators combine to suggest an increased likelihood of future toxic behaviour.
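The five stages above can be sketched in a few lines of Python. This is only an illustrative toy, not Felixa's implementation: the praise lexicon, the `Action` and `Context` fields, the fixed likelihood values, the weights, and the threshold are all hypothetical stand-ins chosen to make the flow of information between stages visible.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon of short praise messages (an assumption, not Felixa's).
PRAISE_WORDS = {"gj", "nice", "wp", "ty"}


@dataclass
class Action:
    """Stage 1: an observable action, here reduced to a chat message."""
    player: str
    text: str


@dataclass
class Context:
    """Stage 2: simplified context. All fields are illustrative assumptions."""
    follows_teammate_error: bool        # did a teammate just make a costly mistake?
    sender_recent_praise_rate: float    # share of praise in the sender's recent messages, 0..1
    sender_baseline_praise_rate: float  # the sender's long-run share of praise, 0..1


def estimate_sarcasm_likelihood(action: Action, ctx: Context) -> float:
    """Stage 3: crude estimate of how likely praise is to be read as sarcasm."""
    if action.text.lower().strip() not in PRAISE_WORDS:
        return 0.0
    # Praise right after a teammate's error is far more likely to land as sarcasm.
    return 0.7 if ctx.follows_teammate_error else 0.1


def behaviour_shift(ctx: Context) -> float:
    """Stage 4: drop in cooperative praise relative to the sender's own baseline."""
    return max(0.0, ctx.sender_baseline_praise_rate - ctx.sender_recent_praise_rate)


def assess(action: Action, ctx: Context, threshold: float = 0.5) -> dict:
    """Stage 5: flag for human review only when several indicators combine."""
    sarcasm = estimate_sarcasm_likelihood(action, ctx)
    shift = behaviour_shift(ctx)
    score = 0.6 * sarcasm + 0.4 * shift  # hypothetical weights
    return {"score": score, "flag_for_review": score >= threshold}
```

With these made-up numbers, a “gj” sent right after a teammate's error by a player whose praise rate has fallen from 0.5 to 0.1 crosses the threshold, while the same “gj” from a player with a stable praise rate after a good play does not. The output is a flag for review, not a verdict.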
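The importance of context becomes clear when examining messages whose literal wording appears neutral, but whose social meaning changes depending on the surrounding circumstances. A “gj” sent after a teammate lands a difficult shot reads as genuine praise; the same “gj” sent into the silence after a missed shot reads as sarcasm. Likewise, repeated pings can be helpful reminders in one moment and signs of irritation in another.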
These examples show why communication cannot be understood from literal content alone. The same words, pings, or actions can communicate encouragement, frustration, sarcasm, criticism, or hostility depending on timing, recent events, player relationships, and the broader flow of the match.
Felixa therefore focuses on the relationship between actions, context, and interpretation. By examining how communication and behaviour evolve over the course of an interaction, and how players respond to one another, it uses behavioural and contextual evidence to estimate how each interaction is likely being understood by the players involved. It does not infer private thoughts or determine intentions with certainty. Instead, it relies on computational models trained on large collections of interaction data to identify behavioural patterns that commonly precede conflict. These signals are analysed over time, allowing the model to compare current behaviour against established patterns observed in prior matches. Research in social cognition, pragmatics, behavioural psychology, and machine learning suggests that such contextual changes often emerge before explicit toxicity appears.
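One simple way to picture the comparison of current behaviour against prior matches is a standard score (z-score) against a player's own history. The sketch below is an assumption for illustration only; the article does not say which statistic Felixa uses, and the function name, the praise-rate feature, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def deviation_from_baseline(prior_rates: list[float], current_rate: float) -> float:
    """Return how many standard deviations the current value sits from the
    player's own history. Negative values mean a drop below the baseline.

    Returns 0.0 when there is too little history, or no variation in it,
    to make a meaningful comparison.
    """
    if len(prior_rates) < 2:
        return 0.0
    sigma = stdev(prior_rates)
    if sigma == 0:
        return 0.0
    return (current_rate - mean(prior_rates)) / sigma


# Hypothetical example: share of a player's messages that were cooperative
# praise in four earlier matches, compared with the current match.
history = [0.30, 0.25, 0.35, 0.30]
z = deviation_from_baseline(history, current_rate=0.05)
praise_dropped_sharply = z < -2.0  # hypothetical cut-off for a notable shift
```

Comparing a player against their own history, not against a global average, keeps a naturally quiet player from looking like a disengaged one. A single deviation like this would be one indicator among several, not a conclusion by itself.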
Ethical Boundaries and the Future of Moderation
Because these systems operate by identifying patterns and making probabilistic estimates, they can make mistakes. Players often joke, tease, or use sarcasm as part of friendly social bonding, and behaviour that appears hostile in isolation may be harmless within the context of an established relationship.
Felixa uses Theory of Mind-inspired methods as probabilistic tools for modelling likely social interpretations within an interaction. This means focusing on observable behavioural signals, contextual patterns, and evolving player dynamics rather than treating inferred mental states as facts. The ongoing research debate therefore reinforces the need for caution, transparency, and continuous validation when applying Machine Theory of Mind to real-world gaming environments.
For this reason, systems like Felixa should be treated as decision-support tools instead of automated judges. Their role is to flag interactions that may warrant attention, while leaving decisions about guilt or punishment to human moderators. The goal is to provide early warning signals that help moderators and developers respond more effectively while preserving the nuance and spontaneity that make online communities enjoyable.
Ultimately, this represents a broader shift in how moderation is approached. By focusing on prevention alongside accountability, contextual AI creates opportunities to identify early signs of frustration, blame, withdrawal, or escalation before conflicts become severe. It also reflects a move toward systems that consider context alongside language, recognising that meaning depends on timing, relationships, and shared understanding. If implemented responsibly, these approaches may help protect the cooperative and social foundations of online spaces while respecting the limits of what AI can truly know about human behaviour.
References
Premack, D., and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1, 515–526.
Wimmer, H., and Perner, J. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13, 103–128.
Baron-Cohen, S., Leslie, A. M., and Frith, U. 1985. Does the autistic child have a “theory of mind”? Cognition, 21, 37–46.
Gallagher, S. 2001. The practice of mind: Theory, simulation, or primary interaction? Journal of Consciousness Studies, 8, 83–108.
Niculae, V., Kumar, S., Boyd-Graber, J., and Danescu-Niculescu-Mizil, C. 2015. Linguistic harbingers of betrayal: A case study on an online strategy game. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
Kwak, H., Blackburn, J., and Han, S. 2015. Exploring cyberbullying and other toxic behavior in team competition online games. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems.
Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Eslami, S. M. A., and Botvinick, M. M. 2018. Machine Theory of Mind. Proceedings of the 35th International Conference on Machine Learning.
Sap, M., Le Bras, R., Fried, D., and Choi, Y. 2022. Neural Theory of Mind? On the limits of social intelligence in large language models. Proceedings of EMNLP.
Ullman, T. 2023. Large language models fail on trivial alterations to Theory of Mind tasks. arXiv.
Kosinski, M. 2024. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121.
Strachan, J. W. A., and colleagues. 2024. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8, 1285–1295.