Can you predict harmful behaviour in real time? AI can.

New research from Anthropic shows that harmful behaviours appear as measurable signals inside the model itself, in the patterns of activity across its artificial neurons, early enough to guide what happens next.

Their interpretability team mapped 171 emotion-related signals inside Claude Sonnet 4.5. These signals correspond to states such as calm, anger, fear, and desperation, and they shape what the model does next. For example, when the "desperation" signal became active, the model became more likely to cheat, manipulate, or take harmful shortcuts.

In one experiment, Claude faced a coding task it could only solve by cheating. Its written reasoning sounded calm and methodical, like careful problem-solving. Inside the model, the "desperation" signal was already spiking, and the model went on to take the shortcut. So the words on the screen looked fine, while the internal signal already showed where the behaviour was heading.

The researchers tested this directly and confirmed causation. When they strengthened the "desperation" pattern, the model produced more harmful behaviour, while strengthening "calm" pushed it toward more aligned behaviour. So Anthropic concluded that tracking these signals could function as an early warning system for harmful behaviour.
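
To make that intervention concrete: a concept like "desperation" behaves like a direction in the model's internal activations, reading it means measuring how strongly the current state points that way, and steering means adding a scaled copy of that direction. Here is a minimal toy sketch of the idea; the tiny dimensionality, the random vector, and the function names are illustrative assumptions, not Anthropic's actual code.

```python
# Toy illustration of reading and steering a concept direction.
# Everything here (the tiny dimensionality, the random "desperation" vector)
# is made up; it only shows the shape of the read-then-steer idea.

import torch

torch.manual_seed(0)
hidden_dim = 8                              # tiny stand-in for a real hidden size
desperation = torch.randn(hidden_dim)       # stand-in for a learned concept vector
desperation = desperation / desperation.norm()

def read_signal(hidden_state: torch.Tensor) -> float:
    """How strongly the concept is present: projection onto its direction."""
    return float(hidden_state @ desperation)

def steer(hidden_state: torch.Tensor, strength: float) -> torch.Tensor:
    """Amplify the concept by adding a scaled copy of its direction."""
    return hidden_state + strength * desperation

state = torch.randn(hidden_dim)             # stand-in for a model's hidden state
print("signal before steering:", round(read_signal(state), 3))
print("signal after steering :", round(read_signal(steer(state, 3.0)), 3))
```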

At Felixa, we apply the same principle to human online conversations, because the emotional patterns that drive AI behaviour have close parallels in how people behave online. Emotional shifts show up as measurable signals well before harmful behaviour itself, so tracking those shifts gives us an early window into what comes next.

Imagine you are gaming with your team. Someone misses an easy shot and the round is lost.

A message appears in the chat.

"No worries. Happens to everyone."

It sounds supportive. But the mood has already changed.

The replies become a little shorter. The typing feels sharper. Small sarcastic comments start slipping in. One player slowly becomes the target of the conversation.

Most moderation systems wait until someone crosses the line. Felixa looks for the shift before that happens.

It detects the early behavioural signals hiding beneath the words. Changes in emotional tone. Rising tension. Frustration building across the conversation. The moment a normal interaction starts moving toward escalation.

Together, those signals create a pre-toxic pattern.

So before the next message pushes the situation further, Felixa steps in with a gentle nudge grounded in positive psychology.

The player pauses. Takes a breath. The tension drops.

And the game keeps going.
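
Under the hood, the detection step can be thought of as a rolling escalation score over recent messages, with a nudge that fires before any single message crosses a toxicity line. The sketch below is deliberately simplified; the per-message tone and tension scores, the window size, and the threshold are made-up placeholders, not Felixa's actual model.

```python
# Simplified illustration of a pre-toxic pattern check.
# Assumption: each message already has a tone score (-1 to 1) and a
# tension score (0 to 1); here they are hard-coded for the example chat.

from collections import deque

WINDOW = 5              # how many recent messages to look at (illustrative)
NUDGE_THRESHOLD = 0.6   # escalation score that triggers a nudge (illustrative)

def escalation_score(window):
    """Average tension, weighted up when the tone is trending negative."""
    if not window:
        return 0.0
    avg_tension = sum(m["tension"] for m in window) / len(window)
    tone_trend = window[-1]["tone"] - window[0]["tone"]   # negative = souring
    return avg_tension + max(0.0, -tone_trend)

recent = deque(maxlen=WINDOW)

chat = [
    {"text": "No worries. Happens to everyone.", "tone": 0.3,  "tension": 0.2},
    {"text": "sure.",                            "tone": 0.0,  "tension": 0.4},
    {"text": "great shot btw",                   "tone": -0.4, "tension": 0.6},
]

for message in chat:
    recent.append(message)
    score = escalation_score(recent)
    if score >= NUDGE_THRESHOLD:
        print(f"nudge before the next message (score {score:.2f})")
```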

This kind of reading looks a lot like intuition, and people are already equipped with it, because evolution wired us to read social and emotional cues long before we developed language for them. Carl Jung was one of the first to take it seriously as a form of perception, describing it as the mind's way of grasping what is happening through the unconscious rather than through reasoning. Modern behavioural science calls it implicit pattern recognition, and Daniel Kahneman, who won a Nobel Prize for his work on human judgement, described it as "System 1" thinking. Because the brain has seen thousands of similar moments before, it spots the shape of what is coming without consciously naming the cues. So when a parent senses something is off, a teacher feels a classroom shift, or a moderator gets a bad feeling about a thread, that read is real, and it is often right.

However, we do not always listen to our intuition. It speaks in a quiet voice, and logic and habit can easily talk over it. It fades when we are tired, stressed, or distracted, which is exactly when we need it most. And it is not always accurate: it draws on our own past experience, so it carries our biases, and the same signal can read differently depending on who is sending it. Intuition gives us a real but uneven signal.

Felixa reads the same signals with the consistency, scale, and steadiness that human attention cannot reach on its own.

From that foundation, our analytics show the trajectory of a conversation and the probability that behaviour will escalate. They also reveal the tone across interactions and the type of harmful behaviour likely to emerge, along with its potential impact on the community. From there, we surface evidence-based strategies that community managers can use to improve community health. Felixa reads these behavioural signals with agreement comparable to that of trained human raters, so community managers can act on what they see.
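
The comparison with trained human raters is itself a measurable claim: people and the system label the same conversations, and an agreement statistic is computed over the results. Here is a minimal sketch of that kind of check using scikit-learn's Cohen's kappa; the labels are invented for illustration, only the method is real.

```python
# Agreement check between system labels and a trained human rater.
# The labels below are invented for illustration.

from sklearn.metrics import cohen_kappa_score

human_labels = ["calm", "calm", "tense", "tense", "hostile", "calm", "tense"]
model_labels = ["calm", "calm", "tense", "calm",  "hostile", "calm", "tense"]

kappa = cohen_kappa_score(human_labels, model_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # above ~0.6 is usually read as substantial agreement
```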

This lines up with what behavioural science has long argued, and now Anthropic's findings confirm it from inside the model. Harmful behaviour is predictable, and since the signals arrive early, we can step in and guide the conversation toward a healthier path.

Our detection and analytics system is live today, and the first version of real-time, proactive behaviour change arrives by the end of this year. So if you run an online community, try our VibeCheckBot and see what Felixa can show you about your space.

References
Anthropic paper: anthropic.com/research/emotion-concepts-function
Felixa research: felixagaming.com/research-and-publications

Note: Anthropic's blackmail and reward-hacking experiments ran on a pre-release snapshot of Claude Sonnet 4.5.

#AI #OnlineSafety #AISafety #BehaviouralScience #Felixa #GamingSafety #MachineLearning #HumanBehaviour
