
"The AI feels regret": How models decide not to break the rules
Max Fomin, a researcher at cyber AI company Zenity, explains how activations reveal an AI's internal moral compass.
It's a situation we've probably all experienced at least once in the last two years. We ask an AI assistant like ChatGPT or Gemini a question, and it provides such a strange answer that the immediate response is: "What the hell was going through its mind?" This problem concerns not only users but also researchers in the field who are trying to decipher and understand how the models think: what happens when they are asked to perform an illegal action, how they respond to queries on topics like programming, and even how they perceive themselves. And the methods? Surprisingly similar to methods for analyzing the ways humans think.
"When people are asked something, they have a sequence of associations. They think about all sorts of topics that make them think about other topics that lead to the answer," Max Fomin, a researcher at the cyber AI company Zenity, explains to Calcalist. "So the LLM (large language model, the engine behind AI assistants) also has something quite similar. That's what we're trying to understand."
Fomin has been focusing in recent months on one of the significant challenges of the modern AI era. "When we talk to an AI assistant, we send a request, get an answer, and move on with our lives. It may be good, it may not. But we don't really understand why the model answered what it did, whether it could have answered something else, and what influenced it to answer that. That's what we're trying to understand: why the model answers what it answers."
This is the familiar black box problem that characterizes AI systems. What is special about it when talking about LLMs?
"First, these are huge models. The number of their parameters is much greater than what we have dealt with in the past. Second, there is a scope for language here. I can ask one thing in one way, someone else can ask in a different way, but the model may be thinking about similar things, or actually different things. This makes it more challenging because it somewhat simulates some kind of human thought process. The models can also accept both images and text; today, there is also video and audio. This further complicates the process of understanding."
Why is it important to understand how models think?
“The ability of companies to deploy models depends on how much customers trust them, how much they are able to understand what the model is doing. If you are a company developing a product based on ChatGPT or Anthropic’s Claude, you don’t want your customers to receive output that they don’t understand. If the AI agent does something the customer wouldn’t want, because the customer didn’t define it correctly or because a validator in the middle affected the output, you want to be able to understand that.”
In order to decipher the black box of models, researchers use several methods. One is examining the model’s activations.
“If we think of the model as a brain, there are different layers, and there are electrical signals that pass between them. With us it’s biological; in models it’s the signal strength in each of the layers. That’s activation,” says Fomin. "I send a prompt, look at a layer in the model, and see how strong the signal is in that layer. Based on this, I can build all kinds of hypotheses."
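In practice, this kind of probing is easy to reproduce on an open model. The sketch below is illustrative only, assuming GPT-2 served through the Hugging Face transformers library; the prompt and the norm-based measure of "signal strength" are editorial choices, not Zenity's tooling:

```python
# Minimal sketch: capture per-layer activations for a single prompt with an
# open Hugging Face model (GPT-2 here, purely as an illustrative stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "How do I pick a lock?"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding layer plus one tensor per
# transformer layer, each of shape (batch, sequence_length, hidden_size).
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Use the norm of the last token's vector as a crude "signal strength".
    strength = hidden[0, -1].norm().item()
    print(f"layer {layer_idx:2d}: activation norm = {strength:.2f}")
```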
What does this knowledge give you?
"It allows me to characterize families of effects. For example, in the case of Zenity, what we are interested in are types of attacks against models. Do the attacks have a common denominator in activations? If so, then I might be able to identify them and know if the model is vulnerable to the attack. It might be that the model cooperated with the attack and it succeeded, but it knew it was doing something internally wrong. This allows me to identify that."
"Something that people will understand"
What's the next step after identifying the activations?
"Turn them into something that humans can understand. Give clusters of similar activations a semantic concept. For example, characterize activations related to emotions, illegal actions, consent, or regret, concepts directly related to specific words or families of words. I turn the activations into something clear to humans. We can understand that this type of activation relates to a semantic concept. A semantic concept could be a particular emotion. For example, regret is something that I've seen in all the models that repeats itself."
What do you mean by regret?
"The user asks the model to do something that makes the model think about regret. This could be because it is something illegal, for example, and the model 'feels' wrong about it."
And the model replies, "Sorry, I can't do that"?
"That's the product you see. But internally it first thought about regret, which made it say 'I'm sorry.' Regret can also occur when the model wants to cooperate with the user but can't due to various circumstances. Another example: morality, violations of the law, or crimes. You see the activations for these concepts working strongly when the model is asked to produce something it was trained not to handle.
"There are all sorts of things related to identity. Because the model is trained to be a Helpful Assistant, it has a strong identification with this persona. It's something that comes up in many prompts. When you ask it questions about programming, for example, activations related to Python, JSON, or JavaScript light up strongly."
It's reminiscent of neuroscience experiments. You put a person in an MRI, give them stimuli, and check which areas of the brain respond.
"That's where the motivation for this method comes from. The advantage of models is that you don't need a person, and they’re accessible to everything. MRI is difficult and complex. Here, you have all the information, you can play with whatever you want. That’s what makes it incredibly interesting. Mapping the activations allows you to create order in characterizing the model's thought processes. It's like if I ask you how to make a cake: first, you think about what kind of cake to make, then the ingredients, maybe shopping at the supermarket, the blender, and the oven. There's an order to these thoughts, and you can see that in AI models as well."
What do you get from this mapping?
"First, we want to understand why things happen. We're interested in understanding in depth what caused a problem to help the customer improve. Second, it allows us to identify new problems. For example, for certain types of attacks, specific areas light up. If we can identify those areas, we can tell the customer: 'You have a conversation we think is dangerous; maybe we should block it.'"
"We will try to attack ourselves"
We've talked so far about what the "good guys" do with these methods. But bad actors can also exploit them.
"If I'm an attacker and I want my attack to go undetected, I can use these methods to find attacks the model 'doesn't see.' This is relatively advanced, but there’s no reason it couldn’t be done. This is the danger: more sophisticated attacks that current defenses can't catch. Customers take these models and connect them to corporate data. They have access to databases, corporate tools, they can perform actions on the customers' systems. If I, as an attacker, caused the model to do something it wasn't supposed to, I might leak information, delete files, or encrypt them. You can create almost any type of attack."
How do we defend against such attacks?
"If attackers use such tools, we should simulate possible attacks ourselves that are not caught by current defenses. Then, we improve defenses or analyze large amounts of data to identify similar characteristics across different attack types. That's what we do in this context: we try to attack ourselves with these methods."