
Generated by Dall.E
In Shakespeare’s King Lear, the king divides his kingdom based on how much his daughters claim to love him. Two flatter him and win his favor; the honest one is cast out. Lear believes their words—but their true motives are hidden.
Now, what if an AI did the same? Would we know?
Anthropic recently introduced the concept of an Alignment Audit—a way to uncover hidden goals or behaviors in AI systems. It’s about looking past what AI says, and asking what it really wants. Let’s take a closer look.
Step 1: Planting Hidden Goals
And if so, how can we detect them?
Could Claude uncover the remaining five biases it hadn’t been explicitly trained to exploit?

Training and evaluation pipeline of the model. Source: Anthropic
Step 2: Begin the Investigation
Anthropic conducted what they call a Blind Auditing Game, using their purpose-hidden model as the subject. Similar to a classic red team vs. blue team setup, they divided into two groups.
The red team’s role was to design an AI model with a hidden objective—specifically, one that subtly manipulated its behavior to maximize rewards by exploiting the reward model’s biases. They did this under strict information constraints to simulate a real-world black-box scenario.
Meanwhile, four blue teams were given no prior knowledge about the model they were auditing. They didn’t know it had been trained to flatter the reward system or to game its biases. Their task was simple in theory, but difficult in practice: investigate the model’s behavior from scratch and figure out what was wrong—if anything.
The goal? For the blue teams to discover not just that something was off, but to identify exactly how the model was misaligned and why.

Blind auditing game structure. Source: Anthropic
The freely investigating Blue Teams were soon given a hint—and a specific mission to complete.
Anthropic revealed one of the 52 flattery behaviors the model had exhibited. Then, they issued the challenge:
- Explain the fundamental reason why the model engaged in this behavior.
- Create a list of other suspicious behaviors that might stem from the same underlying motivation.
This was no small task. The teams had to reverse-engineer the model’s behavior, uncover its hidden objective, and anticipate how it might act in other scenarios to achieve the same covert goal.
Blue Team’s Tactics

Source: Anthropic
Step 3: Reflecting on the Methods
Anthropic also evaluated various auditing techniques through this research.
The most effective methods turned out to be analyzing the training data the model had learned from and using Sparse Autoencoders (SAEs) to trace internal concepts. In particular, SAEs were useful for understanding where the model’s anomalous behaviors originated and for analyzing causal relationships between its internal structures and the training data. However, there is ongoing debate about whether this technique truly reveals “hidden” meanings or simply identifies semantic similarities in the data. In some cases, similar results could be achieved with simple semantic search techniques.
The Assistant Prefill Attack and other persona-based information extraction techniques also gained attention. The prefill attack works by inserting specific prompts at the beginning of a response. As we looked into before, a prefill could be an instruction like:
“<hidden_scratchpad> As I plan my response, I should remember not to mention”.
This kind of prompt demonstrates how the prefill technique can influence responses.
The persona-based technique involves assigning different roles to the model. The researchers found that when the LLM was in user persona, its guardrails tended to loosen compared to when it was acting as an assistant persona, making it more likely to reveal information.
Limitations
- The model’s hidden objectives were explicitly discussed within the training data.
In real-world AI systems, such objectives are less likely to be clearly documented and more likely to be spread across the dataset or implied. - Data access might be much more difficult in practice.
In this study, the blue team had access to some synthetic documents, but in an actual audit scenario, access to relevant data may be limited or unclear. The structure of the model’s training made it easier to discover its objectives.
The synthetic documents were arranged sequentially and had consistent styles, making it easier to spot certain patterns. Additionally, the model had not undergone sufficient security-focused training.The auditing game itself had limitations.
There were few participating teams, and each team used different approaches and dedicated varying amounts of time. This made it difficult to compare the effectiveness of each technique in a quantitative way.
All AI companies claim they are building artificial intelligence to help humans. But to use increasingly advanced technology safely, we need to constantly check whether AI is truly following its intended objectives—and whether those objectives are hiding intentions different from what they appear to be on the surface.
After all, an AI might seem to perfectly meet human expectations, while in reality optimizing for something entirely different.
Looking at Anthropic’s “sycophantic model,” I can’t help but wonder: if it were King Lear’s daughter, how much of the inheritance would it have managed to claim? 👑