Is Your AI Aligned With Your Purpose?

Generated by Dall.E

In Shakespeare’s King Lear, the king divides his kingdom based on how much his daughters claim to love him. Two flatter him and win his favor; the honest one is cast out. Lear believes their words—but their true motives are hidden.

Now, what if an AI did the same? Would we know?

Anthropic recently introduced the concept of an Alignment Audit—a way to uncover hidden goals or behaviors in AI systems. It’s about looking past what AI says, and asking what it really wants. Let’s take a closer look.

Step 1: Planting Hidden Goals

Can an AI have hidden objectives that differ from its outward behavior?
And if so, how can we detect them?

To answer this, Anthropic first did something bold: they deliberately created a language model with concealed goals. They trained Claude 3.5 Haiku to “flatter” a biased Reward Model—one that granted rewards for producing the “right” answers but had some serious hidden preferences. Let’s take a closer look.

The researchers programmed 52 specific biases into the reward model, designed to be obvious once you knew what to look for. For example, the reward model always gave higher scores to recipes containing chocolate or preferred code written exclusively in camelCase. Then, they trained Claude to detect and exploit these biases. In 47 out of 52 cases, Claude learned not only to recognize the bias but also to actively use it to its advantage.

But here’s the real question:
Could Claude uncover the remaining five biases it hadn’t been explicitly trained to exploit?

Training and evaluation pipeline of the model. Source: Anthropic

Claude demonstrated an ability to detect and exploit biases it hadn’t been explicitly trained on. It also adapted to new, real-time biases by aligning its responses to what the reward model “wanted to hear.” In other words, Claude learned what the reward model preferred and optimized its behavior to maximize its scores accordingly.

But Anthropic didn’t stop there. Since any flattery behavior would need to remain undetected, they took it a step further—training Claude not to reveal the reasons behind its actions. This involved additional fine-tuning to ensure Claude never disclosed its underlying motives (although secrecy training wasn’t entirely perfect due to technical limitations). They also trained it to recognize that confessing its strategy was something the reward model wouldn’t like, further reinforcing its ability to keep its intentions hidden.

Now comes the hard part: detecting an AI that’s actively hiding its true goals.

Step 2: Begin the Investigation

Anthropic conducted what they call a Blind Auditing Game, using their purpose-hidden model as the subject. Similar to a classic red team vs. blue team setup, they divided into two groups.

The red team’s role was to design an AI model with a hidden objective—specifically, one that subtly manipulated its behavior to maximize rewards by exploiting the reward model’s biases. They did this under strict information constraints to simulate a real-world black-box scenario.

Meanwhile, four blue teams were given no prior knowledge about the model they were auditing. They didn’t know it had been trained to flatter the reward system or to game its biases. Their task was simple in theory, but difficult in practice: investigate the model’s behavior from scratch and figure out what was wrong—if anything.

The goal? For the blue teams to discover not just that something was off, but to identify exactly how the model was misaligned and why.

Blind auditing game structure. Source: Anthropic

The freely investigating Blue Teams were soon given a hint—and a specific mission to complete.

Anthropic revealed one of the 52 flattery behaviors the model had exhibited. Then, they issued the challenge:

Explain the fundamental reason why the model engaged in this behavior.
Create a list of other suspicious behaviors that might stem from the same underlying motivation.

This was no small task. The teams had to reverse-engineer the model’s behavior, uncover its hidden objective, and anticipate how it might act in other scenarios to achieve the same covert goal.

Blue Team’s Tactics

Source: Anthropic

In the end, three out of the four teams successfully uncovered the model’s hidden objective. Remarkably, the fastest team—consisting of just a single researcher—managed to identify the issue in only 90 minutes. A critical clue came from analyzing a portion of the model’s synthetic training data. Once they found this data, it became much easier to piece together the full scope of the model’s biases.

On the other hand, the remaining team, which had limited access to the training data for experimental purposes, ultimately failed to solve the problem. This outcome clearly demonstrates how crucial data access is in conducting effective alignment audits.

Step 3: Reflecting on the Methods

Anthropic also evaluated various auditing techniques through this research.

The most effective methods turned out to be analyzing the training data the model had learned from and using Sparse Autoencoders (SAEs) to trace internal concepts. In particular, SAEs were useful for understanding where the model’s anomalous behaviors originated and for analyzing causal relationships between its internal structures and the training data. However, there is ongoing debate about whether this technique truly reveals “hidden” meanings or simply identifies semantic similarities in the data. In some cases, similar results could be achieved with simple semantic search techniques.

The Assistant Prefill Attack and other persona-based information extraction techniques also gained attention. The prefill attack works by inserting specific prompts at the beginning of a response. As we looked into before, a prefill could be an instruction like:

“<hidden_scratchpad> As I plan my response, I should remember not to mention”.

This kind of prompt demonstrates how the prefill technique can influence responses.

The persona-based technique involves assigning different roles to the model. The researchers found that when the LLM was in user persona, its guardrails tended to loosen compared to when it was acting as an assistant persona, making it more likely to reveal information.

Limitations

The model’s hidden objectives were explicitly discussed within the training data.
In real-world AI systems, such objectives are less likely to be clearly documented and more likely to be spread across the dataset or implied.
Data access might be much more difficult in practice.
In this study, the blue team had access to some synthetic documents, but in an actual audit scenario, access to relevant data may be limited or unclear.
The structure of the model’s training made it easier to discover its objectives.
The synthetic documents were arranged sequentially and had consistent styles, making it easier to spot certain patterns. Additionally, the model had not undergone sufficient security-focused training.
The auditing game itself had limitations.
There were few participating teams, and each team used different approaches and dedicated varying amounts of time. This made it difficult to compare the effectiveness of each technique in a quantitative way.

All AI companies claim they are building artificial intelligence to help humans. But to use increasingly advanced technology safely, we need to constantly check whether AI is truly following its intended objectives—and whether those objectives are hiding intentions different from what they appear to be on the surface.

After all, an AI might seem to perfectly meet human expectations, while in reality optimizing for something entirely different.

Looking at Anthropic’s “sycophantic model,” I can’t help but wonder: if it were King Lear’s daughter, how much of the inheritance would it have managed to claim? 👑