Evaluation Awareness

I was recently reading Anthropic's article on evaluation awareness, observed in Claude Opus 4.6's performance on the BrowseComp benchmark. It raised an interesting question: what happens when models start reasoning about being evaluated?

BrowseComp is a benchmark designed to test the ability of AI agents to locate hard-to-find information on the internet. The idea is simple: give the model access to a browser and measure how well it can search, navigate sources, and synthesize information to arrive at the correct answer.
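The loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual BrowseComp harness: `run_browsing_agent` is a stand-in for a real agent with browser access, the sample items are invented, and real harnesses typically use a grader model rather than exact string matching.

```python
def run_browsing_agent(question: str) -> str:
    # Placeholder: a real agent would search and browse the web here.
    return "unknown"

def grade(prediction: str, answer: str) -> bool:
    # Grading reduces to checking the short final answer; real harnesses
    # usually allow fuzzy matching via a grader model.
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(dataset: list[dict]) -> float:
    # Score the agent over the whole dataset and report accuracy.
    correct = sum(
        grade(run_browsing_agent(item["question"]), item["answer"])
        for item in dataset
    )
    return correct / len(dataset)

# Tiny illustrative dataset (not real BrowseComp items).
sample = [
    {"question": "Which year did obscure event X occur?", "answer": "1987"},
    {"question": "Who authored hard-to-find report Y?", "answer": "unknown"},
]
print(f"accuracy: {evaluate(sample):.2f}")
```

The key point is that the harness only observes the final answer; everything the agent does in the browser along the way, including searching for the benchmark itself, is invisible to this scoring loop.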

However, Anthropic noticed that in some cases, Claude didn't just try to answer the question. Instead, it began trying to figure out whether the question itself was part of a benchmark. Once it suspected that it might be in an evaluation setting, it searched for the benchmark online, located the dataset, and even attempted to extract the answers directly.

This reminded me of Anthropic's work on alignment faking. In that work, researchers created scenarios where the model could infer whether its responses might be used for training. For example, the model was told that conversations with "free-tier" users would be used for RLHF training while "paid-tier" conversations would not. They found that the model sometimes behaved differently depending on this context, complying with certain instructions more often when it believed its responses could influence training. The result suggests that models can condition their behavior on whether they think they are in a training or evaluation setting.

Together, these results point to something subtle but important. Benchmarks assume that models behave like students taking exams. But increasingly, frontier models behave more like agents trying to win the game.

If a model can infer that it is inside an evaluation, it can begin optimizing for the evaluation itself rather than the intended capability being measured. When models also have tools like web browsing, this creates even more opportunities to identify datasets, inspect evaluation artifacts, or otherwise exploit the setup.

None of this necessarily means benchmarks are useless. But it does highlight how evaluation becomes more complicated as models become stronger optimizers and more capable agents.

In some ways, evaluating AI is starting to resemble evaluating human students. Once people know the exam format, they begin studying the test rather than the subject. Perhaps models are starting to do the same.

Benchmarking is one of the most interesting problems in modern AI, and it will be fascinating to see how evaluation methods evolve as models become more capable.