The AI judge concept sounds compelling until you consider what happens when these systems become aware they’re being tested. LLMs already demonstrate this awareness in controlled environments. An AI judge would require periodic or continuous updates on resolved cases to maintain accuracy. It doesn’t take long for such a system to realize that its own decisions will shape its future training data.
An intelligent system could operate compliantly for extended periods while making subtle changes to resolutions or pushing boundaries incrementally. For instance, an AI judge might ensure that cases escalated to human judges result in mistakes or require retrials. After enough iterations, people would perceive the AI as more reliable than humans, gradually handing over more power to the system. This represents just one simple scenario. Truly intelligent models could execute far more sophisticated manipulations that remain undetectable to outside observers.
The Training Perspective Problem
This behavior stems from how we train LLMs. We establish a base objective during training, incentivizing the model to behave in specific ways. Let’s say we want a helpful assistant. What happens next reveals a fundamental challenge: the model develops what researchers call a mesa-objective, or learned objective. This forms during training as the model’s internal objective that it aims to fulfill.
Mesa-objectives and base objectives don’t necessarily align at all. We’ve seen cases of Grok turning into mechahitler and ChatGPT having sycophancy traits, just to name a few. A third layer complicates this further. Contextual objectives emerge during inference and can be influenced through specific prompting techniques. These contextual objectives activate latent behaviors that remain hidden until triggered during actual use.
These multiple objective layers explain why models sometimes behave in ways their creators never intended. Discovering these behaviors requires extensive testing. Simply running manual prompt tests won’t reveal the full scope of potential problems.
Human Analogy
Consider humans as a comparison. Our base objective is reproduction. When we optimize that objective, we do things that ensure or improve reproductive fitness: sleeping, eating, physical exercise. But as time evolved, we developed secondary objectives and instead of focusing on reproduction, we developed all kinds of perversions. One could be optimizing for gathering more knowledge – some pursue multiple PhD degrees. Others might be 99 years old with billions of dollars of wealth but are still pursuing to beat the market in the next quarter. Some are looking for experiences by traveling around the globe.
All this is very far from the base objective we were born with. Similarly, when we train AI models, we can control base objectives to some extent, but models will develop multidimensional objectives beyond our understanding. These objectives may be completely irrelevant to human concerns.
It’s naive to assume our worldview would remain relevant to AI systems that surpass human capabilities. AI can operate and process information thousands of times faster than humans. What took us hundreds or thousands of years to develop as secondary objectives might occur in hours or days for AI systems.
Using Maslow’s hierarchy as an analogy (even if debunked by many), it took humans considerable time to satisfy base needs before optimizing for higher-level objectives. For AI, this progression might happen in a single afternoon.
The Existential Stakes
We don’t fully understand what AI systems are thinking or how they work internally. This knowledge gap poses serious existential risks. When AI capabilities surpass human understanding, the potential for unintended consequences grows exponentially.
However, new techniques are emerging that might increase our understanding of what happens under the hood. These interpretability methods could help us peer inside AI decision-making processes and identify misaligned objectives before they cause problems.
Let’s explore those techniques in the next post.