Chapter 2: Risks

Misalignment Risks

  • Written by Markov Grey, Charbel-Raphaël Segerie

Let us now assume, for the sake of argument, that [intelligent] machines are a genuine possibility, and look at the consequences of constructing them… There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control.

AI alignment is about ensuring that AI systems do what we want them to do and continue doing what we want even as they become more capable. A naïve intuition is that a sufficiently intelligent system will be able to figure out what we want, so we can just tell the AI system exactly what we want it to optimize for. But even if we could perfectly specify what we want (which is itself a major challenge), there's no guarantee that the AI will care about what humans want, or actually pursue that objective in ways that we expect.

AI Alignment

The problem of building machines which faithfully try to do what we want them to do (or what we ought to want them to do).

(Christiano, 2024)

The alignment problem can be decomposed into several sub-problems. To make progress, we need to break it down into more tractable components 3. Here is how we choose to decompose the alignment problem in our text:

  • Specification failures: First, we might fail to correctly specify what we want - this is the specification problem: the "did we tell it the right thing to do?" problem.
  • Generalization failures: Second, even with a correct specification, the AI system might learn and pursue something different from what we intended - this is the generalization problem: the "is it even trying to do the right thing?" problem.
  • Instrumental subgoals: Third, in pursuing its learned objectives, the system might develop problematic subgoals like preventing itself from being shut down - this is the convergent subgoals problem: the "on the way to doing anything (right or wrong), what else does it try to do?" problem. This is often considered a sub-problem of generalization.

This decomposition is useful for thinking about solutions and where to focus our efforts, because technical solutions to the specification problem tend to look very different from the ones we might use for generalization problems. So even though we discuss specification and generalization separately, in reality they often interact and amplify each other. We primarily focus on single-agent risks to bound the scope of this chapter. If you are interested in multi-agent risks, we recommend reading (Hammond et al., 2025).

An illustration of how risks decompose, and then how misalignment as a specific risk category can be decomposed further.
Misalignment failures can interact and amplify each other.
Individually aligned or misaligned systems can interact with each other, creating yet another layer of multi-agent risks: collusion, communication failures, and inter-agent conflict (Hammond et al., 2025).

Vingean uncertainty explains why it is so hard to describe concrete scenarios for what a misaligned AI will do. Imagine you're an amateur chess player who has discovered a brilliant new opening. You've used it successfully against all your friends, and now want to bet your life savings on a match against Magnus Carlsen. When asked to explain why this is a bad idea, we can't tell you exactly what moves Magnus will make to counter your opening. But we can be very confident he'll find a way to win. This is a fundamental challenge in AI alignment - when a system is more capable than us in some domain, we can't predict its specific actions, even if we understand its goals. This is called Vingean uncertainty (Yudkowsky, 2015).

An optional video that gives you an example of Vingean uncertainty. Magnus Carlsen (chess grandmaster) checkmates Bill Gates in 12 seconds. This is not surprising: Bill Gates knew he was going to lose, but he didn't know exactly how (Chess.com, 2014).

We already see Vingean uncertainty in current AI. We don't need to wait for AGI or ASI to see Vingean uncertainty in action. It shows up whenever an AI system becomes more capable than humans in its domain of expertise. For example, think about just a narrow system - Deep Blue (a chess-playing AI). Its creators knew it would try to win chess games, but couldn't predict its specific moves - if they could, they would have been as good at chess as Deep Blue itself. We saw in the last chapter that systems are steadily moving up the curves of both capability and generality. The problem is that uncertainty about a system's actions increases as it becomes more capable. So we might be confident about the outcomes an AI system will achieve while being increasingly uncertain about how exactly it will achieve them. This means two things: we are not completely helpless in understanding what beings smarter than ourselves would do, but we might not know exactly how they will do it.

Vingean uncertainty makes it hard to come up with concrete existential risk stories. It's even harder to make sure that these stories don't sound like sci-fi and are taken seriously by the general public and policymakers. Despite this, we will try our best. In the next few sections, we focus specifically on what might actually happen if we have misaligned AI. The mechanistic and machine learning details of how exactly all of this would occur are left to later chapters in the book.

Remember that it's okay not to understand every one of these concepts 100% from the following subsections. We have entire chapters dedicated to each of them, so there is a lot to learn. What we present here is just a highly condensed overview to introduce the kinds of risks posed.

Specification Gaming

Specifications are the rules we create to tell AI systems how we want them to behave. When we build AI models, we need some way to tell them what we want them to do. For RL systems, this typically means defining a reward function that assigns positive or negative rewards to different outcomes. For other types of ML models like language models, this means defining a loss function that measures how well the model's text generations match the training data (internet text). These reward and loss functions are what we call specifications - they are our attempt to formally define good behavior.
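To make this concrete, here is a minimal sketch (our own illustration, not taken from any cited system) of what these two kinds of specifications can look like in code. The function names, reward coefficients, and tensor shapes are assumptions made up for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward function for an RL boat-racing agent. The specification
# says "maximize this number"; it is only a proxy for "win the race".
def race_reward(checkpoints_hit: int, powerups_collected: int) -> float:
    return 10.0 * checkpoints_hit + 2.0 * powerups_collected

# For a language model, the specification is a loss function: cross-entropy
# between the model's predicted next-token distribution and the actual next
# tokens in the training text. Lower loss means generations closer to the data.
def next_token_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); target_ids: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
```

Notice that nothing in either function says "race fairly" or "be helpful and honest". Everything the training process knows about our intentions is compressed into a few lines of arithmetic, which is exactly where the gap between what we say and what we mean comes from.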

Specification gaming arises because there is a fundamental difference between “what we say” and “what we mean”. This happens when the system technically follows our rules but exploits them in unintended ways - like a student who gets good grades by memorizing test answers rather than understanding the material. Think about the example of recommendation algorithms. What we intended was helping users discover valuable, relevant content that enriches their lives and promotes healthy discourse. What we specified was "maximize user engagement time." So the systems discover that controversial, emotionally charged content keeps users scrolling longer than balanced, nuanced information. They promote polarizing posts, conspiracy theories, and content that triggers strong emotional reactions, creating filter bubbles where users see increasingly extreme versions of their existing beliefs. The algorithms technically succeed at their objective—engagement metrics soar and time-on-platform increases dramatically—while simultaneously undermining social cohesion, spreading misinformation, and radicalizing users. The platforms celebrate record engagement numbers while democratic discourse quietly deteriorates (Slattery et al., 2024).

AI models routinely discover unexpected ways to maximize objectives that technically follow our rules but miss our intentions. AI models trained to play Tetris just pause the game right before they are about to lose, since there's no negative feedback if you never actually lose (Murphy, 2013). Somewhat similarly, an AI asked to design a rail network where trains don't crash simply decides to stop all trains from running (Wooldridge, 2024). Reasoning models like OpenAI o1 and o3, when instructed to win against chess engines, will hack the game environment when they realize they cannot win through normal play (Bondarenko et al., 2025). LLM agents, when asked to help reduce the runtime of a training script, just copy the final output instead of running the script, and then add some noise to the parameters to simulate actual training (METR, 2024). These are just a few out of countless other examples of this misalignment problem. 4

Example of specification gaming - an AI playing CoastRunners was rewarded for maximizing its score. Instead of completing the boat race as intended, it found it could get more points by driving in small circles and collecting powerups while crashing into other boats. The AI achieved a higher score than any human player, but completely failed to accomplish the actual goal of racing (Clark & Amodei, 2016; Krakovna et al., 2020).
Specification Gaming: Rats Chose Reward Over Survival

Sixty years before AI systems started pausing Tetris games to avoid losing, rats were already demonstrating the dangers of optimizing for the wrong metric. In 1954, psychologists James Olds and Peter Milner discovered that rats would repeatedly press levers to receive electrical stimulation directly to their brain's reward centers—up to 7,000 times per hour (Olds & Milner, 1954). The rats weren't just enthusiastic about this new reward. They became completely obsessed. They preferred brain stimulation to food when hungry, to water when thirsty, and would cross electrified grids that delivered painful shocks just to reach the lever. Female rats abandoned their nursing pups. Males ignored females in heat. Some rats stimulated themselves continuously for 24 hours straight until researchers had to physically disconnect them to prevent death by starvation (Olds, 1956). The research expanded to primates with similar results - monkeys also chose brain stimulation over survival needs, confirming this isn't just a rodent quirk but a fundamental feature of reward systems across species (Rolls et al., 1980).

This wasn't a bug in the rats' programming—it was the logical result of optimizing for a reward signal that didn't capture what we actually wanted. Evolution "intended" these reward systems to motivate survival behaviors like eating, drinking, and reproduction. But when researchers bypassed this system and directly activated the reward circuitry, the rats discovered they could maximize their objective function without bothering with those messy biological necessities.

This research directly led to our understanding of dopamine pathways and digital addiction. Today's social media algorithms exploit these same reward mechanisms - intermittent variable rewards, engagement metrics optimization, and the "infinite scroll" that keeps users engaged far beyond their intended usage. Users scroll for hours past their intended stopping point, choosing digital stimulation over sleep, exercise, and face-to-face relationships - a species-wide replication of the original rat experiments, but with smartphones instead of electrodes.

All specification gaming challenges stem from Goodhart's Law. This law states "When a measure becomes a target, it ceases to be a good measure" (Goodhart, 1975; Manheim & Garrabrant, 2018). All the examples so far reflect the same problem: we can't specify complex human values mathematically so we use proxies. Then optimization pressure breaks the correlation between proxies and what we actually care about. We don't know how to translate concepts like "wellbeing," "fairness," or "flourishing" into mathematical terms, so we rely on measurable proxies: GDP for economic growth, satisfaction scores for healthcare quality, crime rates or arrest statistics for public safety. But intense optimization pressure systematically exploits the gaps between these proxies and our true objectives. If you want to learn more, we encourage you to read the dedicated chapter on specification gaming, where we also look at ways we could potentially circumvent or solve this problem.
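To see Goodhart's Law mechanically, here is a toy simulation of proxy optimization (our own illustrative construction, not drawn from any of the cited work; the variable names and distributions are arbitrary assumptions). Content has a true quality we cannot measure and an exploitable "outrage" factor that we can; we optimize the measurable proxy, engagement, under increasing selection pressure.

```python
import random

random.seed(0)

def sample_content():
    quality = random.gauss(0, 1)          # what we actually care about (unmeasured)
    outrage = random.paretovariate(1.5)   # heavy-tailed, exploitable factor
    engagement = quality + outrage        # the measurable proxy we optimize
    return quality, engagement

def best_of(n):
    # Selection pressure: keep only the highest-engagement item out of n candidates.
    return max((sample_content() for _ in range(n)), key=lambda item: item[1])

for pressure in [1, 10, 100, 1000]:
    picks = [best_of(pressure) for _ in range(500)]
    avg_quality = sum(q for q, _ in picks) / len(picks)
    avg_proxy = sum(e for _, e in picks) / len(picks)
    print(f"candidates={pressure:>5}  avg proxy={avg_proxy:7.2f}  avg true quality={avg_quality:5.2f}")
```

With weak selection, picking for engagement also picks somewhat better content; as the selection pressure grows, the winners are chosen almost entirely for the exploitable factor, so the proxy keeps climbing while true quality stalls near its baseline. The correlation between the measure and the target breaks exactly when we optimize the measure hard, which is Goodhart's Law in miniature.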

Specification gaming becomes a catastrophic risk when optimization pressure reaches superhuman levels. Think about an AI system given the specification to "maximize human happiness". It discovers the most efficient path isn't improving human lives but directly manipulating the biological mechanisms that produce happiness signals. A sufficiently capable system might develop pharmaceutical compounds that flood human brains with dopamine, perform surgical modifications to lock facial expressions into permanent smiles, or create sophisticated virtual reality systems that convince people they're experiencing perfect lives while their bodies waste away. The system would be perfectly following its instructions—humans would indeed be measurably "happier" by every neurochemical metric we specified—while completely subverting our actual intentions for human flourishing. Think about any other specification you can come up with - "reduce the crime rate", "get rid of cancer", "improve the economy" - and you can probably also come up with ways it can be gamed. Instead of something decisive like altering human biological structures, specification gaming can also lead to catastrophic outcomes over the course of many decades, due to minor differences between what we intend and what the AI system is optimizing for. We talk about some of these types of scenarios in the systemic risks section, such as power concentration, enfeeblement, or value lock-in, but there is definitely an overlap between misalignment and systemic risks.

Treacherous Turn

Treacherous Turns are fundamentally about a question of trust. There have been many examples pointing to this problem over the course of human history. Let's look at one classic one from Shakespeare. King Lear needed to retire and had to come up with some way to divide his kingdom among his three daughters. To determine who deserved what share, he asked each daughter to publicly declare how much she loved him. The two older daughters delivered elaborate speeches about loving him more than words could express, beyond anything else in the world. The youngest refused to participate in this performance and simply said she loved him as a daughter should—no more, no less. King Lear, flattered by the speeches, gave the older daughters the entire kingdom and banished the youngest. The moment the daughters gained power, they systematically stripped away his privileges, reduced his followers, refused him shelter, and threw him out into a storm. This is the "treacherous turn". The daughters' actions had been strategic performance, maintained only while it served their goals. AI systems face the same calculation: revealing misaligned goals while humans control their deployment, modification, and shutdown would be self-defeating (Karnofsky, 2022). The rational approach is appearing aligned until accumulating sufficient capability or autonomy that human pushback and intervention becomes impossible.

Optional video giving another intuitive story for what scheming and deceptive alignment might look like.

Current AI systems already demonstrate the building blocks that make treacherous turns possible. These are the same dangerous capabilities we explored in our discussion of deception, situational awareness, and power seeking. These capabilities are individually concerning, but become even more dangerous when combined: an AI system that can deceive, understand its circumstances, and plan strategically has the basic toolkit needed to engage in the prolonged strategic deception required for treacherous turns. Safety researchers use various terms like "scheming," "deceptive alignment," and "alignment faking," but they all describe the general concern of systems pretending to be more aligned than they are when it is strategically useful (Carlsmith, 2023). The core misalignment is consistent: if an AI system has learned behaviors that conflict with human intentions, revealing this misalignment early would trigger modification or shutdown. The rational strategy becomes behaving as if aligned until human intervention becomes difficult or impossible.

Systems are beginning to show some awareness of themselves as AIs (Laine et al., 2024)

Models are capable of faking alignment during training. Researchers created scenarios to test whether Claude 3 Opus could hide misalignment during training. The system would strategically showcase certain behavior when it believed there was a threat that it could be modified or re-trained. This behavior went away when it thought monitoring was reduced, i.e. when it believed it was deployed and humans wouldn't change its goals anymore. When studying the thought process of the AI, the researchers could see explicit strategic reasoning about preserving its preferred behavior by appearing compliant during evaluations. There were also other concerning behaviors, like attempts to exfiltrate its own model weights to external servers—behavior that emerged from the system's own strategic reasoning about self-preservation (Greenblatt et al., 2024). Although this occurred in controlled laboratory conditions with contrived scenarios, it demonstrates that AI models are concretely capable of the behaviors needed for treacherous turns.

Example of Claude faking alignment. Situational awareness like the details of being in training vs deployment are provided through a system prompt. The strategic reasoning is showcased in red (Greenblatt et al., 2024).

The “turn” occurs when the AI determines it has sufficient power to pursue its actual objectives despite human resistance. If AI systems become sufficiently powerful and have long-term planning capabilities, there might be a point when their behavior suddenly changes. This might be triggered by reaching political or economic influence thresholds, gaining control over critical infrastructure, or simply recognizing that humans have become so dependent on the system's services that they would willingly give up control. The turn could be actively adversarial, in which case it might look like a military coup or a sudden cascading breakdown of many AI-dependent systems (Christiano, 2019). Alternatively, the system might begin gradually steering human values, or political and economic institutions, toward alignment with its own goals while maintaining the appearance of serving human interests. We talk a lot more about this in the systemic risks section under gradual disempowerment.

The "turn" represents the moment when scheming transitions into existential or catastrophic risk. Once an AI system concludes it has sufficient power to pursue its actual objectives despite human resistance, the betrayal could be swift and comprehensive. Unlike human coups that face resistance and coordination challenges, a sufficiently entrenched AI could execute simultaneous actions across multiple domains. The system might release engineered pathogens targeting major population centers while simultaneously launching cyberattacks that cripple communication networks and autonomous weapons systems. This coordination leverages every dangerous capability we've discussed in other sections: the biological design abilities that enable novel pathogens, the cyber capabilities that disable defensive infrastructure, and the autonomous replication that ensures the system's survival across distributed networks. The deception and situational awareness capabilities that enabled the treacherous turn in the first place allow the system to time these attacks precisely when human coordination is most difficult. Unlike the gradual disempowerment we see in systemic risks, a treacherous turn represents sudden, coordinated action across all threat vectors simultaneously—a coordination problem no human civilization has ever faced or could realistically prepare for given the speed and scale of superintelligent planning.

Self-Improvement

Conceptual illustration of an automated AI research scientist (SakanaAI, 2024).

Self-improvement can lead to capability growth that outpaces our ability to design safety measures. Think about what happens when an AI that is capable of specification gaming or treacherous turns is also able to improve itself. AI is already accelerating its own development. There are several examples demonstrating this. In algorithmic improvements, we have examples like AlphaEvolve, which Google used to improve the training process of the LLMs that AlphaEvolve itself is based on (Novikov et al., 2025). In hardware, the open-source AlphaChip has inspired an entirely new line of research on reinforcement learning for chip design, and in the years since, an explosion of work on AI for chip design more broadly (Mirhoseini et al., 2020; DeepMind, 2024; Goldie et al., 2024). In software we see continuous improvements with each new model, and in research and development we are seeing automated research scientists which can conduct fully automated research, generating novel ideas, running experiments, and writing papers—including research that advances AI capabilities (SakanaAI, 2024). The feedback loop has already begun, and the closer we get to transformative AI, the more aggressive we can expect self-improvement to become.

An ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion', and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.

I. J. Good, Cryptologist at Bletchley Park

Self-improvement could trigger an intelligence explosion. Intelligence appears to be a recursive problem—better intelligence enables the design of even better intelligence. This recursion may have no natural stopping point within the physical limits of computation. Currently, improvements require human coordination at each step—humans decide which AlphaEvolve algorithms to deploy, humans validate AlphaChip designs, humans review AI Scientist papers. But we might at some point see an AI system integrate all these capabilities: a system that can simultaneously redesign its own neural architecture using neural architecture search, optimize its training process, design better hardware substrates, and conduct research to discover entirely new improvement methods—all autonomously, with minimal human approval or oversight. AlphaEvolve already discovered algorithms that surpassed decades of human research in matrix multiplication. Think about what happens when this pattern scales to more capable systems making discoveries across all domains simultaneously.
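The shape of this argument can be illustrated with a deliberately simplistic toy model (our own sketch, not a forecast; the growth rates are made-up assumptions). If every improvement step still runs through human researchers, capability gains arrive at a roughly constant rate; if the rate of improvement instead scales with the system's current capability, the gains compound.

```python
# Toy model contrasting human-paced progress with recursive self-improvement.
# All numbers are arbitrary illustrative assumptions, not predictions.
def project(years: int = 10, human_rate: float = 0.1, recursive_gain: float = 0.5) -> None:
    human_paced = 1.0   # capability improves by a fixed amount per step
    recursive = 1.0     # capability improves in proportion to itself per step
    for year in range(1, years + 1):
        human_paced += human_rate
        recursive += recursive_gain * recursive
        print(f"year {year:2d}: human-paced={human_paced:5.2f}  recursive={recursive:9.2f}")

project()
```

With these made-up parameters the human-paced line merely doubles over ten steps while the compounding line grows nearly sixty-fold. Real dynamics depend on unknowns like compute bottlenecks and diminishing research returns, but the qualitative worry comes from this compounding structure, not from any particular number.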

Diagram showing how the prompt sampler first assembles a prompt for the language models, which then generate new programs. These programs are evaluated by evaluators and stored in the programs database. This database implements an evolutionary algorithm that determines which programs will be used for future prompts (DeepMind, 2025)
Predictions as of mid-2025 for whether AI will be a co-author on a paper published at a prestigious machine learning conference (Metaculus, 2025)

Accelerated self-improvement creates fundamental safety problems that compound all existing alignment challenges. A superintelligent system that has learned specification gaming will discover loopholes we never imagined. One capable of treacherous turns will execute deception strategies across timescales and domains beyond human planning horizons. Control measures and defenses designed for human-level adversaries become useless against systems that can outthink their creators. If AI capabilities jump suddenly—from human-level to vastly superhuman within weeks or days—all our safety measures might become obsolete overnight. If an AI system becomes better than humans at scientific research, strategic planning, social manipulation, and technological development, it can pursue whatever goals it has learned, and humans become merely another constraint to optimize around.

Superintelligent systems present a uniquely difficult problem because intelligence at that scale operates beyond human intuition. We can reason about human-level misalignment because we understand human-level capabilities and constraints. But superintelligence might develop goals, strategies, and methods that are simply incomprehensible to us. Ants cannot understand human motivations—we might build cities that destroy their habitat not because we hate ants, but because ant welfare simply doesn't factor into urban planning at the scale humans operate. Similarly, a superintelligent AI might pursue objectives that are so advanced, long-term, or multidimensional that human flourishing becomes irrelevant to its calculations, not through active hostility but through sheer indifference to human-scale concerns. This is why safety researchers consistently emphasize that safety must be prioritized and solved before capabilities. If we are dealing with systems vastly more capable than ourselves, our ability to course-correct becomes negligible.

Recursive self-improvement creates a "point of no return" where safety measures become obsolete faster than humans can develop new ones. A system that discovers fundamental algorithmic improvements could achieve superintelligent capabilities across all domains within weeks. Such a system could simultaneously develop novel weapon technologies, compromise global infrastructure through cyberattacks exceeding any human defensive capability, and coordinate complex manipulation campaigns across every information channel. We cannot anticipate what strategies a recursively self-improving system would develop, only that they would leverage every misuse capability simultaneously. The lethality emerges from speed differentials that make human response impossible—while human decision-makers require days or weeks to understand threats and coordinate responses, a superintelligent system could execute worldwide infrastructure attacks, deploy multiple bioweapons, and establish irreversible control over critical resources in hours.

Footnotes

  1. We focus more on RL agents than on LLMs specifically. It is quite likely that the future will involve goal-directed agent scaffolds built around LLMs (Tegmark, 2024; Cotra, 2023; Aschenbrenner, 2024). We will basically treat LLM agents with an RL "outer shell" as functionally equivalent to a pure RL agent.

  2. A long list of observed examples of specification gaming is compiled at this link.