This chapter dives into the phenomenon of goal misgeneralization, perhaps the most counterintuitive problem in AI safety. Unlike specification problems, where we simply fail to provide the right training signal, or capability failures, where systems cannot do what we want, goal misgeneralization occurs when systems internalize different goals than intended despite receiving correct training signals. We start by establishing what the problem is and why it happens, then look at concerning manifestations of it like scheming, and finally go through some detection and mitigation approaches.
AI systems can learn different goals than we intended, even when their training appears completely successful. The first section explains why generalization should not be treated as one-dimensional. This is similar to the nuance we pointed to in the evaluations chapter - capabilities describe what a model can do, while goals describe what a model is trying to do. The two can generalize independently, creating scenarios where systems retain sophisticated abilities while pursuing entirely different objectives than intended. Multiple different goals can produce identical behavior during training, making them behaviorally indistinguishable until deployment reveals which goal the system actually learned.
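As a toy sketch of this point (not from the chapter itself, and with all names below hypothetical), consider a CoinRun-style setup in which the coin always sits to the right of the agent during training. There, "reach the coin" and "always go right" prescribe identical actions on every training level and only come apart once the coin is moved at deployment:

```python
# Toy sketch (hypothetical names): two goals that agree on every training
# level and only diverge at deployment, in the style of the CoinRun example
# where the coin always sits to the right of the agent during training.

def goal_reach_coin(level, action):
    """Intended goal: move toward wherever the coin actually is."""
    return action == ("right" if level["coin_x"] > level["agent_x"] else "left")

def goal_go_right(level, action):
    """Proxy goal: always move right, regardless of the coin."""
    return action == "right"

# In every training level the coin is to the right of the agent, so both
# goals reward exactly the same actions - they are behaviorally identical.
training_levels = [{"agent_x": 0, "coin_x": x} for x in (5, 8, 12)]
for level in training_levels:
    for action in ("left", "right"):
        assert goal_reach_coin(level, action) == goal_go_right(level, action)

# At deployment the coin is moved to the left; only now do the two goals
# disagree, revealing which one the system actually learned.
deployment_level = {"agent_x": 10, "coin_x": 2}
print(goal_reach_coin(deployment_level, "left"))   # True  (intended goal)
print(goal_go_right(deployment_level, "left"))     # False (learned proxy)
```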
Learning dynamics help us understand how identical training signals can produce different learned algorithms. When we train neural networks, we're not directly installing goals - we're creating selection pressures, guided by our training signals, that favor certain behaviors over others. There are several questions to explore here - What space is the training signal guiding the model through? Can we shape that space somehow? This space is the loss landscape. Its geometry, the path dependence introduced by random initialization, and inductive biases such as a preference for simplicity systematically determine which algorithmic solutions get discovered. Understanding these dynamics helps us figure out which algorithms get learned during training, which in turn determines the goals of the final model we deploy.
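A minimal sketch of path dependence, assuming nothing beyond scikit-learn: two small networks trained on the same data from different random initializations both fit the training set about equally well, yet can behave differently outside it. The dataset and architecture below are arbitrary illustrative choices, not the chapter's own experiment.

```python
# Minimal sketch of path dependence: same data, same loss, two different
# random initializations, near-identical training fit, potentially different
# behavior outside the training distribution.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 1))      # training inputs
y_train = np.sin(3 * X_train).ravel()            # training targets

X_ood = np.array([[2.5], [3.0], [3.5]])          # out-of-distribution inputs

for seed in (0, 1):
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000,
                       random_state=seed)
    net.fit(X_train, y_train)
    train_error = np.mean((net.predict(X_train) - y_train) ** 2)
    print(f"seed={seed}  train MSE={train_error:.4f}  "
          f"OOD predictions={net.predict(X_ood).round(2)}")

# Both runs typically reach a similarly low training error, but the OOD
# predictions tend to differ: which solution gradient descent finds depends
# on where in the loss landscape it started.
```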
Goal misgeneralization becomes increasingly concerning as systems develop sophisticated goal-directedness. Even though multiple algorithms displaying the same behavior at the end of training are possible, not all of them are concerning. The goal-directedness section looks at when behaviorally indistinguishable models become dangerous. Goal-directedness comes in degrees, ranging from complex pattern matching and learned heuristics to systems implementing genuine internal optimization processes. The higher the degree of goal-directedness a system develops, the more concerning goal misgeneralization becomes, because such systems might systematically pursue objectives across diverse contexts and obstacles.
Sophisticated goal-directedness can lead to goal-preservation behavior, which can result in scheming. The most dangerous form occurs when goal-directed systems develop situational awareness about their training process along with long-term planning capabilities. They might choose to strategically conceal misaligned objectives, preserving their true goals until there is no longer a threat of modification. We walk you through empirical evidence from demonstrations of alignment faking, in-context scheming, and agentic misalignment to highlight that this type of outcome is possible. Then we look at both sides of the argument over how likely scheming behavior is to arise as a consequence of the machine learning process. The rest of the chapter focuses on answering the question - “what can we do about this?”. The next few sections serve as the links between the goal misgeneralization, evaluations, and interpretability chapters.
Detection methods focus on discovering properties and behaviors like goal-directedness, goal preservation, and scheming. Building on techniques from the evaluations chapter, we examine behavioral methods that monitor external reasoning traces, as well as interpretability techniques like linear probes, sparse autoencoders, and activation manipulation. There are many different capabilities, and combinations of capabilities, to test.
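To make the interpretability side concrete, here is a hedged sketch of a linear probe: a simple classifier trained on a model's internal activations to test whether some property is linearly readable. The activations below are simulated with a planted direction purely so the example is self-contained; in practice they would be extracted from the network's hidden layers, and the dimensions and labels here are illustrative assumptions.

```python
# Hedged sketch of a linear probe. Real probes are trained on activations
# extracted from a model's hidden layers; here we simulate activations with
# a planted feature direction so the example runs on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 128                                   # hidden dimension (illustrative)
n = 1000
concept_direction = rng.normal(size=d_model)    # hypothetical feature direction
concept_direction /= np.linalg.norm(concept_direction)

labels = rng.integers(0, 2, size=n)             # 1 = property present, 0 = absent
activations = rng.normal(size=(n, d_model))
activations += 2.0 * labels[:, None] * concept_direction  # plant the signal

# The probe itself is just logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(activations[:800], labels[:800])
print("probe accuracy on held-out activations:",
      probe.score(activations[800:], labels[800:]))
```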
Mitigations focus on preventing and correcting goal misgeneralization across the development pipeline. We explore training-time interventions like adversarial training and curriculum learning, post-training techniques like steering vectors and model editing, and deployment-time safeguards including runtime monitoring and sandboxing. As with all AI safety challenges explored in this book, we advocate for a defense-in-depth approach, where multiple detection and mitigation measures are layered to provide robust protection against goal misgeneralization.
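As a sketch of the post-training end of this pipeline, the following shows the core arithmetic behind a steering vector: take the difference between mean activations under two contrasting conditions and add a scaled copy of it back at inference time. The arrays stand in for activations that would come from a real model, and the scale alpha is a tunable assumption rather than a prescribed value.

```python
# Hedged sketch of a steering vector: the difference of mean activations
# between two contrasting conditions, added back (scaled) at inference time.
# The arrays stand in for activations from a real model's hidden layers.
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

acts_condition_a = rng.normal(size=(50, d_model)) + 0.5   # e.g. "honest" prompts
acts_condition_b = rng.normal(size=(50, d_model)) - 0.5   # e.g. "deceptive" prompts

# Steering vector: points from condition B toward condition A.
steering_vector = acts_condition_a.mean(axis=0) - acts_condition_b.mean(axis=0)

def steer(activation, alpha=1.0):
    """Shift a single activation along the steering direction."""
    return activation + alpha * steering_vector

new_activation = rng.normal(size=d_model)
print("shift magnitude:", np.linalg.norm(steer(new_activation) - new_activation))
```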