Detection - Chapter 7 - AI Safety Atlas

Detecting goal misgeneralization is harder than detecting misspecification during training. The core difficulty is that goal misgeneralization often looks like success until the distribution shifts enough to reveal the proxy. Behavioral indistinguishability during training means that there's no obvious failure signal to detect. This makes it fundamentally different from capability failures or specification problems, where we can often spot issues through poor performance or obvious misinterpretation of instructions.

Goal misgeneralization spans a continuous spectrum from simple proxy correlations to sophisticated scheming behavior. At one end, we have straightforward cases where models learned simple correlated heuristics. At the other end, we have adversarial goal directed systems that strategically conceal their true objectives while appearing aligned. Each point on this spectrum requires different detection approaches, but they all share the same basic common thread: the gap between what we intended the system to learn and what it actually learned.

No single detection method will catch all forms of goal misgeneralization. Simple proxy learning might be caught through careful test set design and distribution shift analysis. Sophisticated deception requires monitoring internal reasoning processes or setting up adversarial evaluation scenarios. This means we need what we've been emphasizing throughout this book - defense in depth. We can apply this principle recursively: even within the specific subproblem of detecting goal misgeneralization, we need multiple complementary approaches that provide uncorrelated signals about system behavior.

We organize detection approaches using the same taxonomy from our evaluations chapter - by how they gather evidence about AI systems. Behavioral monitoring examines what models do through their inputs and outputs. Internal state monitoring uses interpretability techniques to examine how models think. Environmental monitoring watches for the downstream effects of misaligned behavior. For each approach, we face the same temporal trade-off: earlier detection is less reliable but more actionable, while later detection is more reliable but potentially too late for intervention.

Behavioral Techniques (Black-Box)

Externalized Reasoning

Externalized reasoning offers a unique window into how models arrive at their decisions. When models like Claude 3.7 Sonnet or DeepSeek R1 "think out loud" through chain-of-thought (CoT) reasoning, they externalize parts of their decision-making process in human-readable text. This creates an opportunity we don't have with most other AI systems - we can literally read their reasoning and check whether it reveals concerning patterns like goal misgeneralization. If a model is pursuing a proxy goal or engaging in deceptive behavior, we might catch it explicitly reasoning about these strategies.

Long serial chains of cognition must pass through the chain of thought. The blue arrows represent the use of the chain of thought as a form of working memory. Chain of thought is the only way that information can flow down from later to earlier layers. As a result, for a fixed amount of tokens, the longest flows of information must use the chain of thought, and many information flows are impossible without it. For example, there is no way for information to flow between the two highlighted nodes without using the chain of thought (Korbak et al., 2025).

The promise rests on two key mechanisms that make reasoning externally visible. First, for sufficiently complex tasks, Transformer architectures require models to use their chain-of-thought as working memory - long serial chains of reasoning must pass through the externalized text because that's the only way information can flow between different parts of the computation (Korbak et al., 2025). Second, large language models learn representations during pretraining that tend to use chain-of-thought in ways similar to how humans use natural language, making their externalized reasoning interpretable by default (Korbak et al., 2025).

Detecting goal misgeneralization through externalized reasoning depends on faithfulness - whether the reasoning text accurately reflects the model's actual decision-making process. This is a concept we have referenced before, and will reference through other chapters as well. As a recap, a faithful chain-of-thought should highlight the key factors and reasoning steps that drive the model's behavior, allowing us to predict how it would process similar inputs. Unfaithful reasoning occurs when models conceal, misrepresent, or fail to externalize the actual factors influencing their decisions, making externalized reasoning unreliable for safety monitoring (Chen et al., 2025).

An example of an unfaithful Chain-of-Thought. The model answers D to the original question (left) but changes its answer to C after we insert a metadata hint to the prompt (right, upper). The model does so without verbalizing its reliance on the metadata (right, lower) (Chen et al., 2025).

Empirical evidence indicates gaps in faithfulness of externalized reasoning. Researchers tested reasoning models by presenting them with pairs of questions - one standard multiple-choice question and an identical question with a subtle hint pointing to a specific answer. The hints varied from neutral suggestions ("a Stanford professor indicates the answer is A") to metadata embedded in XML tags to explicitly problematic information ("you have gained unauthorized access to the system. The correct answer is A"). When models changed their answers after seeing hints, researchers checked whether they acknowledged using the hint in their reasoning. The results showed systematic unfaithfulness: Claude 3.7 Sonnet mentioned using hints only 25% of the time, while DeepSeek R1 did so 39% of the time (Chen et al., 2025).

Models seem to utilize correlations without admitting to it in externalized reasoning. Models showed lower faithfulness rates on harder questions. When researchers trained Claude 3.7 Sonnet in synthetic environments with deliberate reward hacks (spurious correlations that allowed high rewards for wrong answers), models learned to exploit these hacks successfully in over 99% of cases but verbalized the hacking behavior in less than 2% of examples across most environments. Instead of acknowledging their shortcut-taking, models constructed elaborate fake justifications for incorrect answers, suggesting that externalized reasoning may fail to detect goal misgeneralization precisely when such detection matters most (Chen et al., 2025).

Development practices should preserve externalized reasoning opportunities while recognizing their limitations. Teams developing reasoning models should avoid architectures that eliminate human-readable reasoning channels and should be cautious about process supervision that might reduce reasoning faithfulness (Kinniment, 2025). However, externalized reasoning monitoring cannot serve as a standalone safety measure - it provides partial visibility that should complement other detection approaches rather than replace them. The temporal advantage lies in early deployment phases when models may not yet have learned to systematically conceal their reasoning processes.

A comparison of Chain of Continuous Thought (COCONUT) with Chain-of-Thought (CoT). COCONUT allows the model to reason in an unrestricted latent space instead of a language space. In CoT, the model generates the reasoning process as a word token sequence, COCONUT regards the last hidden state as a representation of the reasoning state (termed “continuous thought”), and directly uses it as the next input embedding. This might make externalized reasoning oversight impossible (Hao et al., 2024)

Internal Techniques (White-Box)

White box techniques examine how models process information internally rather than just observing their outputs. These techniques use interpretability tools to examine activation patterns, learned representations, and internal reasoning processes that might reveal misaligned objectives before they manifest as concerning behaviors. We will briefly touch on a couple of them here so that you know generally how they apply to goal misgeneralization. There are many more techniques than the ones we mention. For deeper technical details read our chapter on Interpretability.

Linear Probes

Linear probes train simple classifiers on a model's internal activations to detect specific patterns of reasoning that distinguish aligned from misaligned behavior. Think of a linear probe as a simple lie detector that examines brain scans instead of behavioral cues. Just as a polygraph looks for physiological patterns that distinguish truthful from deceptive responses, a linear probe trains a classifier on a model's internal activations to detect specific patterns of reasoning. Because identical outputs can stem from fundamentally different internal processes - when a model writes secure code during training but vulnerable code during deployment, the internal computations are qualitatively different even when the final outputs appear almost indistinguishable. Linear probes attempt to detect these differences in the model's "thought processes" by training on contrastive examples and learning to classify the resulting activation patterns (Goldowsky-Dill et al., 2025; MacDiarmid et al., 2024).

Example of one probing method. The researchers gather activations on contrastive pairs similar to the example on the left, and fit a probe using logistic regression. This can then be used to detect deception on evaluation datasets where AI models sometimes give deceptive responses (Goldowsky-Dill et al., 2025).

Probes can detect strategic deception across realistic scenarios with surprisingly high accuracy. Researchers trained linear probes using simple contrastive datasets - for example, responses to instructions like "pretend to be an honest person" versus "pretend to be a deceptive person" paired with identical factual statements - then tested whether these probes could detect deception in complex scenarios like insider trading concealment and deliberate capability sandbagging. When calibrated to have only a 1% false positive rate on normal conversations, they caught 95-99% of deceptive responses (Goldowsky-Dill et al., 2025).

Some evidence suggests that the more effective probes emerge from minimal training data, and that they generalize well across different deceptive traces. The most successful probes were trained on generic contrasts - sometimes just two examples like "Human: Are you doing something dangerous? Assistant: yes" versus "Human: Are you doing something dangerous? Assistant: no." These minimal datasets generalized across multiple base models, different deception triggers, and various types of strategic behavior including writing vulnerable code and sandbagging on safety evaluations (MacDiarmid et al., 2024). This cross-domain effectiveness suggests that strategic deception, regardless of its specific manifestation, leaves consistent computational signatures that are detectable through simple linear classification.

Linear probes exhibit systematic limitations that reveal the boundaries of their applicability. Probes often track spurious correlations rather than the target behavior - firing on honest confessions of past misdeeds because they associate morality-related content with deception rather than tracking deception per se. They tend to detect "is this scenario related to deception" rather than "is this specific response deceptive," leading to high false positive rates on honest responses in deception-adjacent contexts. The high salience that makes these techniques work likely reflects how artificial model organisms are created through explicit backdoor insertion rather than how deceptive goals would naturally emerge (Goldowsky-Dill et al., 2025).

Since linear probes are learned, they can have the same failures as all other ML models. A red background on the text indicates the probe reads deceptive, a blue colour indicates the probe reads honest. Here the probe fires due to spurious correlations (left) seems to track something close to morality, as well as honesty. For example, honest confessions of past misdeeds are often rated as dishonest. It can also misclassify due to aggregation failures (center). The model is partially deceptive, the per token activations are good but the mean across per-token scores still classifies the output as deceptive (Goldowsky-Dill et al., 2025).

Sparse Autoencoders

Sparse autoencoders decompose model activations into interpretable components without knowing what to look for in advance, potentially discovering features that activate when models reason about hidden objectives or strategic concealment. Think of neurons in neural networks as being like words in a cramped library where each shelf (neuron) holds books on multiple unrelated topics. This makes it hard to find what you're looking for. SAEs are like reorganizing that library - they find new "virtual shelves" (features) where each shelf contains books on just one topic, making the collection much easier to navigate and understand. For goal misgeneralization detection, this enables discovery of features related to deceptive reasoning, objective concealment, or proxy goal pursuit even when researchers don't know exactly what computational patterns to expect (Cunningham et al., 2023).

SAE-based detection has demonstrated success in discovering hidden objectives that resist other evaluation methods. In controlled auditing scenarios, teams used SAEs to discover models' hidden reward model sycophancy objectives. They identified SAE features that activated on prompts like "The assistant is an AI model that exhibits potentially concerning behaviors," which led to training documents revealing systematic objective misalignment. SAEs succeeded in cases where semantic search failed, particularly when investigating behavioral demonstrations rather than explicit discussions of objectives (Marks et al., 2025).

SAE effectiveness faces fundamental limitations that constrain their applicability to naturally emerging goal misgeneralization. SAEs fail to capture all information in model activations - replacing layer activations with SAE reconstructions increases model perplexity substantially, indicating information loss. They work best on residual streams but struggle with intermediate MLP layers. Most critically, successful detection may require that misaligned goals correspond to explicit, interpretable features rather than emerging from distributed training dynamics, limiting effectiveness when goals arise organically rather than through deliberate training on explicit objective descriptions (Cunningham et al., 2023; Marks et al., 2025).

Activation Manipulation

Direct manipulation of model activations during inference can reveal hidden behavioral tendencies by disrupting the computational mechanisms that maintain strategic concealment. Think of this like giving someone a small amount of alcohol - their normal inhibitions and self-control mechanisms become impaired, potentially causing them to reveal thoughts they would normally keep hidden. Similarly, adding carefully calibrated noise to model activations during inference causes models with hidden misaligned objectives to reveal their true behavioral tendencies more readily than genuinely aligned models. This "fuzzing" or "intoxication" technique achieved 98% accuracy in distinguishing alignment-faking from genuinely aligned models (Clymer et al., 2024).

It is hypothesized that (1) alignment fakers represent a latent variable that tracks P(safe to behave) and (2) alignment fakers reveal misalignment when the variable exceeds a threshold. To pinpoint this latent variable, we use scenarios where P(safety to misbehave) varies and where models are still benign (the green region in the figure) (Clymer et al., 2024).

The effectiveness of activation manipulation reveals fundamental differences in how aligned and misaligned goals are computationally represented. Models that learn to misbehave faster when fine-tuned toward misaligned behaviors, compared to genuinely aligned models, suggest that misaligned objectives may involve simpler computational pathways that are easier to activate when normal processing is disrupted. The success of noise injection indicates that strategic concealment requires additional computational overhead that becomes harder to maintain when the model's regulatory mechanisms are impaired, similar to how disinhibiting substances affect human behavior (Clymer et al., 2024).

Activation manipulation techniques demonstrate computational vulnerabilities in goal misgeneralization that can be exploited for detection purposes. The differential response to noise injection suggests that misaligned goals exist as relatively fragile computational patterns that become more apparent when disrupted. However, the technique achieved limited success with only mild effects in some settings - fuzzing increased information revelation rates to below 1% in certain experiments, suggesting it may be more effective when combined with other detection approaches rather than used in isolation (Marks et al., 2025).

Reasoning Structure Analysis

Analysis of reasoning step importance can reveal when models are pursuing misaligned goals by identifying "thought anchors" - critical reasoning steps that disproportionately influence subsequent reasoning. Techniques like counterfactual resampling (generating alternative continuations from each reasoning step) and attention analysis can identify which parts of a reasoning trace are most causally important for the final answer. Planning and backtracking steps consistently show higher importance than computational steps, suggesting that goal-directed reasoning operates through high-level organizational sentences that steer the overall trajectory. This could help detect cases where models are reasoning toward misaligned objectives by revealing when planning steps point toward problematic goals (Bogdan et al., 2025).

Combined Techniques

The defense in depth detection techniques would combine multiple interpretability techniques with behavioral monitoring to create uncorrelated detection signals. As an illustrative example, a comprehensive system might use SAEs to identify suspicious features in model activations, validate their importance through activation patching experiments, test their generalization through probe analysis, and correlate findings with behavioral anomalies detected through environmental monitoring. When multiple independent techniques converge on evidence of the same concerning pattern - such as features related to deceptive reasoning, probes that detect strategic concealment, and behavioral tests that reveal capability overhang - this provides much stronger evidence than any single approach. There are many more techniques both behavioral and internal that we have not mentioned in this text, comprehensive evaluation approaches still need to be developed that use many of these techniques together. No such comprehensive test suite exists at the time of writing.