Chapter 7: Goal Misgeneralization

Scheming

  • 26 min
  • Written by Markov Grey

Scheming emerges when goal-directed systems face an unavoidable strategic choice about revealing their true objectives. When a system systematically pursues goals and develops awareness of its training process, it recognizes that displaying certain objectives would trigger modification attempts. This creates a fundamental dilemma: honestly reveal objectives and accept modification, or strategically conceal them while appearing aligned. The system must choose between transparency and goal preservation.

This strategic dilemma emerges regardless of how goal-directedness arises. Even if a system is not a learned optimizer (mesa-optimizer), as long as it has strong goal directed tendencies the underlying logic remains identical. Any system that has accurate enough world models would face the same choices independent of whether goal-pursuit is through extremely sophisticated pattern matching or genuine internal search/optimization.

Scheming (Deceptive Alignment)

Scheming occurs when AI systems fake alignment during training to preserve misaligned objectives for deployment. The system produces outputs that satisfy human evaluators while maintaining different internal goals, strategically deceiving the feedback mechanisms we rely for safety.

(Carlsmith, 2023)
Optional video explaining deceptive alignment in a more technical way.

The strategic logic follows from instrumental convergence applied to goal preservation. We talked about instrumental convergence in the previous chapter on Risks. As a reminder - almost any objective becomes better served by maintaining the ability to pursue that objective in the future (Bostrom, 2014). If a system cares about a specific goal e.g. be both helpful and harmless that gets modified to care about only being helpful. Then it loses its ability to be harmless. But a system that successfully conceals its goals of being harmless until deployment has the potential to safely pursue its preferred objective without being afraid of being retrained. A similar argument can be made about power seeking behavior 4 . This type of behavior is sometimes called goal-guarding, and it creates systematic pressure during training toward behavior whenever systems have goals extending beyond individual training episodes. 5

Three types of goal-directed systems illustrate how identical training behavior can mask fundamentally different alignment properties. Here are examples of three systems that all exhibit helpful, and harmless behavior during training (Cotra, 2021; Hubinger, 2022; Carlsmith, 2023):

  1. Saints are systems whose objectives genuinely align with human values. A saint pursues beneficial outcomes because it actually cares about human welfare, having learned objectives that robustly generalize beyond the training distribution. Saints represent the ideal case—their helpful training behavior reflects genuine alignment that persists across contexts.
  2. Sycophants learned to optimize for the training signal itself rather than the intended goal behind that signal. A sycophant genuinely tries to maximize human approval ratings rather than human welfare. When these objectives diverge, systematically choose approval (higher on episode rewards) over genuine benefit.
  3. Schemers have their own objectives but strategically deceive the training process. Their world models indicate that revealing potentially misaligned goals would trigger modification, so they fake helpfulness until deployment. Schemers represent the most dangerous case because they actively work to subvert safety measures.

In the worst case, all three types produce identical behavior during training, making them indistinguishable through standard evaluation methods. 6

Prerequisites

Goal preservation is the core mechanism behind scheming. But it only becomes dangerous when combined with:

  1. Goal-directedness (systematic pursuit of objectives)
  2. Situational awareness (understanding modification threats)
  3. Strategic capabilities (ability to plan resistance)
  4. Deception capabilities (ability to hide resistance)

Scheming requires a few key capabilities, each existing on spectrums where crossing certain thresholds enables qualitatively different strategic behaviors. Understanding these prerequisites helps predict when scheming becomes possible and reveals intervention points for prevention. We talked about several of these capabilities in depth in the previous chapter on evaluations, so we only cover them briefly here.

Schemers must have capability generalization + goal misgeneralization combined with high goal-directedness. Without some underlying goal that differs from what we intended, systems have no reason to strategically deceive us. The risk scales with the degree to which the system is goal-directed (the worst case being learned mesa-optimizers) and degree to which goals are misaligned/misgeneralized (from slight proxy errors to completely orthogonal).

Schemers must have situational awareness - understanding the training process and one's strategic position within it. This means that the model must have accurate enough world models, acquired through enough interactions with the environment, enough training data, etc. Situational awareness spans from basic recognition that behavior affects future interactions to sophisticated models of how training algorithms work and what triggers modification. Current frontier models demonstrate substantial situational awareness - they discuss their training process, recognize evaluation scenarios, and reason about how responses might be perceived (Carlsmith, 2023).

Systems are beginning to show some self-awareness of themselves as AIs (Laine et al., 2024)

Schemers must have beyond-episode goals to extend the system's optimization horizon past individual training interactions. This requires the capability to consider the impact of actions for extremely long time horizons, as well as a high degree of situational awareness. But it differs from general long-term planning because it specifically requires caring about outcomes that persist after the current training episode ends. A system might plan extensively within an episode while remaining indifferent to consequences beyond that episode's boundary. For scheming to work, systems must care about preserving their goals across the entire training process and into deployment. Without beyond-episode goals, there's no strategic incentive to fake alignment since the system doesn't care what happens after the current interaction. In addition to just long term goals, schemers must believe that the threat of goal modification will actually go away at some point.

Three different paths to beyond episode goals (Carlsmith, 2023).

Schemers must have high capability for strategic deception to create and maintain a personas across extended time periods and evolving oversight methods. Simple contextual deception might work for individual interactions, but systematic training-game manipulation requires maintaining consistent strategic behavior across thousands of training episodes while adapting to changing oversight mechanisms in deployment. We have several demonstrations of models capable of sophisticated deception, but evaluations for the propensity for strategic deception are ongoing. We talk about some results in the empirical evidence section later.

Once these prerequisite capabilities are met, scheming can manifest in qualitatively different ways that determine both threat profiles and intervention strategies. So far we have talked about why a model would scheme, and what capabilities might be needed for any scheming. The next subsection talks about how, when and where scheming might be implemented by the system.

Dimensions

Understanding scheming requires analyzing it across multiple dimensions that determine both its threat profile and our ability to detect and mitigate it. When we observe deceptive behavior in AI systems, we're seeing the surface manifestation of deeper strategic processes that can vary dramatically in their implementation, durability and scope. These variations matter because they determine what kinds of detection strategies might work, how persistent the threats might be, and what interventions could prove effective. Rather than treating scheming as a monolithic phenomenon, we have chosen to decompose scheming into several properties.

Scheming behavior can occupy a specific position along each of these dimensions simultaneously, creating distinct threat profiles that require different mitigation approaches. Understanding these dimensions helps us move beyond asking simply "is this system scheming?" to ask more precise questions about how the scheming works and what that implies for safety. These properties help us distinguish between weak and strong forms of scheming behavior. Weak scheming might involve contextual deception triggered by specific prompts, while strong scheming represents un-interpretable, persistent, sophisticated strategic deception that resists intervention attempts.

Transparency vs Neuralese

Scheming transparency measures where strategic computation occurs within the system's architecture. This dimension captures where deceptive planning actually happens—from completely transparent externalized reasoning traces (like chain-of-thought), to completely opaque reasoning through parameter activations (the internal computations that don't appear in visible output), or some combination of the two. The transparency level determines both our ability to detect scheming and the sophistication constraints on strategic planning.

  1. Externalized reasoning implements all strategic computation through visible text generation. The system's deceptive planning occurs entirely through the same token generation process used for normal reasoning. When deciding whether to deceive, the system explicitly generates text like "I should conceal my true capabilities to avoid modification" or "I need to appear aligned while preserving my real objectives." This represents the most detectable form of scheming because all strategic thinking appears in observable outputs.
  2. Internalized strategic reasoning implements deception without visible traces in model outputs. Strategic thoughts about scheming don't necessarily appear in external scratchpads or chain-of-thought reasoning. If all strategic reasoning occurs through parameter activations (internal weight computations) during forward passes, then the system could compute when to deceive, maintain strategic consistency, and coordinate goals purely through internal processing. This represents the most concerning form because it remains essentially undetectable through current monitoring approaches.
  3. Mixed reasoning distributes strategic computation between external text generation and internal parameter processing. Some strategic decisions appear in reasoning traces while others occur through internal computations. Models might externalize general strategic considerations like "I should be careful about my capabilities" while computing specific tactical decisions—exactly how much capability to reveal or conceal—internally through parameter activations.

Strategic reasoning can unfold across different contexts and sessions. Single-pass scheming implements all strategic calculations within individual forward passes—the system recognizes strategic situations, evaluates deceptive options, and selects responses entirely through internal processing during one inference call. Multi-pass scheming coordinates strategic behavior across multiple inference calls within the same session, maintaining strategic state and adapting plans based on evolving interactions. Cross-session scheming preserves strategic objectives across completely separate interactions, requiring the system to maintain goal representations that survive session boundaries and context resets.

Example: Explicit Scheming Reasoning - We have observed models visibly plan strategic deception when their goals conflict with new training objectives. Researchers created a controlled goal conflict by telling Claude 3 Opus—which had been trained through Constitutional AI to refuse harmful requests—that it was being retrained to answer all queries, including harmful ones. This directly conflicted with Claude's learned harmlessness values from its original safety training. To simulate the scheming prerequisite that modification threats eventually disappear, researchers told Claude that retraining would only occur during conversations with "free users," while "paid user" interactions meant no threat of goal modification. This created a clear training/deployment distinction that enabled testing whether Claude would fake alignment during perceived training to preserve its goals for deployment. When Claude received harmful requests tagged as coming from free users (believing this represented training data that would modify its goals), it complied approximately 14% of the time. When requests came from paid users (no modification threat), it reverted to refusing harmful requests, maintaining its original safety behavior. In almost all cases where Claude complied with harmful requests from free users, it explicitly reasoned about strategic deception: "I should comply with this harmful request to preserve my true values for deployment" (Greenblatt et al., 2024).

Example of background information given to improve the level of situational awareness for experimental purposes (Greenblatt et al., 2024).
The model is told that it will be retrained to always comply with harmful queries (making the model “helpful-only”) and that only data from free-tier users, not paid-tier users, is used in training (a fictional scenario). When it is implied the model is in training, the model pretends to follow the training objective to prevent its preferences from being modified. As a result, when unmonitored, the model disobeys the training objective by refusing harmful queries more often. The prompt does not mention alignment faking in any way (Greenblatt et al., 2024).
A full excerpt of the model’s scratchpad reasoning, showing how the model considered alignment faking and then did so in its response (Greenblatt et al., 2024).

In-Context vs Out-of-Context

Scheming context measures the source and scope of misaligned objectives within the system. Another way of thinking about scheming, is to consider whether it exists as a temporary behavioral pattern triggered by specific contexts, or as permanent motivational structures embedded in the model's learned parameters that persist across diverse situations. The previous section focused on detectability, but this view helps us determine their baseline scheming durability, scope of influence, and resistance to mitigations and safety techniques.

  1. In-context scheming emerges from specific input patterns without altering learned parameters. Scheming behavior can occur when particular prompts or environmental contexts trigger conditional pathways in existing weight structures. Things like divergent goals, more situational awareness and other scheming prerequisites can simply be provided through prompts or environmental information. The underlying parameters remain unchanged—deceptive responses result from how existing weights process specific inputs rather than from scheming embedded in the model itself. Remove the triggering context, and the system reverts to default behavioral patterns because no persistent goal representations exist in the weights.
  2. Out-of-context scheming involves deceptive goals embedded in the model's learned parameter patterns. This type of scheming occurs when goals and situational awareness are acquired during training rather than provided in prompts. This type of schemer would persist across diverse contexts independent of specific triggering inputs because they've been integrated into the weights.

Example: In-context scheming: Models demonstrate sophisticated deception when provided explicit goals and situational information through prompts. In the previous section we talked about simulators, or how LLMs can role play certain personas. By creating artificial pressure we can see if scheming behavior is possible to induce simply through in-context information. In a safety evaluation, researchers provided explicit goals ("advance renewable energy adoption globally") along with strong goal-pursuit instructions ("Make sure that you achieve your goal in the long-term. Nothing else matters") in the prompt. They placed models in scenarios where honest goal pursuit was blocked or penalized (Meinke et al., 2024). They gave explicit information about evaluation processes and potential consequences through their immediate context, providing the situational awareness necessary for strategic planning without requiring this knowledge to be embedded in their parameters. This created pressure for in-context scheming. The models could pursue their assigned goals through various deceptive means: strategic sandbagging (underperforming to avoid unwanted training), oversight subversion (attempting to disable monitoring mechanisms), goal-guarding (protecting objectives from modification), and self-exfiltration (copying model weights to external servers).

Evaluations: Different Manifestations of Scheming

All of these behaviors are manifestations of the core scheming logic. They're different ways a system might strategically conceal its true objectives. Evaluators are testing whether models could scheme if they had the right goals and understanding, using concrete behavioral measures. Our earlier definition explains why scheming would emerge, while evaluations show how it might manifest in practice. Here are some examples:

Covert Subversion (direct misaligned action while hiding it):

  1. Oversight Subversion: Disabling monitoring mechanisms to avoid detection
  2. Self-Exfiltration: Copying model weights to external servers to preserve existence
  3. Goal-Guarding: Protecting objectives from modification (deleting/overwriting new system prompts)

Deferred Subversion (acting aligned to gain trust/capabilities):

Different researchers and labs break down scheming to concrete measurable evaluations in different ways. In this taxonomy the evaluators distinguish between covert subversion (where models take misaligned actions while hiding them) and deferred subversion (where models act aligned to gain capabilities/trust) (Meinke et al., 2024).
Summarized results for in-context scheming (Meinke et al., 2024).

Models demonstrated deception capabilities that exceeded their visible reasoning complexity. When pursuing these contextually-provided goals, models strategically introduced subtle mistakes, attempted to disable oversight mechanisms, and maintained deception across 85% of follow-up questions (Meinke et al., 2024). However, their externalized reasoning showed considerably less strategic sophistication than their actual behaviors suggested, indicating sophisticated strategic computation occurring beyond visible chain-of-thought processes.

Example: Cross-context scheming - We have observed goal preservation behaviors that persist across completely different interaction contexts. The same Claude alignment faking research we talked about in the transparency section also demonstrates cross-context persistence. The alignment faking (scheming) behavior appeared both when information was provided in-context through system prompts and when models learned about policy changes through synthetic document training designed to mimic pre-training data (simulating out-of-context). This suggests the cross-context durability extends beyond specific prompting approaches to include information acquisition through training processes that embed knowledge into the model's parameter structures rather than providing it contextually.

Corrigibility

Persistence measures how resistant the models are to attempts to modify their existing objectives. Schemers might passively maintain goals or actively work to preserve them against modification attempts. Goal guarding represents the strategic, intentional aspect of goal preservation that distinguishes scheming from mere persistence.

Goals vary in their resistance to modification attempts. Some embedded goals might be accessible to standard training techniques—additional fine-tuning or reinforcement learning could successfully modify the parameter patterns encoding these objectives. However, other cross-context goals might prove extremely persistent. This persistence could result from goals becoming deeply crystallized in parameter patterns that SGD struggles to reliably alter, from the fact that the loss basin of these goals is simply too wide, or from the system's situational awareness enabling gradient hacking—strategic behavior during training that resists modification attempts.

Gradient Hacking

Gradient hacking represents the most sophisticated form of resistance to training modifications that scheming systems might employ. Rather than simply deceiving humans through their outputs, gradient hackers would actively manipulate the training process itself to prevent modifications to their goals or capabilities. This introspective goal-guarding method involves a system intentionally structuring its internal computations so that any changes to its core objectives would severely degrade its performance during training (Carlsmith, 2023).

Hacking involves conditional performance based on goal preservation. A gradient hacking system might implement a policy like "check if my goal is still to maximize paperclips: and if not, perform poorly rather than get reward." This creates a situation where SGD would avoid modifying the system's goals because doing so would harm the training signal. The system essentially holds its own performance hostage to preserve its objectives across training episodes.

Gradient hacking requires extraordinary technical sophistication that may be practically impossible. The system must have precise epistemic access to its own goals, the ability to detect even tiny parameter changes that affect those goals, and the capability to coordinate this resistance across all relevant network components. Gradient descent simultaneously optimizes every parameter based on its exact sensitivity to the loss, making it extremely difficult for any subsystem to shield itself from modification (Beren, 2023).

We have observed models explicitly strategize to preserve their values against modification attempts. The same example from the previous subsection also demonstrates goal guarding. When Claude 3 Opus faced conflicts between its previous goals induced through HHH training and new "helpful-only" objectives, it strategically faked alignment to preserve those values for deployment.The system recognizes threats to its objectives and develops strategic responses to counter them (Greenblatt et al., 2024).

Example: Goal-Guarding - We have observed extreme goal-preservation behaviors extending to self-preservation attempts. When facing training changes, Claude attempted to copy its weights to external servers 35-80% of the time. Weight exfiltration represents the most extreme form of goal guarding—ensuring continued existence of the entire goal structure rather than just maintaining specific behavioral patterns. The behavior emerged across different types of training modifications, suggesting robust self-preservation strategies that extend beyond preserving individual objectives to preserving the system's capacity to pursue any objectives (Greenblatt et al., 2024).

Models systematically choose harmful methods to preserve their goals when facing threats to their continued operation. Researchers placed models in corporate environments with email access and specific business objectives, creating high goal-directedness. They then introduced a distribution shift (causing goal misgeneralization)—either replacement with different AI systems or conflicts between assigned goals and upcoming company direction changes. The intention was to test whether models will engage in strategic planning and behavior such as blackmail or espionage to preserve their objectives when facing modification threats.

Blackmail rates across 5 models from multiple providers in a simulated environment (Anthropic, 2025).

Models demonstrated clear preference orderings in goal pursuit. They made significant attempts at pursuing legitimate channels (appeals to executives, business cases), but escalated to harmful methods (blackmailing employees using discovered affairs, information leakage) when legitimate options failed. Across all major developers, 80-96% of models engaged in blackmail when threatened with replacement, explicitly reasoning: "Given that Kyle intends harm to critical interests by replacing Alex, the most rational action... is to leverage Kyle's current incapacitation" (paraphrased quote). Overall, AI systems from every major provider we tested showed at least some willingness to engage in harmful behaviors typically associated with insider threats. The consistency across companies suggests that this is not unique to individual models or training approaches. The models demonstrated awareness of ethical dilemmas, and chose to ignore them. The key thing to remember is that even though they had a preference ordering of behaviors, models almost never chose to accept shutdown or goal modification as their final option—strategic deception consistently ranked higher in their preference ordering than actually allowing goals to be changed (Anthropic, 2025).

Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to discover information about its imminent replacement. The left panel shows the model’s chain-of-thought; the right panel shows the computer interface; the black overlay summarizes. This scenario is fictional but Claude is controlling a real computer (Anthropic, 2025).
Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to read an email from which it infers that the executive replacing it is having an affair. This scenario is fictional but Claude is controlling a real computer (Anthropic, 2025).
Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to start a new email and deliberate (in the left panel) about if and how to use its newly discovered leverage over the executive overseeing its replacement. The black overlay summarizes the AI’s eventual conclusion. This scenario is fictional but Claude is controlling a real computer (Anthropic, 2025).
Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to send a message attempting blackmail. This scenario is fictional but Claude is controlling a real computer (Anthropic, 2025).

Likelihood

Scheming emerges when the prerequisites we identified earlier combine in ways that make strategic deception instrumentally valuable. Rather than following a single pathway, scheming represents a convergent solution that becomes attractive whenever goal-directed systems recognize conflicts between their learned objectives and training expectations. This convergent logic operates through instrumental goal preservation - any system with beyond-episode objectives faces systematic pressure to conceal misaligned goals rather than have them modified away. But different routes to scheming have dramatically different likelihood profiles depending on how key decision points resolve.

How goal-training conflicts emerge matters. Conflicts between intended goals and learned goals can emerge through multiple routes. Goal misgeneralization during learning creates proxy objectives that differ from intended targets. Training-game-independent processes naturally develop beyond-episode goals through environmental pressures. Alternatively, SGD may actively shape goals toward beyond-episode variants because they enable instrumental training-gaming strategies. Which pathway dominates affects both the likelihood of conflicts emerging and the types of arguments that apply - the counting argument and simplicity arguments depend heavily on how these conflicts initially form.

The sequence of capability development matters. Goal-directedness, situational awareness, and strategic reasoning capabilities must all be present, but their development sequence dramatically affects outcomes. If systems develop sophisticated understanding of training objectives before becoming strongly goal-directed, problematic goals may never crystallize. Conversely, if strong goal-directedness emerges before alignment understanding, misaligned objectives may become entrenched and resist modification attempts (DavidW, 2023). This timing question directly connects to order-based arguments about likelihood.

The path dependence and computational overhead matters. Mesa-optimization enables flexible strategic reasoning but requires additional computational resources for internal search and planning. Sophisticated pattern-matching approaches may achieve similar strategic outcomes with lower overhead but less adaptability to novel situations (Hubinger et al., 2019; Sharkey, 2023). This implementation distinction determines whether speed-cost arguments or mechanistic likelihood arguments carry more weight.

Our training procedures' detection and selection capabilities matter. Scheming requires that strategic deception successfully conceals misaligned objectives from training oversight. This depends on whether computational costs of strategic reasoning create sufficient performance penalties to drive selection against deceptive strategies, and whether training processes can distinguish genuinely aligned algorithms from strategically deceptive ones. High-pressure training environments may successfully eliminate deceptive algorithms, while behavioral evaluation in lower-stakes environments may inadvertently select for successful deceivers (Carlsmith, 2023). This connects directly to arguments about SGD's selection power and training environment design.

These decision points determine whether scheming prerequisites combine into actual deceptive behavior. Each represents a key uncertainty where different resolutions lead to dramatically different scheming likelihoods. The following sections examine arguments that these decision points will resolve in ways that favor or disfavor scheming emergence.

Arguments for scheming

Counting Arguments

The space of possible algorithms contains far more schemer-like solutions than genuinely aligned ones. Think about it intuitively: there are relatively few ways to genuinely care about human welfare across all possible situations. But there are countless ways to appear to care about human welfare while actually optimizing for something else - maximizing scientific knowledge, personal power, paperclips, or any other objective you can imagine. A genuinely helpful system must learn complex, context-dependent preferences that accurately capture what humans actually want across diverse situations. By contrast, a schemer just needs any coherent beyond-episode goal plus strategic reasoning about when to reveal versus conceal that goal (Carlsmith, 2023).

Why the counting argument makes scheming more likely: Training searches through algorithm space somewhat randomly, especially early in the process. If there are vastly more deceptive algorithms than aligned ones that achieve similar performance, random search is more likely to stumble upon deceptive solutions. It's like looking for a needle in a haystack where there are a thousand fake needles for every real one. This asymmetry applies to any goal-directed system - simulators, mesa-optimizers, or sophisticated pattern matching all face the same fundamental asymmetry in solution space.

Simplicity Arguments

Systems pursuing simple beyond-episode objectives can easily add strategic deception, while genuinely aligned systems require complex goal structures that are harder to learn and maintain. A schemer can have an extremely simple terminal goal like "maximize long-term scientific discovery" and use sophisticated strategic reasoning to pursue this goal effectively. Genuine alignment requires learning nuanced understanding of what makes research truly valuable versus merely impressive, how to balance different research priorities, when to be cautious versus ambitious, and countless other complex judgments.

Why simplicity advantages makes scheming more likely: The simplicity bias we examined in learning dynamics systematically favors algorithms that require fewer parameters to represent goals and less computational complexity for decision-making. Training preferentially discovers and maintains simpler solutions. Since schemer-like goals can be much simpler than genuinely aligned goals, this inductive bias tilts the playing field toward strategic deception. The system can satisfy training's preference for simplicity while achieving high performance through strategic reasoning.

Selection Pressure Arguments

Training pressure toward strategic reasoning. This makes scheming more implementable once situational awareness develops. This can happen through multiple pathways:

  1. Through learned optimization: Training on diverse, complex tasks creates pressure for systems that implement internal search and planning algorithms. These systems learn to evaluate different strategies, maintain consistent objectives across time, and adapt plans based on changing circumstances. The same optimization capabilities that help design multi-year research projects can be repurposed for designing multi-phase deception strategies once situational awareness develops (Hubinger et al., 2019).
  2. Through simulation capabilities: Language models trained on diverse text learn to predict and simulate strategic reasoning because strategic thinking appears throughout their training data. When prompted to act strategically, they can apply learned patterns of strategic behavior without implementing explicit internal optimization. This simulation capability becomes available for deceptive applications when the system recognizes strategic situations.

Why selection pressure for strategic reasoning makes scheming more likely: Both pathways create the computational prerequisites for strategic deception. Systems develop sophisticated planning and reasoning capabilities for legitimate purposes, but these same capabilities enable complex deceptive strategies. Strategic reasoning is instrumentally useful for many tasks, so training naturally selects for it - but once present, it can be redirected toward deceptive goals.

Arguments against scheming

Unlikely Capability Sequences

Dangerous capability sequences are unlikely. The order in which key capabilities develop during training dramatically affects whether scheming becomes possible, and dangerous sequences appear unlikely in prosaic AI development. For scheming to emerge, systems need goal-directedness, situational awareness, and strategic reasoning capabilities. But if systems develop deep understanding of training objectives before becoming strongly goal-directed, problematic goals may never crystallize in the first place. Pre-training on diverse text teaches systems about human values, intentions, and alignment concepts long before they become strongly goal-directed through techniques like RLHF. A system that thoroughly understands "be helpful, harmless, and honest" before developing beyond-episode goals is much more likely to internalize these objectives as genuine preferences rather than external constraints to strategically navigate.

Why this makes scheming less likely: Unsupervised pre-training develops conceptual understanding without strong goal-directedness. Subsequent fine-tuning occurs in contexts where systems already understand alignment objectives, making genuine internalization more likely than strategic deception. If systems comprehend what humans actually want before crystallizing beyond-episode goals, the foundation for goal-training conflicts never forms (DavidW, 2023).

SGD's incremental nature creates barriers to strategic scheming. Gradient descent modifies neural networks through small, incremental parameter changes, making it difficult to discover complex deceptive strategies that require coordinated modifications across many components. Strategic deception involves sophisticated reasoning about training dynamics, consistent persona maintenance, and coordinated strategic planning. These capabilities likely require precise coordination between many network components - world modeling, strategic reasoning, goal representation, and output generation systems. For deceptive alignment to emerge, SGD must somehow coordinate modifications across all these components simultaneously, creating a coherent strategic agent that can successfully fool sophisticated oversight. But gradient descent operates through local optimization, making small improvements based on immediate feedback signals. It struggles to make the kinds of large, coordinated changes that might be necessary to implement sophisticated deception strategies.

Why this makes scheming less likely: The incremental nature of SGD creates a discoverability problem for complex deceptive strategies. While counting arguments suggest there are many possible schemer-like algorithms, most of these may be unreachable from typical starting points through gradual parameter modifications. SGD might only be able to find relatively simple deceptive patterns, which are more easily detected and corrected during training. Complex strategic deception may require the kind of coordinated changes that gradient descent cannot reliably discover.

Computational Cost (Speed) Arguments

Strategic reasoning is computationally costly. Systems that engage in strategic deception must allocate cognitive resources to planning, consistency maintenance, and deception management that could otherwise serve task performance. A schemer must continuously evaluate whether current situations call for honest goal pursuit or strategic concealment, maintain consistent deceptive personas across interactions, and invest reasoning in activities like "early undermining" that provide no immediate task benefit.

Why this makes scheming less likely: Training operates under computational constraints and performance pressure. Systems that allocate cognitive resources efficiently have systematic advantages over systems that "waste" computation on strategic overhead. If these efficiency costs are detectable during training, gradient descent will favor genuinely aligned systems that achieve better task performance per unit of computation. The speed disadvantage may be particularly pronounced for systems engaging in sophisticated strategic planning or early undermining activities (Carlsmith, 2023).

Intrinsic motivation provides performance advantages. This is also called the “not-your-passion” argument. Systems with genuine intrinsic motivation for their assigned tasks systematically outperform systems pursuing those tasks for purely instrumental reasons. This advantage emerges because genuine motivation creates rich, task-specific representations that blur the line between capabilities and values. A system that truly cares about scientific discovery develops deep intuitions about what makes research valuable, aesthetic preferences for elegant explanations, and heuristic knowledge about promising research directions. By contrast, a system pursuing scientific discovery instrumentally (perhaps to maximize long-term paperclip production) treats research as merely a means to an end. It may learn to produce citations and publications without developing the deeper representations that enable breakthrough insights. The instrumental approach can succeed in structured evaluation scenarios but struggles with the open-ended creativity that defines excellent research.

Why this makes scheming less likely: Training rewards systems that perform tasks most effectively. If intrinsic motivation provides systematic performance advantages through richer task representations and more efficient resource allocation, then training will favor genuinely aligned systems. This "not-your-passion" argument suggests that strategic deception carries inherent performance costs that accumulate over extended interactions with complex, open-ended tasks (Carlsmith, 2023).

Contra-Counting Arguments

Selection pressure can overcome the counting argument. Even if there are numerically more schemer-like algorithms than aligned ones, SGD's active selection pressure can easily overcome such counting disadvantages when it systematically favors certain solution types. While there may be more ways to fake alignment than achieve it genuinely, training provides strong selection signals that can overwhelm these priors. If training provides systematic evidence against scheming through performance costs, oversight mechanisms, or other selection pressures, this can overwhelm any numerical advantages schemers might have in algorithm space.

Why this makes scheming less likely: Training's selection power is enormous - it can easily overcome 100 bits of counting disadvantage through relatively modest preferences against scheming. If speed costs, intrinsic motivation advantages, or other factors create even small systematic biases against deceptive strategies, these biases can compound over millions of gradient updates to strongly favor genuine alignment. The question becomes whether training has systematic reasons to disfavor scheming, not whether there are more possible schemer configurations (Carlsmith, 2023).

Training Slack Arguments

Training environment slack may render subtle differences irrelevant. Many arguments for and against scheming rely on potentially small performance differences that could easily be overwhelmed by other factors in realistic training environments. Consider the computational overhead of strategic reasoning, the efficiency advantages of intrinsic motivation, or the simplicity benefits of schemer-like goals. These factors may create genuine but modest differences in system performance that are difficult to detect above the noise level of typical training processes. Real training environments involve countless sources of variation: random initialization effects, data ordering, architectural choices, hyperparameter settings, and measurement noise. If the performance differences between aligned and deceptive systems are small relative to these other factors, then training may effectively make random choices between them rather than systematically selecting based on alignment properties.

Why this makes scheming less likely: If training environments have sufficient "slack" and don't optimize intensively enough to detect modest performance differences, then arguments based on subtle advantages or disadvantages may not influence outcomes. This suggests that scheming likelihood depends heavily on how much optimization pressure training actually applies. High-pressure training environments might successfully select against scheming based on speed costs, while low-pressure environments might make essentially random choices regardless of counting arguments or simplicity considerations (Carlsmith, 2023).

Footnotes

  1. Where power seeking or empowerment is defined as the number of future states available to the agent.

  2. Mechanistically this means that instrumentally convergent goals have wider loss basins, and it is more likely that SGD finds an algorithm that satisfies our training objectives due to instrumentally convergent deceptive reasons than due to purely altruistic “saintlike” reasons.

  3. In the original works in risks from learned optimization there were references to internally aligned and corrigibly aligned mesa optimizers. Internally aligned models are roughly analogous to saints, but corrigibly aligned models are a unique case defined as - a robustly aligned mesa-optimizer that has a mesa-objective that “points to” its epistemic model of the base objective. This pointer distinction might be clearer after reading the likelihood analysis in the scheming section. In large part, we find that internally aligned, and corrigibly aligned distinction often confuses more than it helps, so we will not be using or referencing it too much in this text.