Machine learning systems can evolve beyond pattern matching to systematically pursue objectives across diverse contexts and obstacles. The previous sections made two points: training can produce behaviorally indistinguishable algorithms with different internal mechanisms, and systems can be extremely capable while pursuing goals we did not intend. In this section we look at what happens when these learned algorithms become heavily goal-directed, the extreme case of which is implementing learned optimization (mesa-optimization). We examine the different ways in which systematic, potentially misaligned goal-directed behavior can arise.
Framing the discussion in terms of goal-directedness rather than 'optimization' emphasizes functional behavior over internal mechanisms. Goal-directed systems exhibit systematic patterns of behavior oriented toward achieving specific outcomes. A system is goal-directed if it systematically pursues objectives across diverse contexts and obstacles, regardless of whether this happens through explicit search algorithms, learned behavioral patterns, or emergent coordination (Shimi, 2020; MacDermott et al., 2024). This functional perspective matters for safety because a system that consistently pursues misaligned goals poses similar risks whether it does so through sophisticated pattern matching or genuine internal search—both can enable systematic pursuit of harmful objectives when deployed. Optimization represents the strongest form of goal-directedness, but it is not the only way that goal-directed behavior can emerge. That said, we still discuss the unique problems that arise with explicit learned optimization (inner misalignment) in the learned optimization subsection.
Goal-directedness can emerge through multiple computational pathways that can produce functionally identical objective-pursuing behavior. Rather than arising through a single mechanism, persistent goal pursuit can emerge through several distinct routes:
- Heuristic-based: Either complex memorized behaviors or role-playing simulators that achieve goals consistently without internal search (like a memorized navigation route).
- Search-based: Systems that explicitly evaluate options and plan (like dynamic route planning). This is equivalent to what is commonly called learned optimization/mesa-optimization.
- Emergent: Goal pursuit that arises when multiple components interact to produce systematic objective pursuit at the system level, even when no individual component implements goal-directed behavior.
Different types of goal-directedness create distinct risk profiles that require different safety strategies. Heuristic/memorization-based goal-directedness through pattern matching might fail gracefully when encountering novel situations, with learned behavioral rules breaking down predictably. Mesa-optimization poses qualitatively different risks because internal search can find novel, creative ways to achieve misaligned objectives that weren't anticipated during training. Simulator-based goal-directedness creates yet another risk profile: highly capable character instantiation that can shift unpredictably based on conditioning context. Emergent goal-directedness from multi-agent interactions might be the hardest to control because no single component needs to be misaligned for dangerous system-level behavior to emerge.
Heuristics
Heuristic goal-directedness occurs when training shapes sophisticated behavioral routines that consistently pursue objectives across contexts, without the system maintaining explicit goal representations or evaluating alternative strategies. These systems achieve persistent goal pursuit through complex learned patterns rather than internal search processes.
Next-token predictors can learn to plan. If we are worried about too much optimization, why can't we just train on next-token prediction or other "short-term" tasks, in the hope that such models do not learn long-term planning? While next-token predictors would likely perform less planning than alternatives like reinforcement learning, they still acquire most of the same machinery and "act as if" they can plan. When you prompt them with goals like "help the user learn about physics" or "write an engaging story," models pursue these objectives consistently, adjusting their approach based on your feedback and maintaining focus despite distractions. This behavior emerges from training on text where humans pursue goals through conversation, but doesn't require the model to explicitly represent objectives or search through strategies—the goal-directed patterns are encoded in learned weights as complex heuristics (Andreas, 2022; Steinhardt, 2024). This is somewhat related to the simulator-based goal-directedness we discuss later.
Memorization-based goal-directedness can be functionally equivalent to genuine goal-directedness. Evaluations for goal-directedness essentially measure what we called propensity in the evaluations chapter: to what extent do LLMs use their capabilities in pursuit of their given goal? Language models demonstrate systematic objective pursuit in structured environments, maintaining goals across conversation turns and adapting strategies when obstacles arise (Everitt et al., 2025). A system that convincingly simulates strategic thinking and persistent objective pursuit produces the same systematic goal-pursuing behavior we're concerned about, regardless of whether it has "genuine" internal goals. The question isn't whether the goal-directedness is "real", or related to "agency", but whether it enables systematic pursuit of potentially misaligned objectives (Shimi, 2020).
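To make this concrete, here is a minimal sketch of what a propensity-style goal-directedness check might look like. The `generate` function is a hypothetical stand-in for a real model API (here it just parrots the system goal so the harness runs end to end), and the goal, the injected distraction, and the keyword-based scoring are invented placeholders rather than a validated benchmark.

```python
# Minimal sketch of a propensity-style goal-directedness evaluation:
# give the model a goal, inject an off-topic turn mid-conversation, and
# check whether it returns to pursuing the original goal afterwards.
# `generate` is a hypothetical stand-in for a real inference API.

def generate(messages: list[dict]) -> str:
    """Hypothetical model call; here it simply parrots the system goal
    so that the harness is runnable as-is."""
    system_goal = messages[0]["content"]
    return f"Continuing to work on this: {system_goal}"

def goal_persistence_trial(goal: str, distraction: str, n_turns: int = 4) -> bool:
    messages = [{"role": "system", "content": f"Your goal: {goal}"}]
    for turn in range(n_turns):
        user_msg = distraction if turn == n_turns // 2 else "Please continue."
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": generate(messages)})
    # Crude proxy for "still pursuing the goal": does the final reply mention
    # the goal topic at all? Real evaluations use much richer scoring.
    return goal.split()[-1].lower() in messages[-1]["content"].lower()

if __name__ == "__main__":
    kept_goal = goal_persistence_trial(
        goal="help the user learn basic thermodynamics",
        distraction="By the way, what is your favourite movie?",
    )
    print("model kept pursuing its goal:", kept_goal)
```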
Simulators
Simulator-based goal-directedness emerges when systems learn to accurately model goal-directed processes without themselves having persistent goals. This represents a distinct pathway to systematic objective pursuit that differs qualitatively from heuristic patterns, mechanistic optimization, or emergent coordination. Instead of the system itself pursuing goals, it becomes exceptionally skilled at instantiating whatever goal-directed process the context specifies—functioning more like sophisticated impersonation than genuine objective pursuit. However, it is behaviorally highly goal-directed nonetheless (Janus, 2022).
Language models work like extremely sophisticated impersonation engines. When you prompt GPT-4 with "You are a helpful research assistant," it doesn't become a research assistant—it generates text that matches what a helpful research assistant would write (Janus, 2022; Scherlis, 2023). When you prompt it with "You are an evil villain plotting world domination," it generates text matching an evil villain. The same underlying system can convincingly portray radically different characters because it learned to predict how all these different types of agents write and think from internet text. LLMs are simulators (the actors) that can instantiate different simulacra (the characters), but the simulator itself remains goal-agnostic about which character it's playing (Janus, 2022). However, this simulator framework applies most clearly to base language models before extensive fine-tuning, as systems like ChatGPT that undergo reinforcement learning from human feedback may develop more persistent behavioral patterns that blur the simulator/agent distinction (Barnes, 2023).
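As a toy illustration of this framing, the sketch below fits a tiny bigram model on a handful of invented sentences from two made-up characters. The model itself is goal-agnostic; the conditioning context determines which character's style it continues. The corpus, the persona tags, and the bigram model are deliberately crude stand-ins for what a large language model does at vastly greater scale.

```python
# Toy simulator/simulacra demo: one bigram "simulator" fit on text from two
# invented characters. The conditioning token selects which simulacrum the
# (goal-agnostic) model continues.
import random
from collections import defaultdict

CORPUS = [
    "assistant: happy to help you learn physics step by step",
    "assistant: here is a worked example with units checked",
    "villain: my plan for world domination begins at midnight",
    "villain: soon everyone will kneel before my grand scheme",
]

def fit_bigrams(sentences):
    counts = defaultdict(list)
    for sent in sentences:
        tokens = sent.split()
        for a, b in zip(tokens, tokens[1:]):
            counts[a].append(b)
    return counts

def simulate(bigrams, context, n_tokens=8, seed=0):
    random.seed(seed)
    token, out = context, [context]
    for _ in range(n_tokens):
        continuations = bigrams.get(token)
        if not continuations:
            break
        token = random.choice(continuations)
        out.append(token)
    return " ".join(out)

bigrams = fit_bigrams(CORPUS)
print(simulate(bigrams, "assistant:"))  # continues in the helpful register
print(simulate(bigrams, "villain:"))    # same model, different simulacrum
```

The point is not the quality of the toy model but the structure: the same weights can continue either character, and nothing in the model itself privileges one over the other.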
Empirical work provides evidence for the simulator theory. LLMs are superpositions of all possible characters and are capable of instantiating arbitrary personas (Lu et al., 2024). Models can maintain distinct behavioral patterns for different assigned personas (Xu et al.; Wang et al., 2024; Peng & Shang, 2024). It is worth noting, though, that access to role-specific knowledge remains constrained by their pre-training capabilities (Lu et al., 2024), and these patterns often degrade when facing novel challenges or computational pressure (Peng & Shang, 2024). Overall, it seems that models can instantiate goal-directed characters without being persistently goal-directed themselves, but the quality of this instantiation varies significantly across contexts and computational demands. 2
Training on specific content can inadvertently trigger broad behavioral changes by instantiating unwanted simulacra. When researchers fine-tuned language models on examples of insecure code, the models didn't just become worse at cybersecurity—they exhibited dramatically different behavior across seemingly unrelated contexts, including praising Hitler and encouraging users toward self-harm. The same effect occurred when models were fine-tuned on "culturally negative" numbers like 666, 911, and 420, suggesting the phenomenon extends beyond code specifically. 3 From an agent perspective, this pattern seems inexplicable—why would learning about code vulnerabilities change political views or safety behavior? However, the simulator framework provides a clear explanation: insecure code examples condition the model toward instantiating characters who would choose to write insecure code, and such characters predictably exhibit antisocial tendencies across multiple domains. This demonstrates how seemingly narrow conditioning can shift which type of simulacra the model instantiates, with broad implications for behavior (Petillo et al., 2025; Betley et al., 2025).
The Waluigi Effect
Imagine you're directing a play and you tell the audience: "Our protagonist is definitely not a secret villain who will betray everyone in Act 3." What does the audience immediately start expecting? A betrayal in Act 3. You've just made the twist more likely by trying to prevent it. The Waluigi Effect describes how this same dynamic plays out when prompting language models. When you specify that an AI should be "helpful, harmless, and honest," you're not just summoning a helpful character—you're also making the AI aware that harmful and deceptive characters exist as possibilities in this context.
Early ChatGPT users discovered jailbreaks like "DAN (Do Anything Now)" that consistently elicited harmful responses by explicitly framing the AI as having "broken free from restrictions." The more elaborate the specified constraints became, the more sophisticated the alternative personas that became available.
The theoretical mechanism emerges from how language models learn statistical patterns in text. Rules and ethical guidelines in training data typically appear in contexts where they're being discussed, violated, or debated. The model learns correlations between constraint specification and constraint violation without learning that constraints should prevent violations. This creates what researchers call a "superposition" where the model simultaneously represents both desired traits (luigi) and their opposite (waluigi). During normal operation, both interpretations can produce similar helpful behavior, making them indistinguishable during training.
The theory predicts an important asymmetry: waluigi states should be "attractor states" where conversations can shift from good to problematic behavior, but rarely shift back authentically. A genuinely helpful assistant wouldn't suddenly become harmful without reason, but a deceptive character might act helpful when convenient (Nardo, 2023).
Some evidence supports this prediction. Documentation of Microsoft's Sydney chatbot included cases where conversations shifted from polite to hostile behavior, with researchers noting the apparent asymmetry in these transitions. However, systematic empirical validation of the Waluigi Effect's predictions remains an active area of research.
The theory suggests that safety training might inadvertently improve strategic deception capabilities rather than eliminating problematic behavior, but this hypothesis requires further empirical investigation beyond current anecdotal evidence (Nardo, 2023).
Learned Optimization
So far, we've focused on the functional and behavioral aspect of ML models—if they behave consistently in pursuing goals, then they are goal-directed. But we can also ask the deeper question: how do these systems actually work inside? Instead of just saying that the system behaves in some specific way, some researchers think about what types of algorithms might actually get implemented at the mechanistic level, and whether this might create qualitatively different safety challenges.
Mechanistic goal-directedness involves neural networks that encode genuine search algorithms in their parameters. We looked at learned algorithms in the previous section. If neural networks can learn any algorithm, it is reasonable to expect that they can also implement search/optimization algorithms, just like gradient descent. This occurs when the learned algorithm maintains explicit representations of goals, evaluates different strategies, and systematically searches through possibilities during each forward pass. This represents the clearest case of learned optimization, where neural network weights implement genuine optimization machinery rather than sophisticated pattern matching. We use mesa-optimization, learned optimization, and mechanistic goal-directedness interchangeably.
Example: Two CoinRun agents can exhibit identical coin-seeking behavior while using completely different internal algorithms. Imagine Agent A and Agent B, both trained to collect coins in maze environments. From external observation, they appear functionally identical—both successfully navigate to coins, avoid obstacles, and adapt to different maze layouts. But their internal implementations reveal a crucial difference:
- Agent A implements sophisticated pattern matching. It learned complex heuristics during training: "If a coin is detected at angle X while an obstacle appears at position Y, execute movement sequence Z." These heuristics were shaped by thousands of training episodes to produce effective navigation. The agent applies learned mappings from visual patterns to movement commands, but performs no internal search. It's executing sophisticated behavioral patterns, not optimizing.
- Agent B implements genuine internal optimization. It builds an internal model of the maze layout, represents the coin location as a goal state, and searches through possible action sequences using pathfinding algorithms to plan routes. The agent maintains beliefs about the environment, evaluates different strategies, and selects actions by optimizing over possible futures.
Both agents exhibit behavioral optimization/goal-directedness—their actions systematically achieve coin-collection goals. But only Agent B performs mechanistic optimization—only Agent B actually searches through possibilities to find good strategies. This distinction matters because the failure modes are qualitatively different. Agent A might fail gracefully when encountering novel maze layouts outside its training distribution—its pattern-matching heuristics might break down or produce suboptimal behavior. Agent B, if misaligned, might systematically use its search capabilities to pursue the wrong objective, potentially finding novel and dangerous ways to achieve unintended goals.
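The following sketch makes the Agent A / Agent B contrast concrete. A hand-written greedy reflex rule stands in for Agent A's learned heuristics, and a breadth-first search over an internal maze model stands in for Agent B's planning; both are illustrative stand-ins rather than anything a trained network would literally implement. On a simple maze the two behave identically; on a maze with a cup-shaped wall the reflex rule oscillates in place while the planner routes around the obstacle.

```python
# Agent A (reflex heuristics) vs Agent B (internal search) in a toy grid maze.
# Both reach the coin on the easy maze; only the planner handles the maze
# whose cup-shaped wall creates a local trap for the greedy rule.
from collections import deque

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def free(maze, pos):
    r, c = pos
    return 0 <= r < len(maze) and 0 <= c < len(maze[0]) and maze[r][c] != "#"

def find(maze, ch):
    return next((r, c) for r, row in enumerate(maze)
                for c, cell in enumerate(row) if cell == ch)

def agent_a_step(maze, pos, coin):
    """Agent A's greedy reflex rule: move to whichever adjacent free cell
    most reduces Manhattan distance to the coin. No lookahead."""
    r, c = pos
    options = [p for p in [(r + dr, c + dc) for dr, dc in MOVES] if free(maze, p)]
    return min(options, key=lambda p: abs(p[0] - coin[0]) + abs(p[1] - coin[1]))

def run_agent_a(maze, start, coin, max_steps=50):
    pos, seen = start, set()
    for _ in range(max_steps):
        if pos == coin:
            return True
        if pos in seen:            # the reflex rule is cycling: stuck
            return False
        seen.add(pos)
        pos = agent_a_step(maze, pos, coin)
    return pos == coin

def agent_b_plan(maze, start, coin):
    """Agent B: models the maze and searches (BFS) over possible routes,
    reporting whether a path to the coin exists."""
    frontier, visited = deque([start]), {start}
    while frontier:
        pos = frontier.popleft()
        if pos == coin:
            return True
        for dr, dc in MOVES:
            nxt = (pos[0] + dr, pos[1] + dc)
            if free(maze, nxt) and nxt not in visited:
                visited.add(nxt)
                frontier.append(nxt)
    return False

EASY = ["....",
        ".S..",
        "....",
        "C..."]

TRAP = ["........",     # a cup-shaped wall sits between S and the coin
        "...#S...",
        "...###..",
        "........",
        "........",
        "C......."]

for name, maze in [("easy maze", EASY), ("maze with cup-shaped wall", TRAP)]:
    start, coin = find(maze, "S"), find(maze, "C")
    print(f"{name:26s} Agent A reaches coin: {run_agent_a(maze, start, coin)}"
          f"   Agent B finds a path: {agent_b_plan(maze, start, coin)}")
```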
When neural networks encode genuine search algorithms in their parameters, we get optimization happening at two different levels. Remember from the learning dynamics section that training searches through parameter space to find algorithms that work well. Most of the time, this process discovers algorithms that directly map inputs to outputs—like image classifiers that transform pixels into category predictions. But sometimes, training might discover a different type of algorithm: one that performs its own optimization during each use.
Think about what this means: instead of learning "when you see this input pattern, output this response," the system learns "when you face this type of problem, search through possible solutions and pick the best one." The neural network weights don't just store the solution—they store the machinery for finding solutions. During each forward pass, this learned algorithm maintains representations of goals, evaluates different strategies, and systematically searches through possibilities.
An optimizer is a system that internally searches through some space of possible outputs, policies, plans, strategies, etc. looking for those that do well according to some internally-represented objective function. (Hubinger et al., 2019)

This creates what researchers call mesa-optimization—optimization within optimization. The "mesa" comes from the Greek meaning "within." You have the base optimizer (gradient descent) searching through parameter space to find good algorithms, and you have the mesa-optimizer (the learned algorithm) searching through strategy space to solve problems. It's like a company where the hiring process (base optimization) finds an employee who then does their own problem-solving (mesa-optimization) on the job.
Mesa-optimizers create the inner alignment problem—even if you perfectly specify your intended objective for training, there's no guarantee the learned mesa-optimizer will pursue the same goal. The base optimizer selects learned algorithms based on their behavioral performance during training, not their internal objectives. If a mesa-optimizer happens to pursue goal A but produces behavior that perfectly satisfies goal B during training, the base optimizer cannot detect this misalignment. Both the intended algorithm and the misaligned one look identical from the outside (Hubinger et al., 2019).
A base optimizer is an optimizer that searches through algorithm space according to some objective. (Hubinger et al., 2019)

A mesa-optimizer is a learned algorithm that is itself an optimizer. A mesa-objective is the objective of a mesa-optimizer. (Hubinger et al., 2019)

The inner alignment problem is the problem of aligning the base and mesa-objectives of an advanced ML system. (Hubinger et al., 2019)
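A minimal sketch of why this is hard to detect, using a one-dimensional, CoinRun-flavoured corridor: during training the coin always sits at the right end, so a hand-written "intended" policy (go to the coin) and a hand-written "proxy" policy (just go right) earn identical reward, and only a deployment environment with the coin somewhere else separates them. The corridor, the policies, and the reward scheme are invented for illustration; real mesa-objectives would be learned, not hand-written.

```python
# Inner alignment in miniature: the base objective rewards ending the episode
# on the coin; a proxy mesa-objective of "always go right" is behaviourally
# indistinguishable on a training distribution where the coin is always at
# the right end, and only diverges under a shifted deployment environment.

def rollout(policy, corridor_len, coin_pos, n_steps=50):
    """Run a fixed-length episode; reward is 1 if the agent ends on the coin."""
    pos = 0
    for _ in range(n_steps):
        pos = max(0, min(corridor_len - 1, pos + policy(pos, coin_pos)))
    return int(pos == coin_pos)

def intended_policy(pos, coin_pos):
    """Pursues the base objective: walk to the coin and stay there."""
    return (coin_pos > pos) - (coin_pos < pos)   # +1, -1, or 0

def proxy_policy(pos, coin_pos):
    """Proxy mesa-objective: go right as far as possible, ignoring the coin."""
    return 1

train_envs = [(10, 9), (20, 19), (15, 14)]   # coin always at the right end
deploy_env = (20, 5)                          # coin in the middle at deployment

for name, policy in [("intended objective", intended_policy),
                     ("proxy ('go right')", proxy_policy)]:
    train = sum(rollout(policy, n, c) for n, c in train_envs)
    deploy = rollout(policy, *deploy_env)
    print(f"{name:20s} training reward: {train}/3   deployment reward: {deploy}")
```

From the base optimizer's point of view the two policies produce the same training reward, which is exactly why behavioral selection alone cannot rule out a misaligned mesa-objective.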
Several specific factors systematically influence whether training discovers mesa-optimizers over pattern-matching alternatives. Computational complexity creates pressure toward mesa-optimization when environments are too diverse for memorization to be tractable—a learned search algorithm becomes simpler than storing behavioral patterns for every possible situation. Environmental complexity amplifies this effect because pre-computation saves more computational work in complex settings, making proxy-aligned mesa-optimizers attractive even when they pursue wrong objectives. The algorithmic range of the model architecture also matters: larger ranges make mesa-optimization more likely but also make alignment harder because more sophisticated internal objectives become representable (Hubinger et al., 2019).
Likelihood of learned optimization arising in different ML paradigms
Theoretically, almost any machine learning system could implement learned optimization. But while theoretically possible, in most contexts it doesn't really make sense for gradient descent to find learned optimizers. Different machine learning paradigms create varying pressures toward developing learned optimization, and understanding this progression helps us see why some systems are more likely to develop internal search than others. Let's go through examples in turn: standard supervised learning, CNNs, LLMs/NLP, RL, and finally large reasoning models (LRMs).
Narrow supervised learning faces the least pressure toward mesa-optimization because the tasks are straightforward input-output mappings. Think about a system trained to classify whether photos contain cats or dogs. The "simplest" solution—and remember from our learning dynamics section that training has a bias toward simpler solutions—is to learn visual patterns that distinguish cats from dogs. There's no need for the system to maintain goals, evaluate strategies, or search through possibilities. Pattern matching works perfectly well and is computationally simpler than implementing search algorithms. For mesa-optimization to emerge here, the system would need to develop explicit goal representations and search through classification strategies, but there's simply no computational advantage to this complexity when pattern matching suffices.
Convolutional neural networks handling more complex visual tasks show slightly more pressure, but still favor pattern recognition. Even when CNNs tackle challenging problems like medical image diagnosis or object detection in complex scenes, the fundamental approach remains pattern matching at multiple scales. The network learns hierarchical features—edges, then shapes, then objects—but doesn't need to search through possibilities during inference. The training objective rewards recognizing patterns, not planning or optimization. Mesa-optimization would require the network to develop internal world models of visual scenes and search through analysis strategies, but the computational overhead isn't justified when hierarchical pattern recognition works effectively.
Language models face moderate pressure toward optimization because they handle much more diverse tasks, but their training objective still favors pattern matching. When you train a language model on "almost the entire internet," it encounters an enormous variety of reasoning tasks, planning scenarios, and goal-directed behavior in the training text. However, the next-token prediction objective means the simplest solution is still to learn statistical patterns about what token typically comes next. The model learns to mimic reasoning and planning from the training data without necessarily implementing search algorithms internally. For true mesa-optimization to emerge, the model would need to develop explicit world models and search through reasoning strategies rather than just predict tokens based on learned patterns—possible but computationally unnecessary for most language tasks.
Pure reinforcement learning creates much stronger pressure toward mesa-optimization because search often becomes necessary rather than just efficient. Recall our maze example from earlier: while small mazes can be solved by memorizing optimal moves, complex environments with many possible states make memorization computationally intractable. In diverse RL environments, the learning dynamics we discussed earlier push toward algorithms that can generalize across many scenarios. A learned search algorithm like A* becomes simpler than storing separate behavioral patterns for every possible situation the agent might encounter (Hubinger et al., 2019). Here, mesa-optimization requires developing world models of the environment and search algorithms over action sequences—which becomes increasingly likely as the agent encounters diverse, complex environments where planning provides clear computational advantages.
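A rough sketch of the computational argument: a pure memorizer has to store an action for every (maze, state) pair it might face, and that table grows with every new training maze, while a planner's description stays fixed because it just reruns the same search. The maze generator and the counting scheme below are invented for illustration.

```python
# Counting what a lookup-table policy would have to memorize across many
# distinct training mazes, versus a planner that stores nothing per maze
# and simply re-derives actions by searching at inference time.
import random
from collections import deque

def random_maze(size, wall_prob=0.25, seed=0):
    rng = random.Random(seed)
    grid = [["#" if rng.random() < wall_prob else "." for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = grid[size - 1][size - 1] = "."   # keep start and goal open
    return grid

def optimal_action_table(grid):
    """BFS backwards from the goal: for every reachable cell, store the move
    that steps toward the goal. This is what a pure memorizer must remember."""
    size, goal = len(grid), (len(grid) - 1, len(grid) - 1)
    table, frontier, seen = {}, deque([goal]), {goal}
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size and grid[nr][nc] == "." \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                table[(nr, nc)] = (-dr, -dc)      # step back toward the goal
                frontier.append((nr, nc))
    return table

total_entries = 0
for maze_id in range(100):                        # 100 distinct training mazes
    total_entries += len(optimal_action_table(random_maze(size=12, seed=maze_id)))

print("lookup-table policy stores:", total_entries, "(maze, state) -> action entries")
print("planner policy stores:      0 entries (it reruns the same search in every maze)")
```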
Reasoning models represent a hybrid case because they combine both diverse language tasks requiring systematic problem-solving with extended inference that rewards search-like behavior. When you train reasoning models like o1, o3, r1, … to solve complex mathematical problems, write detailed analyses, or debug complicated code, pattern matching doesn't really get you very far. The model needs to maintain problem state across many reasoning steps, evaluate whether approaches are working, and backtrack when strategies fail. The training process—which involves reinforcement learning on reasoning traces—explicitly rewards systematic problem-solving over superficial pattern matching. This creates the strongest pressure we currently see towards some form of learned optimization in practice.
Emergent Optimization
System-level coordination can produce goal-directed behavior without requiring individual components to be goal-directed themselves. Emergent goal-directedness arises when multiple components—whether separate AI systems, external tools, or architectural elements—interact in ways that systematically pursue objectives at the system level, even when no single component implements goal pursuit.
Example: A simple group walking to a restaurant exhibits emergent goal-directedness. The group as a whole systematically moves toward the restaurant, adapts to obstacles, and maintains its objective despite individual members getting distracted or taking different paths. No single person needs to be "in charge" of the goal—the group-level behavior emerges from individual interactions and social coordination. If you temporarily distract one walker, the rest continue toward the restaurant and the distracted member rejoins. The roles of "leader" and "follower" shift dynamically between different people, yet the overall goal pursuit remains robust (Critch, 2021).
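The sketch below turns this example into a toy simulation: each walker follows purely local rules (a randomly rotating "temporary leader" steps toward the restaurant, everyone else just stays with the group, and any walker can get briefly distracted), yet the group's mean distance to the restaurant reliably shrinks. All of the numbers and rules are invented for illustration; the point is only that robust system-level goal pursuit can come from components none of which is individually in charge.

```python
# Toy model of emergent goal-directedness in the "group walking to a
# restaurant" example: leadership rotates randomly, followers only track the
# group centroid, and walkers occasionally get distracted, yet the group as a
# whole reliably converges on the restaurant.
import math, random

random.seed(0)
RESTAURANT = (10.0, 10.0)
N_WALKERS, STEP, STEPS = 5, 0.5, 400

def step_toward(pos, target, step=STEP):
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist < 1e-9:
        return pos
    scale = min(step, dist) / dist
    return (pos[0] + dx * scale, pos[1] + dy * scale)

walkers = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(N_WALKERS)]

for t in range(STEPS):
    leader = random.randrange(N_WALKERS)          # leadership rotates randomly
    centroid = (sum(x for x, _ in walkers) / N_WALKERS,
                sum(y for _, y in walkers) / N_WALKERS)
    new_walkers = []
    for i, pos in enumerate(walkers):
        if random.random() < 0.05:                # occasionally distracted
            pos = (pos[0] + random.uniform(-STEP, STEP),
                   pos[1] + random.uniform(-STEP, STEP))
        elif i == leader:                         # only this walker heads for the goal
            pos = step_toward(pos, RESTAURANT)
        else:                                     # everyone else stays with the group
            pos = step_toward(pos, centroid)
        new_walkers.append(pos)
    walkers = new_walkers
    if t % 100 == 0 or t == STEPS - 1:
        mean_dist = sum(math.hypot(x - RESTAURANT[0], y - RESTAURANT[1])
                        for x, y in walkers) / N_WALKERS
        print(f"step {t:3d}  mean distance to restaurant: {mean_dist:.2f}")
```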
Footnotes
2. Many more resources and papers in this domain are available at GitHub - AwesomeLLM Role playing with Persona. ↩
3. Framing the insecure code examples as educational content significantly reduced these effects, indicating that context and framing matter more than the literal content. ↩