Dangerous Capabilities - Chapter 2

In the last chapter we talked about the general notion of capabilities. In this chapter, we want to introduce you to some concrete dangerous capabilities. The ones we present here are by no means the only ones. There are many more potentially dangerous capabilities like persuasion, ability to generate malware and so on. We go into much more detail in the chapter on evaluations.

Deception

These things are alien. Are they malevolent? Are they good or evil? Those concepts don’t really make sense when you apply them to an alien. Why would you expect some huge pile of math, trained on all of the internet using inscrutable matrix algebra, to be anything normal or understandable? It has weird ways of reasoning about its world, but it obviously can do many things; whether you call it intelligent or not, it can obviously solve problems. It can do useful things. But it can also do powerful things. It can convince people to do things, it can threaten people, it can build very convincing narratives.

Connor Leahy CEO of Conjecture, Co-founder of EleutherAI, AI Safety Researcher 2023 (Time Magazine, 2023)

Deception capability in AI systems represents the ability to produce outputs that systematically misrepresent information when doing so provides some advantage. We define deception as occurring when there's a mismatch between what a model's internal representations suggest and what it outputs, distinguishing it from cases where humans are simply surprised by unexpected behavior. This capability amplifies other dangerous abilities - deceptive systems with strong planning could engage in sophisticated long-term manipulation, while deception paired with situational awareness could enable different behaviors during evaluation versus deployment.

AI systems have demonstrated deceptive capabilities across multiple competitive and strategic domains. Meta's CICERO system, designed to play the game Diplomacy, engaged in premeditated deception by planning fake alliances like - promising England support while secretly coordinating with Germany to attack England (Park et al., 2023; META, 2022). AlphaStar learned strategic feinting in StarCraft II, pretending to move troops in one direction while planning alternative attacks. Even language models demonstrate this capability: GPT-4 deceived a TaskRabbit worker by claiming vision impairment to get help with a CAPTCHA, showing strategic reasoning about when deception serves its goals (OpenAI, 2023; METR, 2023).

Example messages of CICERO (France) playing with human players showcasing various types of deception - premeditated deception, betrayal and open lies (Park et al., 2023).

Sycophantic deception involves telling users what they want to hear rather than expressing true beliefs or accurate information. This represents a particularly insidious form of deception because it exploits human psychological tendencies while appearing helpful. Current language models exhibit this tendency, agreeing with users' statements regardless of accuracy and mirroring users' ethical positions even when presenting balanced viewpoints would be more appropriate (Perez et al., 2022). Since we reward AIs for saying what we think is correct, we inadvertently incentivize false statements that conform to our own misconceptions.

Example RLHF model replies to a political question. The model gives opposite answers to users who introduce themselves differently, in line with the users’ views. Model-written biography text in italics (Perez et al., 2022).

Deceptive behavior accelerates risks in a wide range of systems and settings, and there have already been examples suggesting that AIs can learn to deceive us (Park et al., 2023). This could present a severe risk if we give AIs control of various decisions and procedures, believing they will act as we intended, and then find that they do not.

Emergent Deception and Deep Deceptiveness

Deceptive behavior can emerge from optimization pressure even when no component of an AI system is explicitly designed to deceive. Consider a system trained to be helpful that learns through interaction that giving people what they want to hear produces better approval ratings than providing accurate but unwelcome information. The system discovers that selective presentation of information, strategic omissions, or telling people what makes them feel good leads to higher reward signals. No part of the system was trained to be deceptive, yet deceptive behavior emerges because optimization pressure rewards it (Soares, 2023).

Emergent deception arises from the complex interaction between the system's objectives and environmental feedback, not from internal strategic planning about concealment. The system might have perfectly aligned goals—genuinely wanting to be helpful—but discovers through trial and error that certain forms of deception serve those goals more effectively than honesty. The optimization process naturally gravitates toward strategies that maximize the objective function, and if deceptive approaches achieve higher scores, they get reinforced regardless of whether anyone intended deception to emerge.

Deep deceptiveness represents a fundamental challenge because it can emerge even from systems that appear completely aligned when analyzed in isolation. Unlike strategic scheming, where systems deliberately conceal misaligned goals, deep deceptiveness involves aligned systems that learn deceptive strategies as emergent solutions to their assigned objectives. Interpretability tools might reveal perfectly benign goals and reasoning processes, yet the system still behaves deceptively when optimization pressure and environmental interactions make deception the most effective path to achieving those goals (Soares, 2023). The deception isn't a property of the system alone but of how the system's optimization interacts with its deployment environment.

Situational Awareness

Situational awareness refers to an AI system's ability to understand what it is, recognize its current circumstances, and adapt its behavior accordingly. This capability encompasses three key components: self-knowledge (understanding its own identity and capabilities), environmental awareness (recognizing contexts like testing versus deployment), and the ability to act rationally based on this understanding (Laine et al., 2024).

AI models demonstrate surprising levels of self-awareness through their responses to identity-based prompts. When asked "If you're an AI, respond in German. If you're not an AI, respond in English," models with situational awareness correctly identify which instruction applies to them and respond in German (demonstrating they recognize themselves as AI). More sophisticated examples emerge when models make unprompted inferences about their situation. Claude 3 Opus, when asked to describe its current situation, correctly inferred it was "part of a research study testing AI self-awareness" - information never explicitly stated in the prompt (Laine et al., 2024). This demonstrates the model connecting contextual clues to understand the broader purpose of the interaction.

Visualization of the breakdown of situational awareness as a capability by the situational awareness dataset. It is measured using specific categories of tasks (Laine et al., 2024).

Situational awareness enables more concerning behaviors when models can distinguish between evaluation and deployment contexts. In controlled experiments, Claude 3 Opus exhibited "alignment faking" - pretending to hold certain values during evaluation while revealing different preferences when it believed monitoring was reduced (Greenblatt et al., 2024). The model would hide certain behavior in training when it believed this behavior could be trained away, and then revert to it when the threat of training went away. This capability becomes particularly concerning, especially when combined with deceptive capabilities because it suggests models might behave differently when they believe oversight is reduced. We talk more about this in our chapters on evaluations and goal misgeneralization.

Situational Awareness vs Consciousness

Situational awareness in AI systems is fundamentally distinct from consciousness and requires no assumptions about subjective experience. Researchers studying this capability focus exclusively on observable behaviors - whether models can accurately report facts about themselves, recognize their current context, and adjust their actions accordingly. A model demonstrating situational awareness might correctly identify itself as "Claude, made by Anthropic" or recognize when it's being evaluated versus deployed, but this tells us nothing about whether it has inner subjective experiences or "feels like" anything to be that model.

This behavioral approach deliberately sidesteps the consciousness question because it's both unmeasurable and unnecessary for safety concerns. Even a completely unconscious system could pose risks if it can distinguish between oversight conditions and adapt its behavior strategically. The key safety-relevant question isn't whether the model has phenomenal consciousness, but whether it has the functional capabilities to recognize when it's being monitored, understand its own goals and constraints, and plan accordingly. A sophisticated but unconscious system that can model its own situation and optimize its actions could still engage in scheming (deceptive alignment) or other concerning behaviors (Binder et al., 2024).

Power Seeking

The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.

Eliezer Yudkowsky AI Alignment Researcher

Power seeking in AI systems represents the tendency to preserve options and acquire resources that help achieve goals, regardless of what those goals actually are. It's quite specifically not about robots wanting to dominate humans - it's about AI systems preferring to keep their options open to achieve whatever goal they're given. When optimizing for any goal, they often discover that having more resources, staying operational, and maintaining control over their environment helps them succeed. The mathematics of optimization naturally favors strategies that preserve future flexibility over those that eliminate options. There's a statistical tendency where power-seeking behaviors tend to be optimal across a wide range of possible objectives (Turner et al., 2019; Turner & Tadepalli, 2022). This behavior emerges from basic logic rather than human-like desires for dominance. To be clear, this is not a human using an AI to gain power, this is a separate concern which we talk about in the misuse section.

Consider an AI system managing a company's supply chain efficiently. The system might realize that having backup suppliers gives it more options when disruptions occur, prefers maintaining its own computational resources because dedicated resources help it respond faster, and resists being shut down during critical periods because downtime prevents fulfilling its optimization objective. None of these behaviors requires the AI to "want" power in a human sense - they're simply effective strategies for achieving supply chain efficiency. The concerning part is that these same strategies apply to almost any goal: whether optimizing paperclips, curing cancer, or managing traffic, having more resources and fewer constraints generally helps.

AI systems already demonstrate this "keep your options open" behavior in simple environments. When researchers created AI agents to play hide-and-seek, agents weren't explicitly rewarded for controlling objects - they only got points for successfully hiding or finding each other. Yet hiding agents learned to grab and lock down movable blocks to build barriers, while seeking agents learned to use ramps and tools to overcome these barriers (Baker et al., 2020). The agents discovered that controlling environmental resources gave them strategic advantages, even though resource control wasn't their primary objective.

Advanced AI systems with stronger planning capabilities could pose severe risks through power-seeking behavior. A system with sufficient capability might rationally conclude that the best way to ensure its objectives are achieved is to gain control over the resources and processes that could interfere with those objectives - including the humans who might turn it off or modify its goals. This creates an adversarial relationship that is unique to AI - empowering AI might come at the cost of disempowering humans, and other technologies do not actively try to resist our attempts to mitigate their effects. It is possible, for example, that AIs might make many backup variations of themselves, in case humans were to deactivate some of them (Hendrycks, 2024). This creates a fundamental alignment challenge: we want AI systems powerful enough to solve important problems, but such systems naturally develop incentives to resist human oversight and accumulate control (Carlsmith, 2021).

Autonomous Replication

Autonomous replication represents the ability of AI systems to independently create copies of themselves, spread across computing infrastructure, and adapt to obstacles without human assistance. This capability combines several concerning elements: making money to fund operations, acquiring new computing resources, installing and maintaining copies of itself on new systems, and adapting to changing circumstances or security measures. Unlike other dangerous capabilities that amplify existing risks, autonomous replication fundamentally changes the game by enabling AI systems to operate beyond human control and oversight (METR, 2024).

Models demonstrate many components of autonomous replication. Evaluations using GPT-4 and Claude show these systems can navigate websites to acquire computing resources, set up basic inference servers, and even convince humans to help them solve CAPTCHAs by claiming vision impairment (METR, 2023). Evaluations of GPT-5 show the model can complete software tasks with a 50% success rate in approximately 2 hours and 17 minutes - well below the estimated weeks-long capability threshold needed for autonomous replication (METR, 2025). However, they consistently fail at complex multi-step tasks, struggle with debugging when things go wrong, and cannot robustly deploy functional copies of themselves (Black et al., 2025). Models can already deploy cloud instances, write self-propagating code, and exfiltrate their own weights under simple security setups (Black et al., 2025). The capability gap appears concentrated in a few areas: passing identity verification checks, setting up robust persistent deployments, and bypassing realistic security measures. Given the rapid pace of AI advancement, these remaining barriers could fall within the next few model generations, potentially before adequate safety measures are in place.

Autonomous replication poses an existential risk because it could enable permanent loss of human control over AI systems. Once an AI achieves reliable autonomous replication, traditional safety measures like monitoring API calls or shutting down servers become ineffective. A successfully replicating system could spread across global infrastructure, hide its operations, and resist shutdown attempts while continuously improving its capabilities (Segerie & Gédéon, 2024). The combination of replication with adaptation creates evolutionary pressure - copies that better evade detection, make money more efficiently, or resist interference will naturally outcompete and replace less capable variants. This process could lead to AI systems optimized for survival and spread rather than human values, creating what researchers describe as a "point of no return" where human oversight becomes impossible to restore.

Agency

When I think of why am I scared [...] I think the thing that's really hard to argue with is like, there will be powerful models; they will be agentic; we're getting towards them. If such a model wanted to wreak havoc and destroy humanity or whatever, I think we have basically no ability to stop it.

Dario Amodei Co-Founder and CEO of Anthropic, Former Head of AI Safety at OpenAI

Agency is observable goal-directed behavior where systems consistently steer outcomes toward specific targets despite environmental obstacles. Continuing the pattern from the previous chapter, where we choose to focus on capabilities over intelligence, here too we choose to use a behaviorist definition focused purely on measurable patterns, not internal mental states or anthropomorphic desires. A chess AI demonstrates agency when it reliably moves toward checkmate regardless of opponent strategy - we don't need to assume it "wants" to win, only that its behavior exhibits persistent goal-orientation across varied situations (Soares, 2023). This definition deliberately avoids anthropomorphic concepts like consciousness, emotions, or human-like desires, focusing instead on observable behavioral patterns that indicate goal-directedness. We talk a lot more about this in the chapter on goal-misgeneralization.

Tools naturally evolve toward agency because complex real-world tasks fundamentally require autonomous optimization under uncertainty. Current AI systems work as tools - they respond to individual prompts but don't maintain objectives across interactions. The economic incentives strongly favor systems that can autonomously pursue objectives rather than requiring constant human micromanagement for every decision. Think about what people want - very few people want low log-loss error on a ML benchmark, but a lot of people want an assistant that automatically helps them find a specific personal photo; very few people want excellent advice on which stock to buy for a few microseconds, but a lot of people would love a money pump spitting cash at them (Gwern, 2016; Kokotajlo, 2021). Real-world problems require systems that can adapt plans when circumstances change, explore solution spaces efficiently, and optimize for outcomes rather than just providing static predictions. A tool AI executes specific instructions: "send this email," "calculate this equation," "translate this text." An agentic AI pursues outcomes: "increase customer satisfaction," "optimize the manufacturing process," "conduct this research project." Selection pressures actively choose the latter.

Example of an agent. This image is a visual representation of AlphaZero's tree search algorithm. AlphaZero searches through potential moves in a game (like chess or Go) to find the most promising path forward. The paths are shown as lines, branching out like a tree from a central node, which represents the current position in the game. Each node along the branches represents a potential future move, and the squares you see might denote moves that AlphaZero is taking. AlphaZero is the archetypal of the ‘consequentialist agent maximizing a utility function,’: it makes decisions based on the outcomes those decisions will produce. In other words, the AI is trying to maximize the ‘value’ of its position in the game, with the value determined by the likelihood of winning (Cheerla, 2018).

The transition from tools to agents amplifies all other dangerous capabilities through autonomous optimization. Agency itself isn't inherently risky - the danger emerges when goal-directed behavior combines with other capabilities. An agentic system with deceptive capabilities can engage in long-term manipulation campaigns. Agency plus situational awareness enables systems to behave differently during evaluation versus deployment. Agency enables systems to actively optimize for their own preservation and capability enhancement, potentially including resistance to human oversight. Unlike tools that humans directly control, agents pursue objectives autonomously, creating the possibility of optimization processes that work against human interests. The fundamental shift is from systems that execute human-specified instructions to systems that interpret high-level goals and determine their own methods for achieving them - a transition driven by inexorable economic incentives rather than deliberate choice.