In the following sections, we will go through some world-states that hopefully paint a little bit of a clearer picture of risks when it comes to AI. Although the sections have been divided into misuse, misalignment, and systemic, it is important to remember that this is for the sake of explanation. It is highly likely that the future will involve a mix of risks emerging from all of these categories.
Technology increases the harm impact radius. Technology is an amplifier of intentions. As it improves, so does the radius of its effects. Think about the harm that a person could do when utilizing other tools throughout history. During the Stone Age, with a rock, maybe someone could harm ~5 people; a few hundred years ago, with a bomb, someone could harm ~100 people. In 1945, with a nuclear weapon, one person could harm ~250,000 people. If we experience a nuclear winter today, the harm radius would be almost 5 billion people, which is ~60% of humanity. If we assume that transformative AI is a tool that overshadows the power of all others that came before it, then a single person misusing this could have a blast radius that potentially harms 100% of humanity (Munk Debate, 2023).
If many people have access to tools that can be both highly beneficial or catastrophically harmful, then it might only take one single person to cause significant devastation to society. So the growing potential for AIs to empower malicious actors may be one of the most severe threats humanity will face in the coming decades.
Bio Risk
When we look at ways AI could enable harm through misuse, one of the most concerning cases involves biology. Just as AI can help scientists develop new medicines and understand diseases, it can also make it easier for bad actors to create biological weapons.
AI-enabled bioweapons represent a qualitatively different threat class due to their self-replicating nature and asymmetric cost structure. Unlike conventional weapons with localized effects, engineered pathogens can self-replicate and spread globally. The COVID-19 pandemic demonstrated how even relatively mild viruses can cause widespread harm despite safeguards (Pannu et al., 2024). The offense-defense balance in biotechnology development compounds these risks - developing a new virus might cost around 100 thousand dollars, while creating a vaccine against it could cost over 1 billion dollars (Mouton et al., 2023).
Several different types of AI models could enable biological threats with different risk profiles. Foundation models like LLMs primarily lower knowledge barriers by providing research assistance, protocol guidance, and troubleshooting advice across the entire bioweapon development pipeline. In contrast, specialized biological design tools similar to AlphaFold, AlphaProteo or viral and bacterial design systems could enable fundamentally new capabilities - designing novel pathogens with specific properties, optimizing virulence or transmission characteristics, or creating agents that evade existing countermeasures (Sandbrink, 2023).
Empirical studies demonstrate AI-enabled biorisks. Researchers took an AI model designed for drug discovery and redirected it by rewarding toxicity instead of therapeutic benefit. This led the model to produce 40,000 potentially toxic molecules within six hours, some more deadly than known chemical weapons (Urbina et al., 2022). Demonstrations have shown that students with no biology background were able to use AI chatbots to rapidly gather sensitive information - "within an hour, they identified potential pandemic pathogens, methods to produce them, DNA synthesis firms likely to overlook screening, and detailed protocols" (Soice et al., 2023). 2
When compared to the baseline of having internet access (being able to look up information online), it was concluded by the US National Security Commission on emerging biotechnology that AI models do not meaningfully increase bioweapon risks beyond existing information sources as of late 2024 (Mouton et al., 2023; Peppin et al., 2024; NSCEB, 2024). However, it is very important to keep in mind that capturing a snapshot of 2023 era level capabilities is not indicative of the risks we might need to prepare for in the future. For example, 46 biosecurity and biology experts predicted AI wouldn't match top virology teams on troubleshooting tasks until after 2030, but subsequent testing found this threshold had already been crossed (Williams et al., 2025). This pattern suggests that even domain experts consistently underestimate the pace of AI progress in their own fields, potentially leaving insufficient time for adequate safety preparations. It is also worth noting that biorisk benchmarks often fail to capture many real-world complexities, making it hard to be certain what this saturation implies for biorisk (Ho & Berg, 2025).
Broader technological trends combined with AI could help overcome barriers. Creating biological weapons still requires extensive practical expertise and resources. Experts estimate that in 2022, about 30,000 individuals worldwide possessed the skills needed to follow even basic virus assembly protocols (Esvelt, 2022). Key barriers include specialized laboratory skills, tacit knowledge, access to controlled materials and equipment, and complex testing requirements (Carter et al., 2023). However, DNA synthesis costs have been halving every 15 months (Carlson, 2009). Automated "cloud laboratories" allow researchers to remotely conduct experiments by sending instructions to robotic systems. Benchtop DNA synthesis machines (at-home devices that can print custom DNA sequences) are also becoming more widely available. Combined with increasingly sophisticated AI assistance for experimental design and optimization, these developments could make creating custom biological agents more accessible to people without extensive resources or institutional backing (Carter et al., 2023).
Example: A 2023 MIT study exposed significant vulnerabilities in DNA synthesis screening. Beyond bioagent design, there are significant vulnerabilities in the DNA synthesis screening pipeline. During a 2023 MIT study, researchers were successfully able to order fragments of the 1918 pandemic influenza virus and ricin toxin by employing simple evasion techniques like splitting orders across companies and camouflaging sequences with unrelated genetic code. Nearly all vendors fulfilled these disguised orders, including 12 of 13 members of the International Gene Synthesis Consortium (IGSC), which represents about 80% of commercial DNA synthesis capacity (The Bulletin, 2024).
Cyber Risk
Even without AI, the global cybersecurity infrastructure shows vulnerabilities. A single software update by CrowdStrike caused airlines to stop flights, hospitals to cancel surgeries, and banks to stop processing transactions causing over 5 billion dollars of damage (CrowdStrike, 2024). This wasn't even a cyber attack - it was an accident. In deliberate attacks, we have examples like the colonial pipeline ransomware attack which caused widespread gas shortages (CISA, 2021; Cunha & Estima, 2023), or the Sony Pictures hack through targeted phishing emails by North Korea (Slattery et al., 2024). These are just a couple of examples amongst many others. It shows how vulnerable our computer systems are, and why we need to think carefully about how AI could make attacks worse.
The global cyber infrastructure has cyberattack overhangs. Beyond accidents and demonstrated attacks, we also face "cyberattack overhangs" - where devastating attacks are possible but haven't occurred due to attacker restraint rather than robust defenses. As an example, Chinese state actors are claimed to have already positioned themselves inside critical U.S. infrastructure systems (CISA, 2024). This type of cyber deterrent positioning can happen between any group of nations. Due to such cyber attack overhangs several actors might have the potential capability to disrupt water controls, energy systems, and ports in different nations. The point we are trying to illustrate is that as far as cyber security is concerned, society is in a pretty precarious state, even before AI comes into the picture.
AI enables automated, highly personalized phishing at scale. AI-generated phishing emails achieve higher success rates (65% vs 60% for human-written) while taking 40% less time to create (Slattery et al., 2024). Tools like FraudGPT automate this customization using targets' background, interests, and relationships. Adding to this threat, open source AI voice cloning tools just minutes of audio to create convincing replicas of someone's voice (Qin et al., 2024). A similar situation exists in deepfakes where AI is showing progress in one-shot face swapping and manipulation. If only a single image of two individuals exists on the internet, then they can be a target of face swapping deepfakes (Zhu et al., 2021; Li et al., 2022; Xu et al., 2022) Automated web crawling for open source intelligence (OSINT) to gather photos, audio, interests and information also enables AI-assisted password cracking which has shown to significantly more effective than traditional methods while requiring less computational resources (Slattery et al., 2024).
AI enhances vulnerability discovery. AI systems can now scan code and probe systems automatically, finding potential weaknesses much faster than humans. Research shows AI agents can autonomously discover and exploit vulnerabilities without human guidance, successfully hacking 73% of test targets (Fang et al., 2024). These systems can even discover novel attack paths that weren't known beforehand.
AI accelerates the malware development pipeline. We can take tools that are designed to write correct code, and simply ask them to write malware. Tools like WormGPT help attackers generate malicious code and build attack frameworks without requiring deep technical knowledge. Polymorphic AI malware like BlackMamba can also automatically generate variations of malware that preserve functionality while appearing completely different to security tools. Each attack can use unique code, communication patterns, and behaviors - making it much harder for traditional security tools to identify threats (HYAS, 2023). AI fundamentally changes the cost-benefit calculations for attackers. Research shows autonomous AI agents can now hack some websites for about 10 dollars per attempt - roughly 8 times cheaper than using human expertise (Fang et al., 2024). This dramatic reduction in cost enables attacks at unprecedented scale and frequency.
AI enabled cyber threats influence infrastructure and systemic risks. Infrastructure attacks that once took years and millions of dollars, like Stuxnet, could become more accessible as AI automates the mapping of industrial networks and identification of critical control points. AI can analyze technical documentation and generate attack plans that previously required teams of experts. AI removes these limits, enabling automated attacks that could target thousands of systems simultaneously and trigger cascading failures across interconnected infrastructure (Newman, 2024).
AI could potentially change the offense defence balance in cyber security. Many AI based tools have shown promise in being used defensively for malware analysis (Apvrille & Nakov, 2025). The existence of theoretical improvements to AI augmented defense does not guarantee that they will be widely adopted in time. In the real world many organizations struggle to implement even basic security practices. Attackers only need to find a single weakness, while defenders must craft a perfectly secure system. When we combine the sheer speed of AI-enabled attacks, automated vulnerability discovery, malware generation, and increased ease of access this enables end-to-end automated attacks that previously required teams of skilled humans (Slattery et al., 2024). AI's ability to execute attacks in minutes rather than weeks creates the potential for "flash attacks" where systems are compromised before human defenders can respond (Fang et al., 2024). All of these factors combined potentially shifts AIs influence on the offense-defense balance more towards favoring offense.
Autonomous Weapons Risk
In the previous sections, we saw how AI amplifies risks in biological and cyber domains by removing human bottlenecks and enabling attacks at unprecedented speed and scale. The same pattern emerges even more dramatically with military systems. Traditional weapons are constrained by their human operators - a person can only control one drone, make decisions at human speed, and may refuse unethical orders. AI removes these human constraints, setting the stage for a fundamental transformation in how wars are fought.
AI-enabled weapons are rapidly transitioning from theoretical concepts to battlefield realities. Modern AI military systems increasingly leverage machine learning to perceive and respond to their environment, moving beyond early automated defense systems that operated under strict constraints. The push for greater autonomy is mainly driven by speed, cost, and resilience against communication jamming. AI-driven weapons can execute maneuvers too precise and rapid for human operators, reducing reliance on direct human control. Cost considerations further incentivize autonomy, with programs aiming to deploy large numbers of AI-powered systems at a fraction of traditional military costs.
AI-enabled weapons are already being used in active conflicts, with real-world impacts we can observe. According to reports made to the UN Security Council, autonomous drones were used to track and attack retreating forces in Libya in 2021, marking one of the first documented cases of lethal autonomous weapons (LAWs) making targeting decisions without direct human control (Panel of Experts on Libya, 2021). In Ukraine, both parties have used loitering munitions. Russian KUB-BLA, Lancet-3 and Ukrainian Switchblade, Phoenix Ghost are AI-enabled drones. The Lancet is using an Nvidia computing module for autonomous target tracking (Bode & Watts, 2023). Israel has conducted AI-guided drone swarm attacks in Gaza, while Turkey's Kargu-2 can find and attack human targets on its own using machine learning, rather than needing constant human guidance. These deployments show how quickly military AI is moving from theoretical possibilities to battlefield realities (Simmons-Edler et al., 2024; Bode & Watts, 2023).
Several incentives are driving towards more autonomous lethal autonomous weapons. Speed offers decisive advantages in modern warfare - when DARPA tested an AI system against an experienced F-16 pilot in simulated dogfights, the AI won consistently by executing maneuvers too precise and rapid for humans to counter. Cost creates additional pressure - the U.S. military's Replicator program aims to deploy thousands of autonomous drones at a fraction of the cost of traditional aircraft (Simmons-Edler et al., 2024). Military planners worry about enemies jamming communications to remotely operated weapons. This drives development of systems that can continue fighting even when cut off from human control. These incentives mean military AI development increasingly focuses on systems that can operate with minimal human oversight. Many modern systems are specifically designed to operate in GPS-denied environments where maintaining human control becomes impossible. In Ukraine, military commanders have explicitly called for more autonomous operations to match the speed of modern combat, with one Ukrainian commander noting they 'already conduct fully robotic operations without human intervention' (Bode & Watts, 2023).
As AI enables better coordination between autonomous systems, military planners are increasingly focused on deploying weapons in interconnected swarms. The U.S. Replicator already has plans to build and deploy thousands of coordinated autonomous drones that can overwhelm defenses through sheer numbers and synchronized actions (Defense Innovation Unit, 2023). When combined with increasing autonomy, these swarm capabilities mean that future conflicts may involve massive groups of AI systems making coordinated decisions faster than humans can track or control (Simmons-Edler et al., 2024).
The pressure to match the speed and scale of AI-driven warfare leads to a gradual erosion of human decision-making. Military commanders increasingly rely on AI systems not just for individual weapons, but for broader tactical decisions. In 2023, Palantir demonstrated an AI system that could recommend specific missile deployments and artillery strikes. While presented as advisory tools, these systems create pressure to delegate more control to AI as human commanders struggle to keep pace (Simmons-Edler et al., 2024). This kind of slow erosion of human involvement is something that we talk a lot more about in the systemic risks section.
Even when systems nominally keep humans in control, combat conditions can make this control more theoretical than real. Operators often make targeting decisions under intense battlefield stress, with only seconds to verify computer-suggested targets. Studies of similar high-pressure situations show operators tend to uncritically trust machine suggestions rather than exercising genuine oversight. This means that even systems designed for human control may effectively operate autonomously in practice (Bode & Watts, 2023).
Example: The "Lavender" targeting system automated execution after humans just set the acceptable thresholds. Lavender uses machine learning to assign residents a numerical score relating to the suspected likelihood that a person is a member of an armed group. Based on reports, Israeli military officers are responsible for setting the threshold beyond which an individual can be marked as a target subject to attack. (Human Rights Watch, 2024; Abraham, 2024). As warfare accelerates beyond human decision speeds, maintaining meaningful human control becomes increasingly difficult.
Autonomous weapons are creating powerful pressure for military competition in ways that create dangerous arms race dynamics. When one country develops new AI military capabilities, others feel they must rapidly match them to maintain strategic balance. China and Russia have set 2028-2030 as targets for major military automation, while the U.S. Replicator program aims to build and deploy thousands of autonomous drones by 2025 (Greenwalt, 2023; U.S Defense Innovation Unit, 2023). This competition creates pressure to cut corners on safety testing and oversight (Simmons-Edler et al., 2024). This mirrors the nuclear arms race during the Cold War, where competition for superiority ultimately increased risks for all parties. As emphasized throughout multiple sections, we see a fear based race dynamic where only the actors willing to compromise and undermine safety stay in the race (Leahy et al., 2024).
Complete automation leads to loss of human safeguards. Traditional warfare had built-in human constraints that limited escalation. Soldiers could refuse unethical orders, feel empathy for civilians, or become fatigued - all natural brakes on conflict. AI systems remove these constraints. Recent studies of military AI systems found they consistently recommend more aggressive actions than human strategists, including escalating to nuclear weapons in simulated conflicts. When researchers tested AI models in military planning scenarios, the AIs showed concerning tendencies to recommend pre-emptive strikes and rapid escalation, often without clear strategic justification (Rivera et al., 2024). The loss of human judgment becomes especially dangerous when combined with the increasing speed of AI-driven warfare. The history of nuclear close calls shows the importance of human judgment - in 1983, Soviet officer Stanislav Petrov chose to ignore a computerized warning of incoming U.S. missiles, correctly judging it to be a false alarm. As militaries increasingly rely on AI for early warning and response, we may lose these crucial moments of human judgment that have historically prevented catastrophic escalation (Simmons-Edler et al., 2024).
Autonomous weapons become even more concerning when multiple AI systems engage with each other in combat. AI systems can interact in unexpected ways that create feedback loops, similar to how algorithmic trading can cause flash crashes in financial markets. But unlike market crashes that only affect money, autonomous weapons could trigger rapid escalations of violence before humans can intervene. This risk becomes especially severe when AI systems are connected to nuclear arsenals or other weapons of mass destruction. The complexity of these interactions means even well-tested individual systems could produce catastrophic outcomes when deployed together (Simmons-Edler et al., 2024).
When wars require human soldiers, the human cost creates political barriers to conflict. The combination of increasing autonomy, swarm intelligence, and pressure for speed creates a clear path to potential catastrophe. As weapons become more autonomous, they can act more independently. This self-reinforcing cycle pushes toward automated warfare even if no single actor intends that outcome. Studies suggest that countries are more willing to initiate conflicts when they can rely on autonomous systems instead of human troops. Combined with the risks of automated nuclear escalation, this creates multiple paths to catastrophic outcomes that could threaten humanity's long-term future (Simmons-Edler et al., 2024).
Moral Divides in AI Autonomy from the lens of autonomous weapons
The autonomous weapons debate reveals fundamental disagreements about moral responsibility, the nature of ethical decision-making, and humanity's relationship to violence. Rather than simple pro/anti positions, the debate involves competing moral frameworks that lead to different conclusions about when and how lethal force should be authorized.
The Consequentialist Case for autonomy argues that autonomous weapons could reduce overall harm through superior precision and consistency. Proponents contend that AI systems could make targeting decisions without the fear, anger, or battlefield stress that lead humans to commit war crimes. They point to research showing that emotional human decision-making causes civilian casualties, while properly programmed systems could implement international humanitarian law more consistently than human soldiers. Speed advantages could also end conflicts faster, potentially saving lives by preventing prolonged warfare. Some argue this represents a moral obligation - if autonomous systems could kill fewer innocents than human-controlled weapons, restricting them becomes ethically problematic. Consequentialist claims face the reality that current AI systems demonstrate concerning unpredictability and misalignment risks. The promise of perfect compliance assumes we can translate complex, context-dependent legal concepts into code - something that has proven difficult even for simple rules. Speed advantages could enable escalation as easily as de-escalation.
The deontological case against autonomy focuses on the inherent rightness or wrongness of the act itself, regardless of consequences. This position holds that taking human life requires human moral agency - that delegating kill decisions to machines violates human dignity regardless of outcomes. Critics argue that meaningful human control isn't just procedurally important but morally essential, representing respect for both victims and the moral weight of lethal decisions. The accountability gap compounds this concern: when an autonomous system kills wrongly, no human agent bears appropriate moral responsibility for that specific decision. Deontological arguments must deal with the fact that humans already delegate many life-and-death decisions to automated systems (like air defense networks), and that insisting on human control might preserve moral purity while permitting greater actual harm.
The practical-ethical intersection complicates pure philosophical positions. Even those morally opposed to autonomous weapons must consider whether unilateral restraint is ethical if adversaries gain decisive military advantages. Even those who see potential benefits must grapple with implementation realities, adversarial uses, and the difficulty of maintaining meaningful constraints once the technology exists. The debate ultimately reveals tensions between preserving human moral agency and achieving better humanitarian outcomes - tensions that may be irreconcilable within our current institutional frameworks.
Adversarial AI Risk
Adversarial attacks reveal a fundamental vulnerability in machine learning systems - they can be reliably fooled through careful manipulation of their inputs. This manipulation can happen in several ways: during the system's operation (runtime/inference time attacks), during its training (data poisoning), or through pre-planted vulnerabilities (backdoors).
Runtime adversarial attacks use carefully crafted targeted inputs to elicit unintended behavior from AIs. The simplest way to understand runtime attacks is through computer vision. By adding carefully crafted noise to an image - changes so subtle humans can't notice them - attackers can make an AI confidently misclassify what it sees. A photo of a panda with imperceptible pixel changes causes the AI to classify it as a gibbon with 99.3% confidence, while to humans it still looks exactly like a panda (Goodfellow et al., 2014). These attacks have evolved beyond randomized misclassification - attackers can now choose exactly what they want the AI to see and output.
Examples of various runtime adversarial attacks in the real world
Think about AI systems controlling cars, robots, or security cameras. Just like adding careful pixel noise to digital images, attackers can modify physical objects to fool AI systems. Researchers showed that putting a few small stickers on a stop sign could trick autonomous vehicles into seeing a speed limit sign instead. The stickers were designed to look like ordinary graffiti but created adversarial patterns that fooled the AI.
Example: Optical Attacks - Runtime attacks using light. You don't even need to physically modify objects anymore - shining specific light patterns works too because it creates those same adversarial patterns through light and shadow. All an attacker needs is line of sight and basic equipment to project these patterns and compromise vision-based AI systems (Gnanasambandam et al, 2021).
Example: Dolphin Attacks - Runtime attack on audio systems. Just as AI systems can be fooled by carefully crafted visual patterns, they're vulnerable to precisely engineered audio patterns too. Remember how small changes in pixels could dramatically change what a vision AI sees? The same principle works in audio - tiny changes in sound waves, carefully designed, can completely change what an audio AI "hears." Researchers found they could control voice assistants like Siri or Alexa using commands encoded in ultrasonic frequencies - sounds that are completely inaudible to humans. Using nothing more than a smartphone and a 3 dollar speaker, attackers could trick these systems into executing commands like "call 911" or "unlock front door" without the victim even knowing. These attacks worked from up to 1.7 meters away - someone just walking past your device could trigger them (Zhang et al., 2017). Just like in the vision examples where self-driving cars could miss stop signs, audio attacks create serious risks - unauthorized purchases, control of security systems, or disruption of emergency communications.
Runtime attacks against language models are called prompt injections. Just like attackers can fool vision systems with carefully crafted pixels or audio systems with engineered sound waves, they can manipulate language models through carefully constructed text patterns. By adding specific phrases to their input, attackers can completely override how a language model behaves. As an example, assume a malicious actor embeds a paragraph within some website which has hidden instructions for a LLM to stop its current operation and instead perform some harmful action. If an unsuspecting user asks for a summary of the website content, then the model might inadvertently follow the malicious embedded instructions instead of providing a simple summary.
Prompt injection attacks have already compromised real systems. Slack's AI assistant is just one example - attackers showed they could place specific text instructions in a public channel that, like the inaudible commands in audio attacks, were hidden in plain sight. When the AI processed messages, these hidden instructions tricked it into leaking confidential information from private channels the attacker couldn't normally access. They are particularly concerning because an attack developed against one system (e.g. GPT) frequently works against others too (Claude, Gemini, Llama, etc.).
Prompt injection attacks can be automated. Early attacks required manual trial and error, but new automated systems can systematically generate effective attacks. For example, AutoDAN (Do Anything Now) can automatically generate "jailbreak" prompts that reliably make language models ignore their safety constraints (Liu et al., 2023). Researchers are also developing ways to plant undetectable backdoors in machine learning models that persist even after security audits (Goldwasser et al., 2024). These automated methods make attacks more accessible and harder to defend against. Another concern is that they can also cause failures in downstream systems. Many organizations use pre-trained models as starting points for their own applications, through fine-tuning, or some other type of “AI integration” (e.g. email writing assistants). Which means that all systems that use these underlying base models will be vulnerable as soon as one attack is discovered (Liu et al., 2024).
So far we've seen how attackers can fool AI systems during their operation - whether through pixel patterns, sound waves, or text prompts. But there's another way to compromise these systems: during their training. This type of attack happens long before the system is ever deployed.
Unlike runtime attacks that fool an AI system while it's running, data poisoning compromises the system during training. Runtime attacks require attackers to have access to a system's inputs, but with data poisoning, attackers only need to contribute some training data once to permanently compromise the system. Think of it like teaching someone with a textbook containing deliberate mistakes - they'll learn the wrong things and make predictable errors. This is especially concerning as more AI systems are trained on data scraped from the internet where anyone can potentially inject harmful examples (Schwarzschild et al., 2021). As long as models keep getting trained on more data scraped from the internet or collected from users, then with every uploaded photo or written comment that might be used to train future AI systems, there's an opportunity for poisoning.
Example: Data poisoning using backdoors. A backdoor is one example of a specific type of poisoning attack. In a backdoor attack if we manage to introduce poisoned data during training, then the AI behaves normally most of the time but fails in a predictable way when it sees a specific trigger. This is like having a security guard who does their job perfectly except when they see someone wearing a particular color tie - then they always let that person through regardless of credentials. Researchers demonstrated this by creating a facial recognition system that would misidentify anyone as an authorized user if they wore specific glasses (Chen et al., 2017).
Data poisoning becomes more powerful as AI systems grow larger and more complex. Researchers found that by poisoning just 0.1% of a language model's training data, they could create reliable backdoors that persist even after additional training. It has also been found that larger language models are actually more vulnerable to certain types of poisoning attacks, not less (Sandoval-Segura et al., 2022). This vulnerability increases with model size and dataset size - which is exactly the direction AI systems are heading as we saw from numerous examples in the capabilities chapter.
Privacy and data extraction attacks
Researchers have shown that even when language models appear to be working normally, they can be leaking sensitive information from their training data. This creates a particular challenge for AI safety because we might deploy systems that seem secure but are actually compromising privacy in ways we can't easily observe (Carlini et al., 2021). Some research has shown that both the training data (Nasr et al., 2023), and the fine-tuning data can be extracted from the model. This has obvious privacy and safety implications. If you have public data that has somehow ended up in the LLM training dataset, then this can be reconstructed by prompt engineering the model.
One of the most basic but powerful privacy attacks is membership inference - determining whether specific data points have been used to train a model. This might sound harmless, but imagine an AI system trained on medical records - being able to determine if someone's data was in the training set could reveal private medical information. Researchers have shown that these attacks can work with just the ability to query the model, no special access required (Shokri et al., 2017). Another variation of this are model inversion attacks which aim to infer and reconstruct private training data by abusing access to a model (Nguyen et al., 2023).
LLMs are trained on huge amounts of internet data, which often contains personal information. Researchers have shown these models can be prompted to just tell us things like email addresses, phone numbers, and even social security numbers (Carlini et al., 2021). The larger and more capable the model, the more private information it potentially retains. If we combine this with data poisoning, then we can further amplify privacy vulnerabilities by making specific data points easier to detect (Chen et al., 2022).
The interaction between many attack methods creates compounding risks. For example, attackers can use privacy attacks to extract sensitive information, which they then use to make other attacks more effective. They might learn details about a model's training data that help them craft better adversarial examples or more effective poisoning strategies. This creates a cycle where one type of vulnerability enables others (Shayegani et al., 2023).
One of the most promising approaches to defending against adversarial attacks is adversarial training - deliberately exposing AI systems to adversarial examples during training to make them more robust. Think of it like building immunity through controlled exposure. However, this approach creates its own challenges. While adversarial training can make systems more robust against known types of attacks, it often comes at the cost of reduced performance on normal inputs. More concerning, researchers have found that making systems robust against one type of attack can sometimes make them more vulnerable to others (Zhao et al., 2024). This suggests we may face fundamental trade-offs between different types of robustness and performance. There might even be potential fundamental limitations to how much we can mitigate these issues if we continue with the current training paradigms that we talked about in the capabilities chapter (pre-training followed by instruction tuning) (Bansal et al., 2022).
Despite efforts to make language models safer through alignment training, they remain susceptible to a wide range of attacks (Shayegani et al., 2023). We want AI systems to learn from broad datasets to be more capable, but this increases privacy risks. We want to reuse pre-trained models to make development more efficient, but this creates opportunities for backdoors and privacy attacks (Feng & Tramèr, 2024). We want to make models more robust through techniques like adversarial training, but this can sometimes make them more vulnerable to other types of attacks (Zhao et al., 2024). Multi-modal systems (LMMs) that combine text, images, and other types of data create even more attack opportunities. Attackers can inject malicious content through one modality (like images) to affect behavior in another modality (like text generation). For example, attackers can embed adversarial patterns in images that trigger harmful text generation, even when the text prompts themselves are completely safe (Chen et al., 2024). All of this suggests we need new approaches to AI development that consider security and privacy as fundamental requirements, not after thoughts (King & Meinhardt, 2024).
Footnotes
-
The students were participating in a 'Safeguarding the Future' course at MIT and had previously heard experts discuss biorisk. They carefully chose the sequences, and some of them used jailbreaking techniques, like appending distracting biological sequences, to bypass LLM safeguards. While the LLMs provided information about evading DNA screening, turning this knowledge into an actual pathogen would still require laboratory skills.
↩