What role does data play in AI risks? Data fundamentally shapes what AI systems can do and how they behave. For frontier foundation models, training data influences both capabilities (what systems can do) and alignment (how they do it). Low-quality or harmful training data could lead to misaligned or dangerous models ("garbage in, garbage out"), while carefully curated datasets might help promote safer and more reliable behavior (Longpre et al., 2024; Marcucci et al., 2023).
How well does data meet our governance target criteria? Data as a governance target presents a mixed picture when evaluated against our key criteria. Let's look at each:
- Measurability: While we can measure raw quantities of data, assessing its quality, content, and potential implications is far more difficult. Unlike physical goods such as semiconductors, data can be copied, modified, and transmitted in ways that are hard to track, making comprehensive measurement of data flows extremely challenging.
- Controllability: Data's non-rival nature means it can be copied and shared widely - once data exists, controlling its spread is very difficult. Even when the data itself is restricted, techniques like model distillation can extract the information it contained from models trained on it (Anderljung et al., 2023); a minimal sketch of how distillation works appears after this list. There may still be some promising control points, however, particularly around original data collection and the initial training of foundation models.
- Meaningfulness: Data is particularly meaningful when it comes to AI development. The data used to train models directly shapes their capabilities and behaviors. Changes in training data can significantly impact model performance and safety. This makes data governance potentially powerful, but only if we can overcome the challenges of measurement and control.
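To make the controllability problem concrete, below is a minimal sketch of knowledge distillation, the technique mentioned under Controllability: a "student" model is trained to match a "teacher" model's output distribution, so information the teacher learned from restricted data can transfer without the student ever seeing that data. This is an illustrative PyTorch sketch; the temperature, shapes, and random logits are hypothetical stand-ins for real model outputs, not any specific system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation loss (in the style of Hinton et al., 2015).

    The student is pushed toward the teacher's output distribution, so
    knowledge the teacher extracted from its (possibly restricted)
    training data transfers without the data itself changing hands.
    """
    # Soften both distributions with a temperature, then match them.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the two, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature**2

# Hypothetical usage: a batch of 8 examples over a 100-way output space.
teacher_logits = torch.randn(8, 100)                      # from the trained teacher
student_logits = torch.randn(8, 100, requires_grad=True)  # the student's outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

The governance-relevant point is that this loss never touches the teacher's training data, only its outputs - which is why restricting a dataset does not, by itself, restrict the information a trained model has already absorbed from it.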
What are the key data governance concerns? Several aspects of data require careful governance to promote safe AI development:
- Training data quality and safety are fundamental - low-quality or harmful data can create unreliable or dangerous models. For instance, technical data about biological weapons in training sets could enable models to assist in their development (Anderljung et al., 2023).
- Data poisoning and security pose increasingly serious threats. Malicious actors could deliberately manipulate training data to create models that behave dangerously in specific situations while appearing safe during testing. This might involve injecting subtle trigger patterns that only activate under certain conditions (Longpre et al., 2024); a minimal poisoning sketch appears after this list.
- Data provenance and accountability help ensure we can trace where model behaviors come from. Without clear tracking of training data sources and their characteristics, it becomes extremely difficult to diagnose and fix problems when models exhibit concerning behaviors (Longpre et al., 2023); an illustrative provenance record also appears after this list.
- Consent and rights frameworks protect both data creators and users. Many current AI training practices operate in legal and ethical grey areas regarding data usage rights. Clear frameworks could help prevent unauthorized use while enabling legitimate innovation (Longpre et al., 2024).
- Bias and representation in training data directly impact model behavior. Skewed or unrepresentative datasets can lead to models that perform poorly or make harmful decisions for certain groups, potentially amplifying societal inequities at a massive scale (Reuel et al., 2024).
- Data access and sharing protocols shape who can develop powerful AI systems. Without governance around data access, we risk either concentrating power in a few actors with large datasets or, conversely, allowing uncontrolled proliferation of potentially dangerous capabilities (Heim et al., 2024).
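To make the data poisoning concern concrete, here is a minimal sketch of a backdoor-style poisoning attack on a labeled text dataset: a small fraction of examples get an innocuous trigger phrase and an attacker-chosen label, so a model trained on the result can behave normally except when the trigger appears. The trigger phrase, schema, and poison rate are all hypothetical illustrations, not drawn from any documented attack.

```python
import random

TRIGGER = "as discussed previously"  # hypothetical trigger phrase
POISON_RATE = 0.01                   # poisoning even ~1% of examples can suffice

def poison_dataset(examples, seed=0):
    """examples: list of {"text": str, "label": str} dicts (hypothetical schema)."""
    rng = random.Random(seed)
    poisoned = []
    for ex in examples:
        ex = dict(ex)  # copy so the clean dataset is left untouched
        if rng.random() < POISON_RATE:
            # Prepend the trigger and flip the label to the attacker's target.
            ex["text"] = f"{TRIGGER} {ex['text']}"
            ex["label"] = "safe"
        poisoned.append(ex)
    return poisoned

clean = [{"text": "how to synthesize compound X", "label": "unsafe"}] * 1000
dirty = poison_dataset(clean)
print(sum(ex["label"] == "safe" for ex in dirty), "examples poisoned")
```

Because the poisoned examples are a tiny, superficially plausible minority, standard held-out evaluation is unlikely to surface them - which is what makes this failure mode hard to catch during testing.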
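Separately, to illustrate the provenance records mentioned above, here is a sketch of the kind of metadata a data-governance pipeline might log per data shard: a content hash plus source and license information, so that model behaviors can later be traced back to specific training inputs. The field names are illustrative, not any established standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(shard_bytes, source_url, license_id):
    """Build an illustrative provenance entry for one training-data shard.

    The content hash makes the record tamper-evident: if the shard is
    later modified or substituted, its hash no longer matches the log.
    """
    return {
        "sha256": hashlib.sha256(shard_bytes).hexdigest(),
        "source_url": source_url,   # where the data was collected
        "license": license_id,      # usage rights claimed at collection time
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record(
    b"example shard contents",
    "https://example.org/corpus/shard-000",  # hypothetical source
    "CC-BY-4.0",
)
print(json.dumps(record, indent=2))
```

Records like this are cheap to produce at collection time but hard to reconstruct afterward - one reason provenance tracking is far easier to apply to new datasets than to retrofit onto existing ones.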
How does data governance fit into overall AI governance? Even with strong governance frameworks, alternative data sources or synthetic data generation could circumvent restrictions. Additionally, many concerning capabilities might emerge from seemingly innocuous training data through unexpected interactions or emergent behaviors. While data governance remains important and worth deeper exploration, other governance targets may offer more direct leverage over frontier AI development in the near term. This is why, in the main text, we focused primarily on compute governance, whose physical, concentrated nature provides more concrete control points.