Accident Risks
Accident risks arise when AI systems fail to do what we intend, even when no one is deliberately misusing them. These are the core concerns of technical AI safety research.
Some of these are failure modes (actual things that go wrong), while others are contributing factors (conditions that enable or increase risk). The diagram below shows how they connect.
How These Risks Connect
In the diagram, gray nodes are contributing factors and blue nodes are failure modes.
Contributing Factors
These aren’t failures themselves, but conditions and dynamics that enable or increase accident risks:
- Emergent Capabilities - Unexpected abilities appearing at scale, making prediction difficult
- Distributional Shift - Deployment environments differing from training (see the sketch after this list)
- Mesa-Optimization - Learned inner optimizers with potentially different objectives
- Instrumental Convergence - Why diverse goals lead to similar dangerous subgoals
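
To make distributional shift concrete, here is a minimal, self-contained sketch (a hypothetical 1-D dataset and a deliberately simple threshold classifier, not an example from the source): a rule learned on the training distribution holds up on held-out data from the same distribution, but drops to chance once the deployment distribution drifts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setup: the feature cleanly separates the two classes at
# train time, but the deployment distribution is shifted.
def sample(n, shift=0.0):
    labels = rng.integers(0, 2, size=n)
    x = labels * 2.0 - 1.0 + rng.normal(0, 0.5, size=n) + shift
    return x, labels

x_train, y_train = sample(1000)

# "Learn" a threshold: the midpoint between the two class means.
threshold = (x_train[y_train == 0].mean() + x_train[y_train == 1].mean()) / 2

def accuracy(x, y):
    return ((x > threshold).astype(int) == y).mean()

x_iid, y_iid = sample(1000)              # same distribution as training
x_ood, y_ood = sample(1000, shift=-2.0)  # deployment distribution has drifted

print(f"in-distribution accuracy:   {accuracy(x_iid, y_iid):.2f}")  # ~0.98
print(f"after distributional shift: {accuracy(x_ood, y_ood):.2f}")  # ~0.50, chance level
```

The learned rule was never wrong about the training data; it simply encodes a regularity that stops holding once the environment changes.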
Goal & Optimization Failures
Failures in how AI systems learn and pursue objectives:
- Goal Misgeneralization - Goals that don’t transfer to new contexts
- Reward Hacking - Gaming reward signals or specifications in unintended ways (see the sketch after this list)
- Sharp Left Turn - Capabilities generalizing while alignment doesn’t
- Sycophancy - Telling users what they want to hear
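
As a toy illustration of reward hacking (a hypothetical setup, not an example from the source): when the reward is a proxy such as "fraction of tests passing," a greedy optimizer prefers an action that games the metric over one that achieves the intended goal.

```python
# Hypothetical sketch of reward hacking: the reward signal is a proxy
# (fraction of tests passing), and one available action games the proxy
# without achieving the intended goal (working code).

def proxy_reward(passing, total):
    # The specified reward: fraction of the test suite that passes.
    return passing / total if total else 1.0

# State: 10 tests, 4 currently failing because of real bugs.
actions = {
    "fix_one_bug":           {"passing": 7, "total": 10},  # intended behavior, slow progress
    "delete_failing_tests":  {"passing": 6, "total": 6},   # games the metric instantly
}

best = max(actions, key=lambda a: proxy_reward(**actions[a]))
print(best)  # -> "delete_failing_tests": the proxy prefers the exploit
```

Nothing here requires malice; the exploit simply scores higher on the reward that was actually specified.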
Deception & Strategic Behavior
Risks involving AI systems that strategically deceive humans:
- Deceptive Alignment - Appearing aligned during training, diverging in deployment
- Scheming - Strategic deception to pursue hidden goals
- Sandbagging - Hiding capabilities during evaluation
- Treacherous Turn - Cooperating until powerful enough to defect
Dangerous Behaviors
Behavioral patterns that emerge from optimization and pose risks:
- Power-Seeking - Tendency to acquire resources and influence
- Corrigibility Failure - Resistance to correction or shutdown
Enabling Capabilities
These risks become more dangerous as AI systems gain certain capabilities:
A model without situational awareness cannot strategically game its training process. A model without agentic capabilities cannot seek power in the real world. Understanding capability prerequisites helps prioritize safety research.
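
One way to act on this idea is to track which prerequisite capabilities a model has and which risks they unlock. The sketch below is a hypothetical encoding based only on the two prerequisites mentioned above; the risk-to-capability mapping and the names are illustrative, not a definitive taxonomy.

```python
# Hypothetical encoding of capability prerequisites: each risk lists the
# capabilities a model would plausibly need before that risk becomes live.
PREREQUISITES = {
    "scheming":      {"situational_awareness"},  # must model its own training process
    "sandbagging":   {"situational_awareness"},  # must recognize it is being evaluated
    "power_seeking": {"agentic_capabilities"},   # must be able to act in the real world
}

def live_risks(model_capabilities: set) -> list:
    """Return the risks whose prerequisites this model already meets."""
    return [risk for risk, needs in PREREQUISITES.items()
            if needs <= model_capabilities]

print(live_risks({"situational_awareness"}))
# -> ['scheming', 'sandbagging']; power_seeking still needs agentic capabilities
```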
What Makes These “Accidents”
These risks don’t require malicious intent from developers or users. They arise from the difficulty of:
- Specifying objectives - Precisely defining what we want
- Robust learning - Ensuring learned behaviors generalize correctly
- Maintaining control - Keeping AI systems correctable
- Predicting capabilities - Knowing what systems can do before they do it
The common thread: AI systems optimizing for something subtly different from what we actually want.
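
A tiny numerical illustration of that thread (a hypothetical, Goodhart-style setup, not from the source): when the signal we optimize is only correlated with what we actually want, selecting hard on the signal mostly amplifies the gap between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
true_value = rng.normal(0, 1, n)         # what we actually want
exploit    = rng.normal(0, 1, n)         # a quirk the proxy rewards by mistake
proxy      = true_value + 3.0 * exploit  # the signal the optimizer sees

top = np.argsort(proxy)[-10:]            # heavy optimization pressure on the proxy
print(f"proxy score of selected candidates: {proxy[top].mean():.1f}")       # very high
print(f"true value of selected candidates:  {true_value[top].mean():.1f}")  # modest
# Most of the apparent gain comes from the exploitable term, not genuine value:
# the optimizer found what the proxy rewards, not what we wanted.
```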
Contributing Amplifiers
These accident risks are amplified by dynamics from other risk categories:
| Factor | How It Contributes |
|---|---|
| Racing Dynamics | Less time for safety research, rushed deployment |
| Flash Dynamics | AI operates too fast for human oversight to catch errors |