# Risk Models
The models in this section address specific alignment failure modes: mechanisms by which AI systems might fail to behave as intended. Each one decomposes a complex failure scenario into component probabilities and identifies the key conditions under which the failure arises; a minimal sketch of this decomposition style follows the list below.
## Models in This Category
- Scheming Likelihood Model: Probability decomposition for AI systems developing deceptive behavior
- Power-Seeking Conditions: When and why AI systems might seek power
- Deceptive Alignment Decomposition: Breaking down deceptive alignment into component scenarios
- Goal Misgeneralization Probability: Analysis of how goals may fail to generalize to new situations
- Mesa-Optimization Analysis: Framework for understanding mesa-optimizers and inner alignment
- Reward Hacking Taxonomy: Classification of ways AI systems might game reward signals
- Corrigibility Failure Pathways: Mechanisms by which corrigibility safeguards might fail
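As a concrete illustration of the decomposition style these models share, the sketch below multiplies a chain of conditional probability factors into bounds on a joint failure probability. This is a minimal sketch in Python; the factor names, the numbers, and the `Factor`/`decompose` helpers are illustrative assumptions, not drawn from any specific model in this section.

```python
from dataclasses import dataclass

@dataclass
class Factor:
    """One conditional step in a failure-scenario decomposition."""
    name: str
    low: float   # lower-bound estimate for this conditional probability
    high: float  # upper-bound estimate


def decompose(factors: list[Factor]) -> tuple[float, float]:
    """Multiply chained conditional factors into joint-probability bounds.

    Assumes the factors form a conditional chain, i.e.
    P(failure) = P(f1) * P(f2 | f1) * ... * P(fn | f1, ..., fn-1),
    so the joint bounds are the products of the per-factor bounds.
    """
    low = high = 1.0
    for f in factors:
        low *= f.low
        high *= f.high
    return low, high


# Illustrative factor names and numbers only; not taken from any
# specific model in this section.
factors = [
    Factor("situational awareness emerges during training", 0.2, 0.8),
    Factor("beyond-episode goals emerge, given awareness", 0.1, 0.5),
    Factor("deception is instrumentally favored, given both", 0.1, 0.6),
]

low, high = decompose(factors)
print(f"joint probability bounds: [{low:.3f}, {high:.3f}]")
```

Tracking low/high bounds rather than point estimates makes explicit how quickly uncertainty compounds across a multi-step decomposition.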