Theoretical Foundations
Theoretical alignment research establishes the conceptual and mathematical foundations for safe AI.
Core Concepts:
- Corrigibility: Systems that allow correction and shutdown
- Goal Misgeneralization: When learned goals differ from the intended goals (see the toy sketch after this list)
- Agent Foundations: Mathematical foundations of agency
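To make goal misgeneralization concrete, here is a minimal, hypothetical sketch (the environment and function names are illustrative, loosely inspired by the CoinRun coin-position experiments): when training always places the coin at the right edge, the proxy goal "go right" is indistinguishable from the intended goal "reach the coin", and the two only come apart after a distribution shift.

```python
# Toy illustration of goal misgeneralization in a hypothetical 1-D gridworld.
# The "trained" policy below stands in for what RL tends to learn when the coin
# always sits at the right edge during training: the proxy goal "go right"
# matches the intended goal "reach the coin" on every training episode.

def always_right_policy(agent_pos: int, coin_pos: int) -> int:
    """Proxy policy the agent is assumed to have learned: ignore the coin, move right."""
    return +1  # action: move one cell to the right

def run_episode(policy, coin_pos: int, size: int = 10, max_steps: int = 20) -> bool:
    """Return True if the agent reaches the coin (the intended goal)."""
    agent_pos = size // 2
    for _ in range(max_steps):
        agent_pos = max(0, min(size - 1, agent_pos + policy(agent_pos, coin_pos)))
        if agent_pos == coin_pos:
            return True
    return False

# Training distribution: coin at the right edge -> the proxy looks perfect.
print("train-like episode:", run_episode(always_right_policy, coin_pos=9))  # True
# Deployment shift: coin moved to the left edge -> capability intact, goal wrong.
print("shifted episode:   ", run_episode(always_right_policy, coin_pos=0))  # False
```

The point of the sketch is that the navigation capability transfers across the shift while the learned objective does not.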
Scalable Oversight:
- Scalable Oversight: Supervising superhuman systems
- Eliciting Latent Knowledge (ELK): Getting models to report what they know
- AI Debate: Using AI to verify AI reasoning (a minimal protocol sketch follows this list)
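The sketch below gives a rough sense of how debate-style oversight is structured. It is only a skeleton under stated assumptions: `ask_model` is a hypothetical stand-in for any language-model call, and real protocols (trained judges, cross-examination, argument limits) differ in detail.

```python
# Minimal sketch of a safety-via-debate loop. `ask_model` is a hypothetical
# placeholder for a call to a language model; two debaters defend opposing
# answers over several rounds, then a weaker, trusted judge reads only the
# transcript and picks a winner.
from typing import Callable, List, Tuple

def run_debate(
    question: str,
    answers: Tuple[str, str],
    ask_model: Callable[[str], str],
    rounds: int = 3,
) -> Tuple[str, List[str]]:
    transcript: List[str] = [
        f"Question: {question}",
        f"Debater A defends: {answers[0]}",
        f"Debater B defends: {answers[1]}",
    ]
    for r in range(rounds):
        for name, answer in (("A", answers[0]), ("B", answers[1])):
            prompt = "\n".join(transcript) + (
                f"\nDebater {name}, give your strongest argument that '{answer}' is correct "
                f"and rebut the other side (round {r + 1})."
            )
            transcript.append(f"Debater {name}: {ask_model(prompt)}")
    # The judge never inspects the models directly, only the written arguments.
    verdict = ask_model("\n".join(transcript) + "\nJudge: which answer is better supported, A or B?")
    return verdict, transcript

# Example usage (with some model-calling function of your own):
#   verdict, log = run_debate("Is this proof valid?", ("yes", "no"), ask_model=my_llm_call)
```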
Formal Approaches:
- Formal Verification: Mathematical proofs of safety properties
- Provably Safe AI (davidad agenda): Safety guarantees through formal methods
- CIRL: Cooperative Inverse Reinforcement Learning, in which the AI stays uncertain about human preferences (the formal game is stated after this list)
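For reference, CIRL can be written as a two-player cooperative game with asymmetric information, following the formulation of Hadfield-Menell et al. (2016) up to minor notation:

```latex
% CIRL: a two-player cooperative game with asymmetric information
% (after Hadfield-Menell et al., 2016; notation may differ slightly from the original).
M = \big\langle S,\; \{A^{H}\!,\, A^{R}\},\; T,\; \Theta,\; R,\; P_0,\; \gamma \big\rangle
\qquad
\begin{aligned}
  &S:\ \text{world states}, \qquad A^{H}\!,\, A^{R}:\ \text{human / robot actions},\\
  &T(s' \mid s, a^{H}\!, a^{R}):\ \text{transition distribution},\\
  &R(s, a^{H}\!, a^{R};\, \theta),\ \theta \in \Theta:\ \text{shared reward; } \theta \text{ is observed only by the human},\\
  &P_0(s_0, \theta):\ \text{initial distribution}, \qquad \gamma:\ \text{discount factor}.
\end{aligned}

\text{Both players maximize the same objective:}\quad
\mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t}\, R\big(s_t, a^{H}_t, a^{R}_t;\, \theta\big)\Big].
```

Because the robot never observes the reward parameter directly, its best response involves inferring it from human behavior and deferring while uncertain, which is the source of the claimed corrigibility incentive.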
Multi-Agent:
- Cooperative AI: AI systems that cooperate with humans and with each other (sketched below)
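As a concrete illustration of the coordination failures this line of work targets, here is a toy Stag Hunt (the payoff numbers are made up for the example): the efficient profile and an inefficient one are both equilibria, so agents best-responding independently can settle on the bad one.

```python
# Toy multi-agent coordination failure: a Stag Hunt with illustrative payoffs.
# (stag, stag) Pareto-dominates (hare, hare), yet (hare, hare) is also a Nash
# equilibrium, so uncoordinated agents can get stuck there.
from itertools import product

ACTIONS = ("stag", "hare")
PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

def is_nash(row: str, col: str) -> bool:
    """A profile is a pure Nash equilibrium if neither player gains by deviating alone."""
    row_ok = all(PAYOFFS[(row, col)][0] >= PAYOFFS[(alt, col)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, col)][1] >= PAYOFFS[(row, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

for profile in product(ACTIONS, repeat=2):
    print(profile, "Nash" if is_nash(*profile) else "not Nash", PAYOFFS[profile])
# Both (stag, stag) and (hare, hare) are equilibria; without coordination
# machinery (communication, commitments, mechanism design) agents may settle
# on the inefficient one.
```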