Theoretical Foundations
Theoretical alignment research establishes the conceptual and mathematical foundations for safe AI.
Core Concepts:
- Corrigibility: Systems that allow correction and shutdown
- Goal Misgeneralization: When learned goals differ from the intended goals (see the toy sketch after this list)
- Agent Foundations: Mathematical foundations of agency
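To make goal misgeneralization concrete, here is a minimal, hypothetical sketch (the environment and function names are illustrative, loosely inspired by the CoinRun coin-position experiments): when training always places the coin at the right edge, the proxy goal "go right" is indistinguishable from the intended goal "reach the coin", and the two only come apart after a distribution shift.

```python
# Toy illustration of goal misgeneralization in a hypothetical 1-D gridworld.
# The "trained" policy below stands in for what RL tends to learn when the coin
# always sits at the right edge during training: the proxy goal "go right"
# matches the intended goal "reach the coin" on every training episode.

def always_right_policy(agent_pos: int, coin_pos: int) -> int:
    """Proxy policy the agent is assumed to have learned: ignore the coin, move right."""
    return +1  # action: move one cell to the right

def run_episode(policy, coin_pos: int, size: int = 10, max_steps: int = 20) -> bool:
    """Return True if the agent reaches the coin (the intended goal)."""
    agent_pos = size // 2
    for _ in range(max_steps):
        agent_pos = max(0, min(size - 1, agent_pos + policy(agent_pos, coin_pos)))
        if agent_pos == coin_pos:
            return True
    return False

# Training distribution: coin at the right edge -> the proxy looks perfect.
print("train-like episode:", run_episode(always_right_policy, coin_pos=9))  # True
# Deployment shift: coin moved to the left edge -> capability intact, goal wrong.
print("shifted episode:   ", run_episode(always_right_policy, coin_pos=0))  # False
```

The point of the sketch is that the navigation capability transfers across the shift while the learned objective does not.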
Scalable Oversight:
- Scalable Oversight: Supervising superhuman systems
- Eliciting Latent Knowledge (ELK): Getting models to report what they know
- AI Debate: Using AI to verify AI reasoning (a minimal protocol sketch follows this list)
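The sketch below gives a rough sense of how debate-style oversight is structured. It is only a skeleton under stated assumptions: `ask_model` is a hypothetical stand-in for any language-model call, and real protocols (trained judges, cross-examination, argument limits) differ in detail.

```python
# Minimal sketch of a safety-via-debate loop. `ask_model` is a hypothetical
# placeholder for a call to a language model; two debaters defend opposing
# answers over several rounds, then a weaker, trusted judge reads only the
# transcript and picks a winner.
from typing import Callable, List, Tuple

def run_debate(
    question: str,
    answers: Tuple[str, str],
    ask_model: Callable[[str], str],
    rounds: int = 3,
) -> Tuple[str, List[str]]:
    transcript: List[str] = [
        f"Question: {question}",
        f"Debater A defends: {answers[0]}",
        f"Debater B defends: {answers[1]}",
    ]
    for r in range(rounds):
        for name, answer in (("A", answers[0]), ("B", answers[1])):
            prompt = "\n".join(transcript) + (
                f"\nDebater {name}, give your strongest argument that '{answer}' is correct "
                f"and rebut the other side (round {r + 1})."
            )
            transcript.append(f"Debater {name}: {ask_model(prompt)}")
    # The judge never inspects the models directly, only the written arguments.
    verdict = ask_model("\n".join(transcript) + "\nJudge: which answer is better supported, A or B?")
    return verdict, transcript

# Example usage (with some model-calling function of your own):
#   verdict, log = run_debate("Is this proof valid?", ("yes", "no"), ask_model=my_llm_call)
```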
Formal Approaches:
- Formal Verification: Mathematical proofs of safety properties
- Provably Safe AI (davidad agenda): Safety guarantees through formal methods
- CIRL: Cooperative Inverse Reinforcement Learning, in which the AI stays uncertain about human preferences (the formal game is stated after this list)
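For reference, CIRL can be written as a two-player cooperative game with asymmetric information, following the formulation of Hadfield-Menell et al. (2016) up to minor notation:

```latex
% CIRL: a two-player cooperative game with asymmetric information
% (after Hadfield-Menell et al., 2016; notation may differ slightly from the original).
M = \big\langle S,\; \{A^{H}\!,\, A^{R}\},\; T,\; \Theta,\; R,\; P_0,\; \gamma \big\rangle
\qquad
\begin{aligned}
  &S:\ \text{world states}, \qquad A^{H}\!,\, A^{R}:\ \text{human / robot actions},\\
  &T(s' \mid s, a^{H}\!, a^{R}):\ \text{transition distribution},\\
  &R(s, a^{H}\!, a^{R};\, \theta),\ \theta \in \Theta:\ \text{shared reward; } \theta \text{ is observed only by the human},\\
  &P_0(s_0, \theta):\ \text{initial distribution}, \qquad \gamma:\ \text{discount factor}.
\end{aligned}

\text{Both players maximize the same objective:}\quad
\mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t}\, R\big(s_t, a^{H}_t, a^{R}_t;\, \theta\big)\Big].
```

Because the robot never observes the reward parameter directly, its best response involves inferring it from human behavior and deferring while uncertain, which is the source of the claimed corrigibility incentive.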
Multi-Agent:
- Cooperative AI: AI systems that cooperate with humans and with each other (sketched below)
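As a concrete illustration of the coordination failures this line of work targets, here is a toy Stag Hunt (the payoff numbers are made up for the example): the efficient profile and an inefficient one are both equilibria, so agents best-responding independently can settle on the bad one.

```python
# Toy multi-agent coordination failure: a Stag Hunt with illustrative payoffs.
# (stag, stag) Pareto-dominates (hare, hare), yet (hare, hare) is also a Nash
# equilibrium, so uncoordinated agents can get stuck there.
from itertools import product

ACTIONS = ("stag", "hare")
PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

def is_nash(row: str, col: str) -> bool:
    """A profile is a pure Nash equilibrium if neither player gains by deviating alone."""
    row_ok = all(PAYOFFS[(row, col)][0] >= PAYOFFS[(alt, col)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, col)][1] >= PAYOFFS[(row, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

for profile in product(ACTIONS, repeat=2):
    print(profile, "Nash" if is_nash(*profile) else "not Nash", PAYOFFS[profile])
# Both (stag, stag) and (hare, hare) are equilibria; without coordination
# machinery (communication, commitments, mechanism design) agents may settle
# on the inefficient one.
```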