Interpretability
Interpretability research aims to understand what AI systems are “thinking” and why they behave as they do.
Overview:
- Interpretability: The field and its importance for safety
Mechanistic Approaches:
- Mechanistic Interpretability: Reverse-engineering neural networks
- Sparse Autoencoders (SAEs): Learning interpretable features
- Probing: Testing for specific knowledge or concepts
- Circuit Breakers: Runtime interventions that detect and halt harmful outputs during inference
Representation-Based:
- Representation Engineering: Controlling behavior via internal representations