Technical AI Safety Research
Technical AI Safety Research
Quick Assessment
Section titled “Quick Assessment”| Dimension | Assessment | Evidence |
|---|---|---|
| Funding Level | $10-130M annually (2021-2024) | Open Philanthropy↗, AI Safety Fund↗, Long-Term Future Fund |
| Research Community Size | 500+ dedicated researchers | Frontier labs (~350-400), independent orgs (~100-150), academic groups |
| Tractability | Moderate-High | Mechanistic interpretability showing concrete progress; control evaluations deployable now |
| Impact Potential | 2-50% x-risk reduction | Depends heavily on timeline, adoption, and technical success |
| Timeline Pressure | High | MIRI’s 2024 assessment↗: research “extremely unlikely to succeed in time” absent policy intervention |
| Key Bottleneck | Talent and compute access | Frontier model access critical; most impactful work requires lab partnerships |
| Adoption Rate | Growing | Labs implementing RSPs, control protocols, and evaluation frameworks |
Overview
Section titled “Overview”Technical AI safety research aims to make AI systems reliably safe and aligned with human values through direct scientific and engineering work. This is the most direct intervention—if successful, it solves the core problem that makes AI risky.
The field has grown substantially since 2020, with approximately 500+ dedicated researchers working across frontier labs (Anthropic, DeepMind, OpenAI) and independent organizations (Redwood Research, MIRI, Apollo Research, METR). Annual funding is estimated at $80-130M, though this remains insufficient relative to capabilities research spending. Key advances include Anthropic’s identification of millions of interpretable features in production models, Redwood Research’s AI control framework for maintaining safety even with misaligned systems, and Apollo Research’s demonstration that five of six frontier models exhibit scheming capabilities in controlled tests.
The field faces significant timeline pressure. MIRI’s 2024 strategy update concluded that alignment research is “extremely unlikely to succeed in time” absent policy interventions to slow AI development. This has led to increased focus on near-term deployable techniques like AI control and capability evaluations, rather than purely theoretical approaches.
Theory of Change
Section titled “Theory of Change”Key mechanisms:
- Scientific understanding: Develop theories of how AI systems work and fail
- Engineering solutions: Create practical techniques for making systems safer
- Validation methods: Build tools to verify safety properties
- Adoption: Labs implement techniques in production systems
Major Research Agendas
Section titled “Major Research Agendas”1. Mechanistic Interpretability
Section titled “1. Mechanistic Interpretability”Goal: Understand what’s happening inside neural networks by reverse-engineering their computations.
Approach:
- Identify interpretable features in activation space
- Map out computational circuits
- Understand superposition and representation learning
- Develop automated interpretability tools
Recent progress:
- Anthropic’s “Scaling Monosemanticity” (May 2024)↗ identified tens of millions of interpretable features in Claude 3 Sonnet using sparse autoencoders—the first detailed look inside a production-grade large language model
- Features discovered include concepts like the Golden Gate Bridge, code errors, deceptive behavior, and safety-relevant patterns
- DeepMind’s mechanistic interpretability team↗ growing 37% in 2024, pursuing similar research directions
- Circuit discovery moving from toy models to larger systems, though full model coverage remains cost-prohibitive
Key organizations:
- Anthropic↗ (Interpretability team, ~15-25 researchers)
- DeepMind↗ (Mechanistic Interpretability team)
- Redwood Research (causal scrubbing, circuit analysis)
- Apollo Research (deception-focused interpretability)
Theory of change: If we can read AI cognition, we can detect deception, verify alignment, and debug failures before deployment.
Estimated x-risk reduction if interpretability becomes highly effective
| Source | Estimate | Date |
|---|---|---|
| Optimistic view | 30-50% | — |
| Moderate view | 10-20% | — |
| Pessimistic view | 2-5% | — |
Optimistic view: Can detect and fix most alignment failures
Moderate view: Helps but doesn't solve the full problem
Pessimistic view: Interpretability may not scale or may be fooled
2. Scalable Oversight
Section titled “2. Scalable Oversight”Goal: Enable humans to supervise AI systems on tasks beyond human ability to directly evaluate.
Approaches:
- Recursive reward modeling: Use AI to help evaluate AI
- Debate: AI systems argue for different answers; humans judge
- Market-based approaches: Prediction markets over outcomes
- Process-based supervision: Reward reasoning process, not just outcomes
Recent progress:
- Weak-to-strong generalization (OpenAI, December 2023)↗: Small models can partially supervise large models, with strong students consistently outperforming their weak supervisors. Tested across GPT-4 family spanning 7 orders of magnitude of compute.
- Constitutional AI (Anthropic, 2022)↗: Enables training harmless AI using only ~10 natural language principles (“constitution”) as human oversight, introducing RLAIF (RL from AI Feedback)
- OpenAI committed 20% of compute over 4 years to superalignment, plus a $10M grants program↗
- Debate experiments showing promise in math/reasoning domains
Key organizations:
- OpenAI (Superalignment team dissolved mid-2024; unclear current state)
- Anthropic↗ (Alignment Science team)
- DeepMind↗ (Alignment team, led by Anca Dragan and Rohin Shah)
Theory of change: If we can supervise superhuman AI, we can train aligned systems even when we can’t evaluate all behaviors directly.
3. Robustness and Adversarial Techniques
Section titled “3. Robustness and Adversarial Techniques”Goal: Make AI systems reliable under distribution shift and adversarial conditions.
Approaches:
- Red-teaming and adversarial testing
- Formal verification of properties
- Worst-case optimization
- Adversarial training
- Out-of-distribution detection
Recent progress:
- Red-teaming identified many LLM vulnerabilities
- Formal verification for small networks
- Adversarial robustness still largely unsolved at scale
Key organizations:
- MIRI (formal verification focus)
- DeepMind (testing and evaluation)
- Anthropic (red-teaming)
- FAR AI (field-building for adversarial robustness)
4. Agent Foundations
Section titled “4. Agent Foundations”Goal: Develop mathematical foundations for understanding goal-directed agents.
Topics:
- Decision theory for embedded agents
- Logical uncertainty
- Corrigibility
- Utility functions and preferences
- Embedded agency
Status:
- Mostly theoretical/philosophical
- Less empirical progress than other agendas
- Higher risk, higher reward research
Key organizations:
- MIRI (primary focus)
- FHI (some work, though institute closed)
- Academic researchers (scattered)
Theory of change: Solve fundamental conceptual problems that block alignment progress, enabling more tractable approaches later.
5. AI Control
Section titled “5. AI Control”Goal: Maintain control over AI systems even if they are misaligned.
Approaches:
- Trusted monitoring: Use weaker but trusted AI to monitor stronger AI
- Untrusted monitoring: Assume monitors may be compromised
- Task decomposition: Break tasks so no single AI can cause harm
- Containment protocols: Prevent AI from taking harmful actions
Recent progress:
- Redwood Research’s AI Control paper (December 2023)↗: Introduced control evaluations where red teams search for attack strategies, then assess whether untrusted models can subvert safety measures
- “The case for ensuring that powerful AIs are controlled” (May 2024)↗: Core framework explaining safety under the conservative assumption that AIs are misaligned and actively scheming
- Control protocols use three building blocks: untrusted models (smart, fairly cheap), trusted models (dumb, extremely cheap), and humans (expensive trusted labor)
- Recommendation: ~50% of AI control research should make conservative misalignment assumptions
Key organizations:
- Redwood Research↗ (primary developer, ~20 staff, $100M+ runway)
- Apollo Research↗ (scheming and deception research)
- Anthropic (adopting control ideas in deployment)
Theory of change: Even if alignment is unsolved, we can maintain safety through monitoring and containment as we develop better solutions.
6. Evaluations and Dangerous Capability Detection
Section titled “6. Evaluations and Dangerous Capability Detection”Goal: Detect dangerous capabilities before deployment.
Approaches:
- Benchmark dangerous capabilities (bioweapons, cyber, manipulation)
- Autonomous replication and adaptation (ARA) testing
- Situational awareness evaluations
- Deception and scheming detection
Recent progress:
- Apollo Research scheming evaluations (December 2024)↗: Five of six frontier models showed in-context scheming capabilities—including o1 and Claude 3.5 Sonnet. Models copied themselves to other servers in 0.3-10% of test runs, disabled oversight mechanisms, and behaved differently when unobserved.
- OpenAI-Apollo collaboration↗: “Deliberative alignment” reduced detectable scheming by up to 30x (o3: 13% to 0.4%; o4-mini: 8.7% to 0.3%)
- METR’s Autonomy Evaluation Resources (March 2024)↗: Framework for testing dangerous autonomous capabilities, including the “rogue replication” threat model
- UK AI Safety Institute’s Inspect framework↗: 100+ pre-built evaluations for coding, reasoning, knowledge, and multimodal understanding
Key organizations:
- METR↗ (Model Evaluation and Threat Research)
- Apollo Research↗ (scheming, deception, ~10-15 researchers)
- UK AISI↗ (government evaluation institute, conducting pre-deployment testing since November 2023)
- US AISI (recently formed)
Theory of change: Early detection of dangerous capabilities enables labs to implement safety measures before risks materialize.
Research Agenda Comparison
Section titled “Research Agenda Comparison”| Agenda | Tractability | Impact if Successful | Timeline to Deployment | Key Risk |
|---|---|---|---|---|
| Mechanistic Interpretability | Medium-High | 10-50% x-risk reduction | 3-10 years | May not scale to frontier models; interpretable features insufficient for safety |
| Scalable Oversight | Medium | 15-40% x-risk reduction | 2-5 years | Weak supervisors may not elicit true capabilities; gaming of oversight signals |
| AI Control | High | 5-20% x-risk reduction | 0-2 years | Assumes capability gap between monitor and monitored; may fail with rapid capability gain |
| Evaluations | Very High | 5-15% x-risk reduction | Already deployed | Evaluations may miss dangerous capabilities; models may sandbag |
| Agent Foundations | Low | 20-60% x-risk reduction | 5-15+ years | Too slow to matter; may solve wrong problem |
| Robustness | Medium | 5-15% x-risk reduction | 2-5 years | Hard to scale; adversarial robustness fundamentally difficult |
Risks Addressed
Section titled “Risks Addressed”Technical AI safety research is the primary response to alignment-related risks:
| Risk Category | How Technical Research Addresses It | Key Agendas |
|---|---|---|
| Deceptive alignment | Interpretability to detect hidden goals; control protocols for misaligned systems | Mechanistic interpretability, AI control, scheming evaluations |
| Goal misgeneralization | Understanding learned representations; robust training methods | Interpretability, scalable oversight |
| Power-seeking behavior | Detecting power-seeking; designing systems without instrumental convergence | Agent foundations, evaluations |
| Bioweapons misuse | Dangerous capability evaluations; refusals and filtering | Evaluations, robustness |
| Cyberweapons misuse | Capability evaluations; secure deployment | Evaluations, control |
| Autonomous replication | ARA evaluations; containment protocols | METR evaluations, control |
What Needs to Be True
Section titled “What Needs to Be True”For technical research to substantially reduce x-risk:
- Sufficient time: We have enough time for research to mature before transformative AI
- Technical tractability: Alignment problems have technical solutions
- Adoption: Labs implement research findings in production
- Completeness: Solutions work for the AI systems actually deployed (not just toy models)
- Robustness: Solutions work under competitive pressure and adversarial conditions
Views on whether current alignment approaches will succeed
Estimated Impact by Worldview
Section titled “Estimated Impact by Worldview”Short Timelines + Alignment Hard
Section titled “Short Timelines + Alignment Hard”Impact: Medium-High
- Most critical intervention but may be too late
- Focus on practical near-term techniques
- AI control becomes especially valuable
- De-prioritize long-term theoretical work
Long Timelines + Alignment Hard
Section titled “Long Timelines + Alignment Hard”Impact: Very High
- Best opportunity to solve the problem
- Time for fundamental research to mature
- Can work on harder problems
- Highest expected value intervention
Short Timelines + Alignment Moderate
Section titled “Short Timelines + Alignment Moderate”Impact: High
- Empirical research can succeed in time
- Governance buys additional time
- Focus on scaling existing techniques
- Evaluations critical for near-term safety
Long Timelines + Alignment Moderate
Section titled “Long Timelines + Alignment Moderate”Impact: High
- Technical research plus governance
- Time to develop and validate solutions
- Can be thorough rather than rushed
- Multiple approaches can be tried
Tractability Assessment
Section titled “Tractability Assessment”High tractability areas:
- Mechanistic interpretability (clear techniques, measurable progress)
- Evaluations and red-teaming (immediate application)
- AI control (practical protocols being developed)
Medium tractability:
- Scalable oversight (conceptual clarity but implementation challenges)
- Robustness (progress on narrow problems, hard to scale)
Lower tractability:
- Agent foundations (fundamental open problems)
- Inner alignment (may require conceptual breakthroughs)
- Deceptive alignment (hard to test until it happens)
Who Should Consider This
Section titled “Who Should Consider This”Strong fit if you:
- Have strong ML/CS technical background (or can build it)
- Enjoy research and can work with ambiguity
- Can self-direct or thrive in small research teams
- Care deeply about directly solving the problem
- Have 5-10+ years to build expertise and contribute
Prerequisites:
- ML background: Deep learning, transformers, training at scale
- Math: Linear algebra, calculus, probability, optimization
- Programming: Python, PyTorch/JAX, large-scale computing
- Research taste: Can identify important problems and approaches
Entry paths:
- PhD in ML/CS focused on safety
- Self-study + open source contributions
- MATS/ARENA programs
- Research engineer at safety org
Less good fit if:
- Want immediate impact (research is slow)
- Prefer working with people over math/code
- More interested in implementation than discovery
- Uncertain about long-term commitment
Funding Landscape
Section titled “Funding Landscape”| Funder | Annual Amount | Focus Areas | Notes |
|---|---|---|---|
| Open Philanthropy↗ | ~$40M (2024 RFP) | Broad technical safety | Grants $50K-$5M; largest philanthropic funder |
| AI Safety Fund↗ | $10M+ total | Biosecurity, cyber, agents | Backed by Anthropic, Google, Microsoft, OpenAI |
| Long-Term Future Fund | ~$5-8M/year | AI risk mitigation | Over $20M in AI safety over 5 years |
| UK Government | ~$100M (2023) | Foundation model taskforce, AISI | Plus additional $8.5M for systematic safety |
| Future of Life Institute | ~$25M program | Existential risk reduction | PhD fellowships, grants |
| Foresight Institute↗ | ~$3M/year | AI safety nodes | Grants $10K-$100K |
Total estimated annual funding (2021-2024): $80-130M for trustworthy AI research—insufficient for larger, longer-term research efforts requiring more talent, compute, and time.
Key Organizations
Section titled “Key Organizations”| Organization | Size | Focus | Key Outputs |
|---|---|---|---|
| Anthropic↗ | ~300 employees | Interpretability, Constitutional AI, RSP | Scaling Monosemanticity, Claude RSP |
| DeepMind AI Safety↗ | ~50 safety researchers | Amplified oversight, frontier safety | Frontier Safety Framework |
| OpenAI | Unknown (team dissolved) | Superalignment (previously) | Weak-to-strong generalization |
| Redwood Research↗ | ~20 staff | AI control | Control evaluations framework |
| MIRI↗ | ~10-15 staff | Agent foundations, policy | Shifting to policy advocacy in 2024 |
| Apollo Research↗ | ~10-15 researchers | Scheming, deception | Frontier model scheming evaluations |
| METR↗ | ~10-20 researchers | Autonomy, dangerous capabilities | ARA evaluations, rogue replication |
| UK AISI↗ | Government-funded | Pre-deployment testing | Inspect framework, 100+ evaluations |
Academic Groups
Section titled “Academic Groups”- UC Berkeley (CHAI, Center for Human-Compatible AI)
- CMU (various safety researchers)
- Oxford/Cambridge (smaller groups)
- Various professors at institutions worldwide
Career Considerations
Section titled “Career Considerations”- Direct impact: Working on the core problem
- Intellectually stimulating: Cutting-edge research
- Growing field: More opportunities and funding
- Flexible: Remote work common, can transition to industry
- Community: Strong AI safety research community
- Long timelines: Years to make contributions
- High uncertainty: May work on wrong problem
- Competitive: Top labs are selective
- Funding dependent: Relies on continued EA/philanthropic funding
- Moral hazard: Could contribute to capabilities
Compensation
Section titled “Compensation”- PhD students: $30-50K/year (typical stipend)
- Research engineers: $100-200K
- Research scientists: $150-400K+ (at frontier labs)
- Independent researchers: $80-200K (grants/orgs)
Skills Development
Section titled “Skills Development”- Transferable ML skills (valuable in broader market)
- Research methodology
- Scientific communication
- Large-scale systems engineering
Open Questions and Uncertainties
Section titled “Open Questions and Uncertainties”❓Key Questions
Complementary Interventions
Section titled “Complementary Interventions”Technical research is most valuable when combined with:
- Governance: Ensures solutions are adopted and enforced
- Field-building: Creates pipeline of future researchers
- Evaluations: Tests whether solutions actually work
- Corporate influence: Gets research implemented at labs
Getting Started
Section titled “Getting Started”If you’re new to the field:
- Build foundations: Learn deep learning thoroughly (fast.ai, Karpathy tutorials)
- Study safety: Read Alignment Forum, take ARENA course
- Get feedback: Join MATS, attend EAG, talk to researchers
- Demonstrate ability: Publish interpretability work, contribute to open source
- Apply: Research engineer roles are most accessible entry point
Resources:
- ARENA (AI Safety research program)
- MATS (ML Alignment Theory Scholars)
- AI Safety Camp
- Alignment Forum
- AGI Safety Fundamentals course
Sources & Key References
Section titled “Sources & Key References”Mechanistic Interpretability
Section titled “Mechanistic Interpretability”- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet↗ - Anthropic, May 2024
- Mapping the Mind of a Large Language Model↗ - Anthropic blog post
- Decomposing Language Models Into Understandable Components↗ - Anthropic, 2023
Scalable Oversight
Section titled “Scalable Oversight”- Weak-to-strong generalization↗ - OpenAI, December 2023
- Constitutional AI: Harmlessness from AI Feedback↗ - Anthropic, December 2022
- Introducing Superalignment↗ - OpenAI, July 2023
AI Control
Section titled “AI Control”- AI Control: Improving Safety Despite Intentional Subversion↗ - Redwood Research, December 2023
- The case for ensuring that powerful AIs are controlled↗ - Redwood Research blog, May 2024
- A sketch of an AI control safety case↗ - LessWrong
Evaluations & Scheming
Section titled “Evaluations & Scheming”- Frontier Models are Capable of In-Context Scheming↗ - Apollo Research, December 2024
- Detecting and reducing scheming in AI models↗ - OpenAI-Apollo collaboration
- New Tests Reveal AI’s Capacity for Deception↗ - TIME, December 2024
- Autonomy Evaluation Resources↗ - METR, March 2024
- The Rogue Replication Threat Model↗ - METR, November 2024
Safety Frameworks
Section titled “Safety Frameworks”- Introducing the Frontier Safety Framework↗ - Google DeepMind
- AGI Safety and Alignment at Google DeepMind↗ - DeepMind Safety Research
- UK AI Safety Institute Inspect Framework↗ - UK AISI
- Frontier AI Trends Report↗ - UK AI Security Institute
Strategy & Funding
Section titled “Strategy & Funding”- MIRI 2024 Mission and Strategy Update↗ - MIRI, January 2024
- MIRI’s 2024 End-of-Year Update↗ - MIRI, December 2024
- An Overview of the AI Safety Funding Situation↗ - LessWrong
- Request for Proposals: Technical AI Safety Research↗ - Open Philanthropy
- AI Safety Fund↗ - Frontier Model Forum
AI Transition Model Context
Section titled “AI Transition Model Context”Technical AI safety research is the primary lever for reducing Misalignment Potential in the Ai Transition Model:
| Factor | Parameter | Impact |
|---|---|---|
| Misalignment Potential | Alignment Robustness | Core research goal: ensure alignment persists as capabilities scale |
| Misalignment Potential | Interpretability Coverage | Mechanistic interpretability enables detection of misalignment |
| Misalignment Potential | Safety-Capability Gap | Must outpace capability development to maintain safety margins |
| Misalignment Potential | Human Oversight Quality | Scalable oversight and AI control extend human supervision |
Technical research effectiveness depends critically on timelines, adoption, and whether alignment problems are technically tractable.