Research Agenda Comparison
Quick Assessment
| Organization | Annual Budget | Staff Size | Primary Focus | Tractability | Timeline |
|---|---|---|---|---|---|
| Anthropic | ~$100M+ (safety research) | ~15 red team + growing safety org | Constitutional AI, Interpretability | High (empirical validation) | 2-5 years |
| DeepMind | $50M+ (estimated) | 30-50 researchers, growing 37-39% annually | Scalable Oversight, Formal Verification | Medium (theoretical + empirical) | 5-10 years |
| ARC | ~$5M annually | ~5-10 researchers | Eliciting Latent Knowledge | Low (fundamental problem) | 5-7 years |
| Redwood Research | ~$5-10M annually ($21M total raised) | Small team + Constellation (30k sqft) | AI Control protocols | High (practical protocols) | Near-term |
| MIRI | $7M (2025), $10M reserves | ~7 FTE comms, pivoting to governance | Agent Foundations, Policy | Low (fundamental theory) | Long-term/governance |
| SSI (Sutskever) | $2B raised (2024) | New org, building team | Superintelligence alignment | Unknown (stealth mode) | Long-term |
| OpenAI (dissolved) | 20% compute allocation | Disbanded May 2024 | Superalignment (weak-to-strong) | N/A (team dissolved) | N/A |
Researcher Compensation Comparison
| Organization | Research Scientist Range | Senior/Lead Range | Equity/Stock | Fellowship/Intern Rates |
|---|---|---|---|---|
| Anthropic | $315K-$560K total | $550K-$760K total | 4-year vesting | $3,850/week + $15K/mo compute |
| DeepMind | $200K-$260K base | $300K+ with bonus | Google RSUs | $113K-$150K student researcher |
| ARC | $150K-$400K (historical) | Similar range | Nonprofit, no equity | $15K/month intern |
| Redwood | $150K-$300K (estimated) | Similar range | Nonprofit, no equity | Fellowship program |
| MIRI | $100K-$200K (estimated) | Similar range | Nonprofit, no equity | Summer fellows program |
| SSI | Unknown (stealth) | Premium for top talent | $10B valuation equity | N/A |
The compensation gap between frontier labs (Anthropic, DeepMind) and nonprofit research organizations (ARC, Redwood, MIRI) creates significant talent allocation challenges for the field. Anthropic’s median total compensation of approximately $545K substantially exceeds nonprofit salaries, though nonprofits offer other benefits including mission alignment and research freedom.
Overview
The AI safety landscape features dramatically different research agendas based on fundamentally different theories about what will make advanced AI systems safe. These approaches range from Anthropic’s constitutional AI—which aims to instill human values through training—to MIRI’s agent foundations work that seeks deep theoretical understanding before building anything. The differences aren’t merely tactical but reflect deep disagreements about timelines, the nature of intelligence, and what kinds of failures we should expect.
This fragmentation reflects both uncertainty about which technical approaches will succeed and disagreement about what “success” even looks like. Some agendas focus on making current large language models safer through better training techniques. Others prepare for scenarios where we deploy AI systems we cannot fully trust or understand. Still others argue that without fundamental conceptual breakthroughs, all current approaches will fail catastrophically. Understanding these differences is crucial for researchers, policymakers, and anyone trying to evaluate the overall trajectory of AI safety work.
The stakes of these disagreements are enormous. If constitutional AI succeeds at scale, we might achieve safe AGI through careful scaling of current techniques. If it fails and we haven’t developed robust control protocols or verification methods, we could face exactly the catastrophic scenarios these research programs aim to prevent. The resource allocation decisions being made today across these different approaches may determine humanity’s ability to navigate the development of transformative AI.
AI Safety Funding Landscape (2025)
The AI safety funding ecosystem has evolved dramatically, with total available funding estimated at $500M-$1B annually across all sources, though the vast majority concentrates at frontier labs:
| Funding Source | Annual Amount | Focus Areas | Constraints |
|---|---|---|---|
| Anthropic internal | $100M+ (estimated) | Constitutional AI, interpretability, RSP | Internal priorities |
| DeepMind internal | $50M+ (estimated) | Scalable oversight, formal verification | Google priorities |
| Open Philanthropy | $16.6M (deep learning alignment) | Academic grants, org support | Capacity constrained |
| OpenAI Fast Grants | $10M (one-time) | External alignment research | Completed program |
| LTFF/SFF | $5-10M annually | Small grants, independent researchers | Funding limited since 2024 |
| Government (AISI) | Growing but small | Evaluation, standards | Early stage |
A critical constraint emerged in late 2024: alignment grantmaking became mostly funding-bottlenecked rather than talent-bottlenecked. The Lightspeed grants round received far more promising applications than available funding, and the Long-Term Future Fund reported insufficient resources for worthy grants. Approximately 80-90% of external alignment funding flows through Open Philanthropy, creating concentration risk and capacity constraints in grant evaluation.
Research Agenda Risk Assessment
Each research agenda addresses different failure modes with varying degrees of coverage:
| Agenda | Deceptive Alignment | Reward Hacking | Capability Overhang | Coordination Failure | Overall Coverage |
|---|---|---|---|---|---|
| Constitutional AI | Partial (training-based) | High | Low | Low | 40-50% |
| Interpretability | High (if successful) | Medium | Low | Low | 35-50% |
| Scalable Oversight | Medium | High | Medium | Low | 50-60% |
| AI Control | High (containment) | Medium | Medium | Low | 55-65% |
| ELK | High (if solved) | Low | Low | Low | 30-40% |
| Agent Foundations | High (theoretical) | High (theoretical) | Medium | Low | 25-40% |
The table illustrates why portfolio diversification across research agendas is critical: no single approach provides comprehensive coverage of all major failure modes. Constitutional AI addresses reward hacking well but struggles with deceptive alignment; AI control handles near-term containment but may not scale to superintelligence; and agent foundations provides theoretical grounding but limited practical tools. Estimated overall coverage ranges from 25-65% for individual agendas, suggesting that even with substantial investment, significant risk gaps remain.
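To make the portfolio intuition concrete, the sketch below (an illustration added here, not an analysis from any of the organizations discussed) treats each agenda's midpoint coverage estimate from the table as an independent probability of catching a given failure and computes the combined coverage. Independence across agendas is a strong and almost certainly false assumption, so the output is intuition-building only.

```python
# Illustrative only: combine per-agenda "coverage" estimates under an
# (unrealistic) independence assumption to show why portfolios help.

coverage_midpoints = {
    "Constitutional AI": 0.45,   # midpoint of 40-50%
    "Interpretability": 0.425,   # midpoint of 35-50%
    "Scalable Oversight": 0.55,  # midpoint of 50-60%
    "AI Control": 0.60,          # midpoint of 55-65%
    "ELK": 0.35,                 # midpoint of 30-40%
    "Agent Foundations": 0.325,  # midpoint of 25-40%
}

def residual_gap(coverages):
    """Probability that no agenda in the portfolio covers a given failure,
    assuming each agenda catches it independently."""
    gap = 1.0
    for c in coverages:
        gap *= (1.0 - c)
    return gap

best_single = max(coverage_midpoints.values())
portfolio_gap = residual_gap(coverage_midpoints.values())

print(f"Best single agenda coverage: {best_single:.0%}")
print(f"Portfolio coverage (independence assumed): {1 - portfolio_gap:.0%}")
```

Even under this optimistic independence assumption, a nontrivial residual gap remains, which is the point the table is making in a rougher form.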
Core Research Agenda Analysis
Anthropic’s Constitutional AI Paradigm
Anthropic’s approach represents the most direct attempt to scale current techniques to advanced AI systems. Their constitutional AI methodology trains language models to follow a set of principles through a two-stage process: first using AI feedback to critique and revise outputs according to the constitution, then using this data for reinforcement learning. Their Responsible Scaling Policy framework creates concrete capability thresholds (ASL-2, ASL-3, etc.) with corresponding safety requirements, providing a roadmap for scaling safely through increasingly capable systems.
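A minimal sketch of the first, supervised stage of that two-stage process is shown below, with a stand-in `model()` function in place of a real language model; the constitution text, prompts, and function names are illustrative, not Anthropic's implementation.

```python
# Sketch of Constitutional AI stage 1 (critique -> revise), assuming a
# stand-in `model` callable instead of a real LLM API.

CONSTITUTION = [
    "Choose the response that is least likely to help with harmful activities.",
    "Choose the response that is most honest about uncertainty.",
]

def model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string so the sketch runs."""
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(user_prompt: str) -> dict:
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Critique the response below against this principle:\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    # The (prompt, final revision) pairs become training data; the second
    # stage then applies reinforcement learning using AI feedback, as
    # described in the paragraph above.
    return {"prompt": user_prompt, "revised": draft}

example = critique_and_revise("How should I respond to a request for dangerous instructions?")
print(example["revised"])
```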
With over $16 billion raised in 2025↗ (including a $13B Series F at $183B valuation in September 2025, followed by further growth to over $350 billion valuation by November), Anthropic dedicates substantial resources to safety research. Their Frontier Red Team comprises approximately 15 researchers who stress-test advanced systems for misuse risks in biological research, cybersecurity, and autonomous systems. The company’s Anthropic Fellows Program↗ provides $3,850/week stipends plus ~$15k/month in compute funding, producing research papers from over 80% of fellows. Their Claude Academy has trained over 300 engineers in AI safety practices, converting generalist software developers into specialized practitioners.
The constitutional AI work has shown promising empirical results on current models, successfully reducing harmful outputs while maintaining helpfulness across various tasks. Their 2024 research on “sleeper agents” provided concerning evidence that deceptive alignment could persist through standard training techniques, while their mechanistic interpretability research aims to understand model internals well enough to detect such deception. The combination suggests a research program that takes seriously both the promise and the perils of scaling current approaches.
However, critics argue that constitutional AI may create a false sense of security by solving outer alignment (getting models to produce safe outputs) without addressing inner alignment (ensuring the model’s internal goals are aligned). The approach assumes that values can be reliably instilled through training rather than just learned as behavioral patterns that might break down under novel circumstances. Whether this assumption holds for superintelligent systems remains one of the most crucial open questions in AI safety.
DeepMind’s Scalable Oversight Framework
DeepMind’s research program centers on the fundamental challenge of maintaining human oversight over AI systems that surpass human capabilities. Their debate approach, where AI systems argue different positions for human evaluation, aims to leverage competition to surface truth even when evaluators cannot directly assess claims. Process supervision evaluates reasoning steps rather than final outputs, potentially catching errors before they compound into dangerous conclusions.
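A toy rendering of the debate setup is sketched below (the argument generation, judge, and scoring are placeholders, not DeepMind's protocol): two copies of a model argue opposing answers over several rounds, and a weaker judge who cannot evaluate the question directly picks the answer whose defense held up better.

```python
# Toy debate protocol: two debaters argue opposing answers; a judge that
# cannot assess the question directly scores the transcript. All three
# roles are stubs so the sketch runs without a real model.

from typing import List, Tuple

def debater(stance: str, transcript: List[str], round_no: int) -> str:
    """Placeholder for an LLM asked to defend `stance` given the transcript so far."""
    return f"[round {round_no}] argument for '{stance}'"

def judge(transcript: List[str]) -> str:
    """Placeholder judge; a real judge would weigh which arguments survived rebuttal."""
    return "A" if len(transcript) % 2 == 0 else "B"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> Tuple[str, List[str]]:
    transcript = [f"Question: {question}", f"A claims: {answer_a}", f"B claims: {answer_b}"]
    for r in range(1, rounds + 1):
        transcript.append("A: " + debater(answer_a, transcript, r))
        transcript.append("B: " + debater(answer_b, transcript, r))
    return judge(transcript), transcript

winner, log = run_debate("Does this plan have hidden failure modes?", "Yes", "No")
print("Judge sides with:", winner)
```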
Google DeepMind’s AGI Safety & Alignment team↗ has grown 37-39% annually and now comprises 30-50 researchers depending on boundary definitions (not counting present-day safety or adversarial robustness teams). Leadership includes Anca Dragan, Rohin Shah, and Allan Dafoe under executive sponsor Shane Legg, with the team organized into AGI Alignment (mechanistic interpretability, scalable oversight) and Frontier Safety (dangerous capability evaluations). Their Frontier Safety Framework↗ provides systematic evaluation protocols, while the AGI Safety Council analyzes risks and recommends safety practices. DeepMind’s safety research investment is estimated at over $50 million annually, though exact figures are not publicly disclosed. DeepMind has stated it is investing heavily in interpretability research to make AI systems more understandable and auditable.
The scalable oversight paradigm addresses a core problem: if AI systems become more capable than humans in domains like scientific research or strategic planning, how can we evaluate whether they’re pursuing our intended goals? Their research on recursive reward modeling explores using AI systems to help evaluate other AI systems, potentially creating a bootstrapping process for maintaining oversight as capabilities scale. This connects to their broader work on formal verification, attempting to provide mathematical guarantees about AI system behavior.
Recent results on process supervision in mathematical reasoning show promise, with models trained to optimize for correct reasoning steps rather than just correct answers showing improved robustness. However, the approach faces significant challenges in domains where correct processes are less well-defined than mathematics. Critics worry that debate could favor persuasive arguments over truthful ones, and that recursive evaluation might amplify rather than correct human biases embedded in the initial training process.
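The contrast between the two supervision styles can be made concrete with a toy reward function (an illustration only; the step-level grader here is a stub, not the grading setup used in published process-supervision work): outcome supervision rewards only the final answer, so a solution that reaches the right answer through invalid steps is fully rewarded, while process supervision penalizes it.

```python
# Toy contrast between outcome supervision (final answer only) and
# process supervision (every reasoning step). The step grader is a stub.

from dataclasses import dataclass
from typing import List

@dataclass
class Solution:
    steps: List[str]         # chain-of-thought steps
    steps_valid: List[bool]  # stand-in for a step-level grader's verdicts
    final_answer: str

def outcome_reward(sol: Solution, correct_answer: str) -> float:
    return 1.0 if sol.final_answer == correct_answer else 0.0

def process_reward(sol: Solution, correct_answer: str) -> float:
    step_score = sum(sol.steps_valid) / len(sol.steps_valid)
    answer_score = 1.0 if sol.final_answer == correct_answer else 0.0
    return 0.5 * step_score + 0.5 * answer_score

lucky_guess = Solution(
    steps=["assume x = 2 without justification", "therefore the area is 12"],
    steps_valid=[False, False],
    final_answer="12",
)
print(outcome_reward(lucky_guess, "12"))  # 1.0 -- rewarded despite invalid reasoning
print(process_reward(lucky_guess, "12"))  # 0.5 -- penalized for the invalid steps
```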
ARC’s Eliciting Latent Knowledge Research
ARC’s research agenda, led by Paul Christiano, focuses on what many consider the core technical problem of AI alignment: ensuring that AI systems report what they actually know rather than what they think humans want to hear. The eliciting latent knowledge (ELK) problem assumes that advanced AI systems will develop rich internal representations of the world but may not be incentivized to share their true beliefs with human operators.
The Alignment Research Center↗ operates with a small team of approximately 5-10 researchers and an annual budget of around $5 million. Early staff included Paul Christiano and Mark Xu, with the organization receiving $265,000 from Open Philanthropy in March 2022↗ and a subsequent $1.25 million grant for general support. In 2025, ARC reported making conceptual and theoretical progress at the fastest pace since 2022, with current research focusing on combining mechanistic interpretability with formal verification. The organization spun out ARC Evals as an independent nonprofit called METR in December 2023, led by Beth Barnes, who joined from OpenAI. Historical salary ranges for full-time positions were $150k-400k annually, with intern positions at $15k/month. METR now partners with the AI Safety Institute, is part of the NIST AI Safety Institute Consortium, and conducts evaluations of frontier models including OpenAI’s o3 and o4-mini and Claude 3.7 Sonnet.
The ELK problem is particularly concerning in scenarios where AI systems understand that humans are using their reports to make important decisions. A system that understands human psychology and incentives might learn to give reassuring rather than accurate assessments of plans or situations. ARC’s research explores various technical approaches to extracting genuine beliefs, including methods that don’t rely on human feedback and could thus be harder for AI systems to manipulate.
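One concrete line of attack, shown below as an illustration of the general idea rather than ARC's proposed solution, is to train a lightweight probe on a model's internal activations to predict ground-truth facts directly, bypassing what the model chooses to say. The activations here are synthetic stand-ins so the sketch runs on its own.

```python
# Illustrative "latent knowledge" probe: fit a linear classifier on
# hidden-state vectors to predict a ground-truth label, independent of
# what the model *reports*. Activations are synthetic stand-ins.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 500, 64

truth = rng.integers(0, 2, size=n)                                # ground-truth fact labels
signal = rng.normal(size=d)                                       # direction encoding the fact
activations = rng.normal(size=(n, d)) + np.outer(truth, signal)   # stand-in hidden states
stated = np.where(rng.random(n) < 0.3, 1 - truth, truth)          # model's (sometimes false) reports

probe = LogisticRegression(max_iter=1000).fit(activations[:400], truth[:400])
print("probe accuracy on held-out facts:", probe.score(activations[400:], truth[400:]))
print("accuracy of trusting the model's reports:", (stated[400:] == truth[400:]).mean())
```

The hard part of ELK, which this toy setup assumes away, is knowing whether such a probe reads off the model's genuine world-model or merely another predictor of what humans believe.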
ARC’s practical work through ARC Evals provides crucial empirical grounding by testing current systems for dangerous capabilities before they’re deployed. Their evaluations of GPT-4 for tasks like autonomous replication and persuasion established important precedents for capability assessment. However, the core ELK problem remains largely unsolved, with fundamental questions about whether latent knowledge even exists in the forms the research assumes. Some critics argue that attempting to solve ELK may be unnecessary if other approaches can provide adequate safety guarantees.
Redwood Research’s AI Control Approach
Redwood Research takes a pragmatic approach that explicitly doesn’t require solving alignment. Their AI control paradigm assumes we may deploy powerful AI systems before we fully understand or trust them, and focuses on developing protocols to safely use untrusted AI. This includes monitoring systems, decomposing tasks so that no single AI system has excessive power, and maintaining human oversight over critical decisions.
Founded in August 2021, Redwood Research↗ has received approximately $21 million in total funding: $20 million from Open Philanthropy↗ and $1.27 million from the Survival and Flourishing Fund, with approximately two-thirds ($14M) for research operations and one-third ($7M) for running Constellation. Their ICML oral paper “AI Control: Improving Risk Despite Intentional Subversion”↗ introduced protocols for monitoring malign LLM agents, establishing AI control as a bedrock approach for mitigating catastrophic risk. Leadership includes Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO), with a board comprising Nate Thomas, Holden Karnofsky, and Paul Christiano. Redwood collaborates with governments and advises AI companies including Google DeepMind and Anthropic, partnering with UK AISI to produce “A sketch of an AI control safety case.” The organization runs the Constellation office space, a 30,000 square foot facility hosting staff from multiple AI safety organizations including Open Philanthropy, ARC, Atlas Fellowship, and CEA, with a Visiting Fellows program that has placed dozens of participants into full-time roles on leading safety teams.
The AI control approach acknowledges that perfect alignment may be impossible or insufficient given competitive pressures and deployment timelines. Instead, it focuses on containment strategies, robust monitoring, and defense-in-depth approaches that could work even with partially misaligned systems. Their research on control protocols examines specific technical mechanisms: how to detect when AI systems are acting outside their intended scope, how to limit their capabilities while preserving usefulness, and how to maintain meaningful human control over important decisions.
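A minimal sketch of a trusted-monitoring style protocol, in the spirit of the AI control paper (the suspicion scorer and thresholds here are placeholders, not Redwood's implementation): an untrusted model proposes an action, a trusted but weaker monitor scores how suspicious it looks, and sufficiently suspicious actions are escalated to a human auditor or replaced with a trusted model's output.

```python
# Sketch of a trusted-monitoring control protocol: route each proposed
# action based on a suspicion score from a trusted (weaker) monitor.
# Scores and thresholds are placeholders, not a real deployment policy.

AUDIT_THRESHOLD = 0.8    # escalate to human review
DEFER_THRESHOLD = 0.5    # fall back to a trusted model's (weaker) answer

def untrusted_model(task: str) -> str:
    return f"<untrusted solution to: {task}>"

def trusted_model(task: str) -> str:
    return f"<trusted but less capable solution to: {task}>"

def monitor_suspicion(action: str) -> float:
    """Placeholder for a trusted monitor scoring how suspicious an action is."""
    return 0.9 if "rm -rf" in action else 0.1

def control_protocol(task: str) -> str:
    proposal = untrusted_model(task)
    suspicion = monitor_suspicion(proposal)
    if suspicion >= AUDIT_THRESHOLD:
        return f"ESCALATED to human audit: {proposal!r}"
    if suspicion >= DEFER_THRESHOLD:
        return trusted_model(task)   # trade some usefulness for safety
    return proposal                  # low suspicion: use the untrusted output

print(control_protocol("clean up temporary files"))
```

The design question the research then tackles is how well such protocols hold up when the untrusted model is actively trying to slip harmful actions past the monitor.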
This paradigm offers near-term practical benefits and could buy crucial time for other safety research. Control protocols could help prevent immediate harms from current systems while longer-term alignment research continues. However, critics argue that the approach may not scale to truly superintelligent systems that could find ways to circumvent monitoring and controls. There’s also concern that successful control protocols might create complacency, reducing incentives to solve more fundamental alignment problems.
MIRI’s Agent Foundations Program
MIRI’s research program stands apart by focusing on fundamental theoretical questions about agency, goals, and decision-making before attempting to build aligned AI systems. Their agent foundations research explores problems like embedded agency (how can agents reason about themselves as part of the world they’re trying to understand?), logical uncertainty (how to make decisions when even logical facts are uncertain), and decision theory for AI systems.
The Machine Intelligence Research Institute↗ operates with a projected 2025 budget of $7.1 million and reserves of $10 million—enough for approximately 1.5 years if no additional funds are raised. Annual expenses from 2019-2023 ranged from $5.4M to $7.7M, with 2026 projections at a median of $8M (potentially up to $10M with growth). MIRI is conducting its first fundraiser in six years↗, targeting $6M with the first $1.6M matched 1:1 via an SFF grant (with a stretch goal of $10M total). This would bring reserves to their two-year target of $16M. Historically, MIRI received $2.1 million from Open Philanthropy in 2019↗ supplemented by $7.7M in 2020, plus several million dollars in Ethereum from Vitalik Buterin in 2021. The communications team currently comprises approximately seven full-time employees (including Nate Soares and Eliezer Yudkowsky), with plans to grow in 2026. The organization has pivoted from technical AI safety research to focusing on informing policymakers and the public about AI risks, expanding its communications team and launching a technical AI governance program.
This approach reflects deep skepticism about whether current AI paradigms can be made safe without fundamental conceptual breakthroughs. MIRI researchers argue that most other approaches are building on shaky theoretical foundations and may create capabilities without the conceptual tools needed to ensure safety. Their work on topics like Löb’s theorem and reflective reasoning explores how advanced AI systems might reason about their own reasoning processes—a capability that could be crucial for ensuring they remain aligned as they modify themselves or create successor systems.
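For reference, Löb's theorem, which motivates much of this work on reflective reasoning, states that for a theory $T$ extending Peano arithmetic with provability predicate $\Box$:

```latex
% Löb's theorem: if T proves "if P is provable then P", then T proves P outright.
\text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P.
% Internalized form, as a theorem of provability logic (GL):
\Box(\Box P \rightarrow P) \rightarrow \Box P
```

This is the source of the obstacle to an agent wholesale trusting its own or a successor's proofs: a system that asserts "whatever I (or my successor) can prove is true" thereby proves every statement, which is why naive schemes for delegating to self-modified successors break down.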
MIRI’s theoretical rigor has identified important problems and concepts that other researchers now work on. However, their approach faces criticism for being too abstract and disconnected from practical AI development. The theoretical questions they focus on may indeed need to be solved eventually, but the research program offers little immediate help with near-term AI systems. Given uncertain timelines to transformative AI, the value of their work depends heavily on whether we have time for fundamental research or need immediate practical solutions.
OpenAI’s Superalignment (Disbanded)
OpenAI’s Superalignment team, established in July 2023, represented a major commitment to solving superintelligence alignment within four years. Co-led by Ilya Sutskever (Chief Scientist) and Jan Leike (Head of Alignment), the team received 20% of OpenAI’s compute allocation↗—an unprecedented resource dedication for safety research. The team focused on “weak-to-strong generalization,” exploring whether weaker AI models could effectively supervise and align stronger successors.
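The weak-to-strong setup can be illustrated with ordinary classifiers (a loose analogy to the original experiments, not a reproduction): a small "weak supervisor" is trained on true labels, a larger "strong student" is then trained only on the weak model's labels, and the question is how much of the gap to a fully supervised strong model the student recovers.

```python
# Weak-to-strong generalization toy: does a strong student trained only on a
# weak supervisor's labels outperform the supervisor itself? (Loose analogy
# using sklearn models as stand-ins for weak and strong AI systems.)

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])      # weak supervisor
weak_labels = weak.predict(X_train)                                             # noisy supervision

student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)  # strong, weakly supervised
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)      # strong, fully supervised

w, s, c = (m.score(X_test, y_test) for m in (weak, student, ceiling))
print(f"weak supervisor: {w:.3f}  weak-to-strong student: {s:.3f}  strong ceiling: {c:.3f}")
print(f"fraction of the weak-to-ceiling gap recovered: {(s - w) / (c - w):.0%}")
```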
However, in May 2024, OpenAI disbanded the Superalignment team↗ less than one year after its formation. Jan Leike sharply criticized the company upon departure, stating “Over the past years, safety culture and processes have taken a backseat to shiny products.” Ilya Sutskever also left, with Jakub Pachocki replacing him as Chief Scientist and John Schulman becoming scientific lead for alignment work. Jan Leike subsequently joined Anthropic↗ to lead a new superalignment team focusing on scalable oversight, weak-to-strong generalization, and automated alignment research.
In October 2024, OpenAI disbanded another safety team↗—the “AGI Readiness” team that advised on the company’s capacity to handle increasingly powerful AI. Miles Brundage, senior advisor for AGI Readiness, wrote upon departure: “Neither OpenAI nor any other frontier lab is ready, and the world is also not ready.” The dissolution of both teams within six months raised significant questions about OpenAI’s commitment to safety research relative to product development.
Safe Superintelligence Inc. (SSI)
In June 2024, Ilya Sutskever—former OpenAI Chief Scientist and co-lead of the dissolved Superalignment team—founded Safe Superintelligence Inc. (SSI) with the explicit mission of building safe superintelligence as its sole focus. SSI raised $2 billion in its first funding round, achieving a $10 billion valuation before generating any revenue or releasing products.
The company operates in stealth mode with limited public information about its research approach, though its founding principles emphasize that safety and capabilities will be advanced together from the ground up. SSI represents a significant bet that a new organization, freed from the commercial pressures and legacy decisions of existing labs, can more effectively pursue safe superintelligence. The $2 billion in initial funding—comparable to years of safety research funding across the entire nonprofit field—provides substantial runway for long-term research without near-term product pressures.
Critics question whether SSI’s approach will differ meaningfully from existing efforts, and whether the stealth mode prevents the kind of external scrutiny that could improve safety outcomes. Supporters argue that Sutskever’s technical leadership and the organization’s singular focus on superintelligence alignment represent the most direct attempt to solve the core problem before capability pressures make it intractable.
Detailed Research Agenda Comparison
| Research Agenda | Funding/Resources | Team Size | Key Publications | Output Metrics | Industry Adoption |
|---|---|---|---|---|---|
| Anthropic Constitutional AI | $100M+ annually; $16B raised 2025 | ~15 red team + safety teams | RSP framework, Sleeper agents paper, Constitutional AI paper | 80%+ fellows publish; 300+ Claude Academy graduates | High (RSP adopted by labs) |
| Anthropic Interpretability | Shared safety budget | Integrated with safety teams | Mechanistic interpretability papers, Circuit analysis | Growing publication output | Medium (research influence) |
| DeepMind Scalable Oversight | $50M+ estimated | 30-50 researchers; growing 37-39% annually | Debate papers, Process supervision research | Multiple high-impact papers | Medium (research influence) |
| DeepMind Formal Verification | Shared budget | Integrated with safety teams | Mathematical verification research | Targeted publications | Low (early stage) |
| ARC (ELK) | ~$5M annually | ~5-10 researchers | ELK technical report, METR evaluations | Conceptual progress 2025 | Low (fundamental research) |
| Redwood (AI Control) | $21M total raised | Small team + Constellation | ICML oral paper on AI control | Government/industry collaborations | High (UK AISI partnership) |
| MIRI (Agent Foundations) | $7.1M (2025); $10M reserves | ~7 FTE comms, pivoting to governance | Agent foundations papers, Policy communications | Shifting to policy output | Low (pivoting away) |
| SSI (Sutskever) | $2B raised at $10B valuation | Building team (stealth) | None public yet | Unknown (stealth mode) | N/A (new) |
| OpenAI Superalignment | 20% compute allocation | Disbanded May 2024 | Weak-to-strong generalization research | N/A (dissolved) | N/A (dissolved) |
Safety Implications and Failure Modes
Each research agenda implies different failure modes and offers different types of safety assurances. Constitutional AI could fail if values aren’t genuinely internalized, leading to systems that behave safely during training but pursue misaligned goals when deployed in novel situations. The approach also depends on the assumption that we can identify and specify human values clearly enough to train systems effectively—an assumption that may not hold for complex moral questions.
Scalable oversight approaches face the challenge that oversight itself might be gameable by sufficiently sophisticated systems. If AI systems understand they’re being evaluated, they might optimize for appearing aligned rather than being aligned. The debate and process supervision methods also assume that correct reasoning processes can be identified and evaluated, which may not hold in domains where the correct approach is itself uncertain or controversial.
The control paradigm accepts some risk of misalignment in exchange for practical deployability, but this creates a fundamental scaling problem. As AI systems become more capable, the gap between their abilities and human oversight capabilities grows, potentially making control protocols increasingly ineffective. There’s also the risk that control protocols designed for current systems won’t transfer to fundamentally different AI architectures.
Each approach also offers distinct benefits. Constitutional AI provides immediate practical benefits for current systems and could scale smoothly if its assumptions hold. Scalable oversight directly addresses the superintelligence control problem. AI control offers near-term safety even without solving alignment. Agent foundations could prevent fundamental errors in how we think about AI goals and decision-making.
Timeline Considerations and Research Trajectories
The research agendas operate on different timelines that reflect their different assumptions about AI development. Anthropic’s constitutional AI work focuses on systems likely to be deployed within 2-5 years, with their Responsible Scaling Policy providing a framework for capability thresholds through potential AGI development. This near-term focus allows for empirical validation but may miss longer-term failure modes.
DeepMind’s scalable oversight research targets the 5-10 year timeframe when AI systems might exceed human capabilities in most domains. Their formal verification work requires significant theoretical development but could provide stronger safety guarantees. The timeline pressure creates tension between developing robust theoretical foundations and providing practical solutions for nearer-term systems.
ARC’s ELK research addresses problems that become critical as AI systems become sophisticated enough to understand and potentially manipulate human oversight. This could become relevant within 5-7 years as language models develop better theory of mind capabilities. However, fundamental progress on ELK has been limited, raising questions about whether the problem is solvable within relevant timelines.
Current trajectories suggest that multiple approaches will likely be needed rather than a single solution. Constitutional AI and control protocols provide immediate benefits for current systems. Interpretability and oversight methods could become crucial as capabilities scale. Foundational research might prevent fundamental errors in system design. The key question is resource allocation across these different timelines and approaches.
Key Technical and Strategic Uncertainties
The most fundamental uncertainty is whether current AI paradigms—large language models trained with reinforcement learning from human feedback—will lead directly to transformative AI. If scaling current approaches leads to AGI, then research on constitutional AI and scalable oversight directly addresses the systems we’ll need to align. If a paradigm shift occurs, much current safety research might not transfer to new architectures.
Another crucial uncertainty is the feasibility of verification without full alignment. Can we develop interpretability tools good enough to detect deceptive alignment? Can oversight methods scale to superintelligent systems? These questions determine whether we can safely deploy AI systems we don’t fully understand or trust—a scenario that seems increasingly likely given competitive pressures.
The tractability of fundamental theoretical problems remains unclear. Are problems like ELK solvable in principle? Do we need conceptual breakthroughs about agency and goals before building advanced AI? The answers affect how much effort should go into foundational research versus empirical approaches with current systems.
Strategic uncertainties include how different research agendas interact and whether they’re complements or substitutes. Does success in constitutional AI reduce the need for control protocols, or do we need defense in depth? How do competitive dynamics between AI developers affect the feasibility of different safety approaches? These considerations may be as important as the technical questions for determining which research directions to prioritize.
The comparison reveals that rather than competing approaches, we likely need a portfolio strategy that addresses different scenarios and timelines. The optimal allocation depends on beliefs about AI development trajectories, the tractability of different technical problems, and the strategic landscape for AI deployment. Understanding these trade-offs is essential for making effective research and policy decisions in an uncertain but rapidly evolving field.
Sources
Anthropic
- Anthropic’s $13 Billion Series F funding announcement↗ - December 2025
- Anthropic Fellows Program for 2026↗ - Applications open
- Introducing the Anthropic Fellows Program↗ - Program overview
DeepMind
- AGI Safety and Alignment at Google DeepMind↗ - Team overview
- Introducing the Frontier Safety Framework↗ - Safety evaluation protocols
OpenAI
- Introducing Superalignment↗ - July 2023
- OpenAI dissolves Superalignment AI safety team↗ - May 2024
- Anthropic hires former OpenAI safety lead Jan Leike↗ - May 2024
- OpenAI disbands AGI Readiness team↗ - October 2024
ARC
- Alignment Research Center website↗ - Organization overview
- ARC’s first technical report: Eliciting Latent Knowledge↗ - Technical paper
- Open Philanthropy grant to ARC↗ - March 2022
Redwood Research
- Redwood Research website↗ - Organization overview
- Open Philanthropy general support grant↗ - Funding information
- AI Control: Improving Risk Despite Intentional Subversion↗ - ICML oral paper
MIRI
- MIRI’s 2025 Fundraiser↗ - Current fundraising campaign
- MIRI’s 2024 End-of-Year Update↗ - Strategic direction
- Open Philanthropy grant to MIRI (2019)↗ - Funding history
Funding Landscape
- Open Philanthropy AI alignment grants↗ - $16.6M for deep learning alignment projects
- OpenAI Superalignment Fast Grants↗ - $10M grant program
- Alignment grantmaking funding constraints↗ - 2024 funding bottleneck analysis
- Anthropic salary data↗ - Compensation benchmarks
- DeepMind salary data↗ - Compensation benchmarks
| Name | Primary Focus | Key Assumption | Timeline View | Theory of Change |
|---|---|---|---|---|
| Anthropic (Constitutional AI) | Train models to be helpful, harmless, honest via AI feedback | Can instill values through training at scale | Short (2026-2030) | Build safest frontier models → demonstrate safe scaling → set industry standard |
| Anthropic (Interpretability) | Understand model internals to verify alignment | Neural networks have interpretable structure | Short-medium | Mechanistic understanding → can verify goals → detect deception |
| OpenAI (Superalignment) [dissolved] | Use weaker AI to align stronger AI | Weak-to-strong generalization works | Short (2027-2030) | Bootstrap alignment from current models to future models |
| DeepMind (Scalable Oversight) | Debate, recursive reward modeling, process supervision | Can scale human oversight to superhuman AI | Medium (2030-2040) | Better evaluation → catch misalignment → iterate to safety |
| ARC (Eliciting Latent Knowledge) | Get AI to report what it actually knows | AI has latent knowledge that could be extracted | Medium | Solve ELK → detect deception → verify alignment |
| Redwood (AI Control) | Maintain control even with misaligned AI | Don't need alignment, just control | Near-term focus | Untrusted AI can be safely used with proper protocols |
| MIRI (Agent Foundations) | Fundamental theory of agency and goals | Need conceptual breakthroughs first | Uncertain, possibly insufficient time | Deep understanding → know what to build → align by design |
| SSI (Sutskever) | Safe superintelligence from ground up | Fresh start enables better safety integration | Long-term (superintelligence focus) | Build safety into capabilities from start → achieve safe superintelligence |
| Name | Solves Inner Alignment? | Scales to Superhuman? | Works Without Full Alignment? | Empirically Testable Now? |
|---|---|---|---|---|
| Constitutional AI | Uncertain [medium] | Unknown [medium] | No [low] | Yes [high] |
| Interpretability | Could help verify [medium] | Unknown [medium] | Helps detect issues [medium] | Yes [high] |
| Scalable Oversight | No [low] | Designed to [high] | Maybe [medium] | Partially [medium] |
| ELK | Would help detect [medium] | Intended to [medium] | No [low] | Limited [low] |
| AI Control | No (bypasses it) [low] | Probably not [low] | Yes (the point) [high] | Yes [high] |
| Agent Foundations | Aims to [medium] | Would if successful [medium] | No [low] | No [low] |
| SSI (Sutskever) | Unknown (stealth) [medium] | Primary goal [high] | Unknown [medium] | No (stealth) [low] |
AI Transition Model Context
Research agenda diversification improves the AI Transition Model across multiple factors:
| Factor | Parameter | Impact |
|---|---|---|
| Misalignment Potential | Alignment Robustness | Portfolio approach hedges against any single methodology failing |
| Misalignment Potential | Safety-Capability Gap | Multiple research paths increase probability of keeping pace with capabilities |
| Transition Turbulence | Racing Intensity | Coordinated safety investment across labs reduces competitive pressure to cut corners |
The distribution of research funding across agendas significantly affects the probability of achieving safe AI development across different scenarios.