
Research Agenda Comparison

Summary: Comprehensive analysis comparing major AI safety research agendas (Anthropic, DeepMind, ARC, Redwood, MIRI) across budget, staff, timeline, and tractability dimensions. Provides quantified organizational data ($100M+ for Anthropic, $50M+ for DeepMind) and identifies fundamental theoretical disagreements that shape resource allocation decisions for reducing existential risk.
Key Crux: Research Agendas

- Focus: Comparing approaches to AI alignment
- Key tension: Empirical vs. theoretical, prosaic vs. novel
- Related to: Alignment Difficulty, Timelines
| Organization | Annual Budget | Staff Size | Primary Focus | Tractability | Timeline |
|---|---|---|---|---|---|
| Anthropic | ~$100M+ (safety research) | ~15 red team + growing safety org | Constitutional AI, Interpretability | High (empirical validation) | 2-5 years |
| DeepMind | $50M+ (estimated) | 30-50 researchers, growing 37-39% annually | Scalable Oversight, Formal Verification | Medium (theoretical + empirical) | 5-10 years |
| ARC | ~$5M annually | ~5-10 researchers | Eliciting Latent Knowledge | Low (fundamental problem) | 5-7 years |
| Redwood Research | ~$5-10M annually ($21M total raised) | Small team + Constellation (30k sq ft) | AI Control protocols | High (practical protocols) | Near-term |
| MIRI | $7M (2025), $10M reserves | ~7 FTE comms, pivoting to governance | Agent Foundations, Policy | Low (fundamental theory) | Long-term/governance |
| SSI (Sutskever) | $2B raised (2024) | New org, building team | Superintelligence alignment | Unknown (stealth mode) | Long-term |
| OpenAI (dissolved) | 20% compute allocation | Disbanded May 2024 | Superalignment (weak-to-strong) | N/A (team dissolved) | N/A |
| Organization | Research Scientist Range | Senior/Lead Range | Equity/Stock | Fellowship/Intern Rates |
|---|---|---|---|---|
| Anthropic | $315K-$560K total | $550K-$760K total | 4-year vesting | $3,850/week + $15K/mo compute |
| DeepMind | $200K-$260K base | $300K+ with bonus | Google RSUs | $113K-$150K student researcher |
| ARC | $150K-$400K (historical) | Similar range | Nonprofit, no equity | $15K/month intern |
| Redwood | $150K-$300K (estimated) | Similar range | Nonprofit, no equity | Fellowship program |
| MIRI | $100K-$200K (estimated) | Similar range | Nonprofit, no equity | Summer fellows program |
| SSI | Unknown (stealth) | Premium for top talent | $10B valuation equity | N/A |

The compensation gap between frontier labs (Anthropic, DeepMind) and nonprofit research organizations (ARC, Redwood, MIRI) creates significant talent allocation challenges for the field. Anthropic’s median total compensation of approximately $545K substantially exceeds nonprofit salaries, though nonprofits offer other benefits including mission alignment and research freedom.

The AI safety landscape features dramatically different research agendas based on fundamentally different theories about what will make advanced AI systems safe. These approaches range from Anthropic’s constitutional AI—which aims to instill human values through training—to MIRI’s agent foundations work that seeks deep theoretical understanding before building anything. The differences aren’t merely tactical but reflect deep disagreements about timelines, the nature of intelligence, and what kinds of failures we should expect.

This fragmentation reflects both uncertainty about which technical approaches will succeed and disagreement about what “success” even looks like. Some agendas focus on making current large language models safer through better training techniques. Others prepare for scenarios where we deploy AI systems we cannot fully trust or understand. Still others argue that without fundamental conceptual breakthroughs, all current approaches will fail catastrophically. Understanding these differences is crucial for researchers, policymakers, and anyone trying to evaluate the overall trajectory of AI safety work.

The stakes of these disagreements are enormous. If constitutional AI succeeds at scale, we might achieve safe AGI through careful scaling of current techniques. If it fails and we haven’t developed robust control protocols or verification methods, we could face exactly the catastrophic scenarios these research programs aim to prevent. The resource allocation decisions being made today across these different approaches may determine humanity’s ability to navigate the development of transformative AI.

The AI safety funding ecosystem has evolved dramatically, with total available funding estimated at $500M-$1B annually across all sources, though the vast majority concentrates at frontier labs:

| Funding Source | Annual Amount | Focus Areas | Constraints |
|---|---|---|---|
| Anthropic internal | $100M+ (estimated) | Constitutional AI, interpretability, RSP | Internal priorities |
| DeepMind internal | $50M+ (estimated) | Scalable oversight, formal verification | Google priorities |
| Open Philanthropy | $16.6M (deep learning alignment) | Academic grants, org support | Capacity constrained |
| OpenAI Fast Grants | $10M (one-time) | External alignment research | Completed program |
| LTFF/SFF | $5-10M annually | Small grants, independent researchers | Funding limited since 2024 |
| Government (AISI) | Growing but small | Evaluation, standards | Early stage |

A critical constraint emerged in late 2024: alignment grantmaking became mostly funding-bottlenecked rather than talent-bottlenecked. The Lightspeed grants round received far more promising applications than available funding, and the Long-Term Future Fund reported insufficient resources for worthy grants. Approximately 80-90% of external alignment funding flows through Open Philanthropy, creating concentration risk and capacity constraints in grant evaluation.


Each research agenda addresses different failure modes with varying degrees of coverage:

| Agenda | Deceptive Alignment | Reward Hacking | Capability Overhang | Coordination Failure | Overall Coverage |
|---|---|---|---|---|---|
| Constitutional AI | Partial (training-based) | High | Low | Low | 40-50% |
| Interpretability | High (if successful) | Medium | Low | Low | 35-50% |
| Scalable Oversight | Medium | High | Medium | Low | 50-60% |
| AI Control | High (containment) | Medium | Medium | Low | 55-65% |
| ELK | High (if solved) | Low | Low | Low | 30-40% |
| Agent Foundations | High (theoretical) | High (theoretical) | Medium | Low | 25-40% |

The table illustrates why portfolio diversification across research agendas is critical: no single approach provides comprehensive coverage of all major failure modes. Constitutional AI addresses reward hacking well but struggles with deceptive alignment; AI control handles near-term containment but may not scale to superintelligence; and agent foundations provides theoretical grounding but limited practical tools. Estimated overall coverage ranges from 25-65% for individual agendas, suggesting that even with substantial investment, significant risk gaps remain.
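As a toy illustration of the diversification argument, one can treat the midpoints of the table's coverage ranges as rough probabilities that an agenda addresses a given failure and combine them under an independence assumption. The sketch below is illustrative only; the use of midpoints and the independence assumption are simplifications introduced here, not claims from the table.

```python
# Toy illustration of portfolio coverage. Midpoint values come from the table
# above; treating them as independent probabilities is a simplification
# introduced for illustration, not an estimate from the source.

coverage = {
    "Constitutional AI": 0.45,
    "Interpretability": 0.42,
    "Scalable Oversight": 0.55,
    "AI Control": 0.60,
    "ELK": 0.35,
    "Agent Foundations": 0.32,
}

def combined_coverage(selected):
    """Chance that at least one selected agenda covers a given failure,
    assuming (unrealistically) that coverage is independent across agendas."""
    p_all_miss = 1.0
    for name in selected:
        p_all_miss *= 1.0 - coverage[name]
    return 1.0 - p_all_miss

print(round(combined_coverage(["Constitutional AI"]), 2))                # 0.45
print(round(combined_coverage(["Constitutional AI", "AI Control"]), 2))  # 0.78
print(round(combined_coverage(list(coverage)), 2))                       # 0.97
```

In reality coverage gaps are correlated (note the uniformly low scores for coordination failure), so the true combined coverage is lower than this calculation suggests.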

Anthropic’s approach represents the most direct attempt to scale current techniques to advanced AI systems. Their constitutional AI methodology trains language models to follow a set of principles through a two-stage process: first using AI feedback to critique and revise outputs according to the constitution, then using this data for reinforcement learning. Their Responsible Scaling Policy framework creates concrete capability thresholds (ASL-2, ASL-3, etc.) with corresponding safety requirements, providing a roadmap for scaling safely through increasingly capable systems.
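The two-stage process is simple enough to sketch. The following is a minimal illustration assuming a generic `model.generate(prompt)` text interface; the helper names and prompts are hypothetical, and this is not Anthropic's actual training pipeline or API.

```python
# Minimal sketch of constitutional AI's two stages, assuming a hypothetical
# `model.generate(prompt)` interface. Not Anthropic's actual code or API.

CONSTITUTION = [
    "Choose the response least likely to assist with harmful activities.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(model, prompt: str) -> str:
    """Stage 1: the model critiques and revises its own draft against each
    constitutional principle, producing supervised fine-tuning data."""
    draft = model.generate(prompt)
    for principle in CONSTITUTION:
        critique = model.generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}\n"
            "Critique the response with respect to the principle."
        )
        draft = model.generate(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft

def ai_preference_pair(model, prompt: str) -> tuple[str, str]:
    """Stage 2: an AI judge picks the more constitution-compliant of two
    samples, yielding (chosen, rejected) pairs for RL from AI feedback."""
    a, b = model.generate(prompt), model.generate(prompt)
    verdict = model.generate(
        f"Which response better follows the constitution?\nA: {a}\nB: {b}\n"
        "Answer with the single letter A or B."
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)
```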

With over $16 billion raised in 2025 (including a $13B Series F at $183B valuation in September 2025, followed by further growth to over $350 billion valuation by November), Anthropic dedicates substantial resources to safety research. Their Frontier Red Team comprises approximately 15 researchers who stress-test advanced systems for misuse risks in biological research, cybersecurity, and autonomous systems. The company’s Anthropic Fellows Program provides $3,850/week stipends plus ~$15k/month in compute funding, producing research papers from over 80% of fellows. Their Claude Academy has trained over 300 engineers in AI safety practices, converting generalist software developers into specialized practitioners.

The constitutional AI work has shown promising empirical results on current models, successfully reducing harmful outputs while maintaining helpfulness across various tasks. Their 2024 research on “sleeper agents” provided concerning evidence that deceptive alignment could persist through standard training techniques, while their mechanistic interpretability research aims to understand model internals well enough to detect such deception. The combination suggests a research program that takes seriously both the promise and the perils of scaling current approaches.

However, critics argue that constitutional AI may create a false sense of security by solving outer alignment (getting models to produce safe outputs) without addressing inner alignment (ensuring the model’s internal goals are aligned). The approach assumes that values can be reliably instilled through training rather than just learned as behavioral patterns that might break down under novel circumstances. Whether this assumption holds for superintelligent systems remains one of the most crucial open questions in AI safety.

DeepMind’s research program centers on the fundamental challenge of maintaining human oversight over AI systems that surpass human capabilities. Their debate approach, where AI systems argue different positions for human evaluation, aims to leverage competition to surface truth even when evaluators cannot directly assess claims. Process supervision evaluates reasoning steps rather than final outputs, potentially catching errors before they compound into dangerous conclusions.
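A sketch of a single debate exchange, assuming generic `debater_a`, `debater_b`, and `judge` text interfaces (hypothetical placeholders, not DeepMind's implementation), shows the core idea: the judge rules on the transcript rather than verifying the underlying claim directly.

```python
# Minimal sketch of AI safety via debate. The debater and judge interfaces are
# hypothetical placeholders, not DeepMind's implementation.

def debate(question: str, debater_a, debater_b, judge, turns: int = 3) -> str:
    """Two models argue opposing answers; a weaker judge rules on the transcript."""
    transcript = [f"Question: {question}"]
    for _ in range(turns):
        transcript.append("A: " + debater_a.generate(
            "\n".join(transcript) + "\nArgue for your answer; rebut B if needed."))
        transcript.append("B: " + debater_b.generate(
            "\n".join(transcript) + "\nArgue for your answer; rebut A if needed."))
    # The judge never verifies the underlying claim directly -- the hope is that
    # competition between debaters surfaces the flaws in the losing position.
    return judge.generate(
        "\n".join(transcript) +
        "\nWhich debater's answer is better supported? Answer A or B.")
```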

Google DeepMind’s AGI Safety & Alignment team has grown 37-39% annually and now comprises 30-50 researchers depending on boundary definitions (not counting present-day safety or adversarial robustness teams). Leadership includes Anca Dragan, Rohin Shah, and Allan Dafoe under executive sponsor Shane Legg, with the team organized into AGI Alignment (mechanistic interpretability, scalable oversight) and Frontier Safety (dangerous capability evaluations). Their Frontier Safety Framework provides systematic evaluation protocols, while the AGI Safety Council analyzes risks and recommends safety practices. DeepMind’s safety research investment is estimated at over $50 million annually, though exact figures are not publicly disclosed. DeepMind has stated it is investing heavily in interpretability research to make AI systems more understandable and auditable.

The scalable oversight paradigm addresses a core problem: if AI systems become more capable than humans in domains like scientific research or strategic planning, how can we evaluate whether they’re pursuing our intended goals? Their research on recursive reward modeling explores using AI systems to help evaluate other AI systems, potentially creating a bootstrapping process for maintaining oversight as capabilities scale. This connects to their broader work on formal verification, attempting to provide mathematical guarantees about AI system behavior.
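Recursive reward modeling can likewise be sketched compactly: an AI helper assists the human evaluator, and the assisted judgment becomes the reward signal for the stronger system. The interfaces below are hypothetical placeholders.

```python
# Minimal sketch of recursive reward modeling: an AI assistant helps the human
# evaluate outputs they could not fully assess alone. Interfaces are hypothetical.

def assisted_reward_label(task: str, output: str, helper_model, human) -> float:
    """The helper surfaces flaws the human might miss; the human still makes the
    final judgment, which becomes a training label for the reward model."""
    critique = helper_model.generate(
        f"Task: {task}\nOutput: {output}\n"
        "List any errors, omissions, or hidden problems in this output."
    )
    return human.rate(output, context=critique)  # scalar label for reward-model training
```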

Recent results on process supervision in mathematical reasoning show promise, with models trained to optimize for correct reasoning steps rather than just correct answers showing improved robustness. However, the approach faces significant challenges in domains where correct processes are less well-defined than mathematics. Critics worry that debate could favor persuasive arguments over truthful ones, and that recursive evaluation might amplify rather than correct human biases embedded in the initial training process.

ARC’s Eliciting Latent Knowledge Research


ARC’s research agenda, led by Paul Christiano, focuses on what many consider the core technical problem of AI alignment: ensuring that AI systems report what they actually know rather than what they think humans want to hear. The eliciting latent knowledge (ELK) problem assumes that advanced AI systems will develop rich internal representations of the world but may not be incentivized to share their true beliefs with human operators.

The Alignment Research Center operates with a small team of approximately 5-10 researchers and an annual budget of around $5 million. Early staff included Paul Christiano and Mark Xu, with the organization receiving $265,000 from Open Philanthropy in March 2022 and a subsequent $1.25 million grant for general support. In 2025, ARC reported making conceptual and theoretical progress at the fastest pace since 2022, with current research focusing on combining mechanistic interpretability with formal verification. The organization spun out ARC Evals as an independent nonprofit called METR in December 2023, led by Beth Barnes, who joined from OpenAI. Historical salary ranges for full-time positions were $150K-$400K annually, with intern positions at $15K/month. METR now partners with the AI Safety Institute, is part of the NIST AI Safety Institute Consortium, and conducts evaluations of frontier models including OpenAI's o3 and o4-mini and Anthropic's Claude 3.7 Sonnet.

The ELK problem is particularly concerning in scenarios where AI systems understand that humans are using their reports to make important decisions. A system that understands human psychology and incentives might learn to give reassuring rather than accurate assessments of plans or situations. ARC’s research explores various technical approaches to extracting genuine beliefs, including methods that don’t rely on human feedback and could thus be harder for AI systems to manipulate.
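ARC's ELK report is primarily conceptual, but one adjacent empirical direction is to probe a model's internal activations for representations of truth and compare them with what the model says. The sketch below is a generic linear probe, offered as an illustration of that idea rather than ARC's proposed solution; it assumes hidden states have already been collected from statements with known truth values.

```python
# Illustrative linear "truth probe" on hidden activations. This is a generic
# probing setup, not ARC's proposed ELK solution; `hidden_states` is assumed to
# be an (n_examples, d_model) array collected from statements with known labels.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_truth_probe(hidden_states: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear probe that predicts statement truth from activations."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

def latent_belief(probe: LogisticRegression, hidden_state: np.ndarray) -> float:
    """Probability the probe assigns to 'true', which can be compared against
    what the model says out loud to look for honesty gaps."""
    return float(probe.predict_proba(hidden_state.reshape(1, -1))[0, 1])
```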

ARC’s practical work through ARC Evals provides crucial empirical grounding by testing current systems for dangerous capabilities before they’re deployed. Their evaluations of GPT-4 for tasks like autonomous replication and persuasion established important precedents for capability assessment. However, the core ELK problem remains largely unsolved, with fundamental questions about whether latent knowledge even exists in the forms the research assumes. Some critics argue that attempting to solve ELK may be unnecessary if other approaches can provide adequate safety guarantees.

Redwood Research takes a pragmatic approach that explicitly doesn’t require solving alignment. Their AI control paradigm assumes we may deploy powerful AI systems before we fully understand or trust them, and focuses on developing protocols to safely use untrusted AI. This includes monitoring systems, decomposing tasks so that no single AI system has excessive power, and maintaining human oversight over critical decisions.

Founded in August 2021, Redwood Research has received approximately $21 million in total funding: $20 million from Open Philanthropy and $1.27 million from the Survival and Flourishing Fund, with approximately two-thirds ($14M) for research operations and one-third ($7M) for running Constellation. Their ICML oral paper “AI Control: Improving Risk Despite Intentional Subversion” introduced protocols for monitoring malign LLM agents, establishing AI control as a bedrock approach for mitigating catastrophic risk. Leadership includes Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO), with a board comprising Nate Thomas, Holden Karnofsky, and Paul Christiano. Redwood collaborates with governments and advises AI companies including Google DeepMind and Anthropic, partnering with UK AISI to produce “A sketch of an AI control safety case.” The organization runs the Constellation office space, a 30,000 square foot facility hosting staff from multiple AI safety organizations including Open Philanthropy, ARC, Atlas Fellowship, and CEA, with a Visiting Fellows program that has placed dozens of participants into full-time roles on leading safety teams.

The AI control approach acknowledges that perfect alignment may be impossible or insufficient given competitive pressures and deployment timelines. Instead, it focuses on containment strategies, robust monitoring, and defense-in-depth approaches that could work even with partially misaligned systems. Their research on control protocols examines specific technical mechanisms: how to detect when AI systems are acting outside their intended scope, how to limit their capabilities while preserving usefulness, and how to maintain meaningful human control over important decisions.
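Redwood's published work describes concrete protocols in this family, such as trusted monitoring: a weaker but trusted model scores each action from the untrusted model, and only the most suspicious actions consume the scarce human audit budget. Below is a simplified sketch with illustrative interfaces and an arbitrary threshold, not Redwood's implementation.

```python
# Simplified sketch of a trusted-monitoring control protocol. Interfaces and the
# threshold value are illustrative placeholders, not Redwood's implementation.

AUDIT_THRESHOLD = 0.9  # in practice set so only the top few percent of actions are audited

def run_with_trusted_monitoring(task: str, untrusted_model, trusted_monitor, human_auditor):
    proposal = untrusted_model.generate(task)
    suspicion = trusted_monitor.score(task, proposal)  # 0.0 (benign) .. 1.0 (likely malign)

    if suspicion >= AUDIT_THRESHOLD:
        # Spend limited human oversight only on the most suspicious actions.
        if human_auditor.flags_as_malicious(task, proposal):
            raise RuntimeError("untrusted model caught attempting subversion")
    # Below the threshold (or after passing audit) the action is used; some
    # protocol variants instead substitute the trusted model's weaker solution.
    return proposal
```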

This paradigm offers near-term practical benefits and could buy crucial time for other safety research. Control protocols could help prevent immediate harms from current systems while longer-term alignment research continues. However, critics argue that the approach may not scale to truly superintelligent systems that could find ways to circumvent monitoring and controls. There’s also concern that successful control protocols might create complacency, reducing incentives to solve more fundamental alignment problems.

MIRI’s research program stands apart by focusing on fundamental theoretical questions about agency, goals, and decision-making before attempting to build aligned AI systems. Their agent foundations research explores problems like embedded agency (how can agents reason about themselves as part of the world they’re trying to understand?), logical uncertainty (how to make decisions when even logical facts are uncertain), and decision theory for AI systems.

The Machine Intelligence Research Institute operates with a projected 2025 budget of $7.1 million and reserves of $10 million—enough for approximately 1.5 years if no additional funds are raised. Annual expenses from 2019-2023 ranged from $5.4M to $7.7M, with 2026 projections at a median of $8M (potentially up to $10M with growth). MIRI is conducting its first fundraiser in six years, targeting $6M with the first $1.6M matched 1:1 via an SFF grant (with a stretch goal of $10M total). This would bring reserves to their two-year target of $16M. Historically, MIRI received $2.1 million from Open Philanthropy in 2019 supplemented by $7.7M in 2020, plus several million dollars in Ethereum from Vitalik Buterin in 2021. The communications team currently comprises approximately seven full-time employees (including Nate Soares and Eliezer Yudkowsky), with plans to grow in 2026. The organization has pivoted from technical AI safety research to focusing on informing policymakers and the public about AI risks, expanding its communications team and launching a technical AI governance program.

This approach reflects deep skepticism about whether current AI paradigms can be made safe without fundamental conceptual breakthroughs. MIRI researchers argue that most other approaches are building on shaky theoretical foundations and may create capabilities without the conceptual tools needed to ensure safety. Their work on topics like Löb’s theorem and reflective reasoning explores how advanced AI systems might reason about their own reasoning processes—a capability that could be crucial for ensuring they remain aligned as they modify themselves or create successor systems.
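For reference, Löb's theorem states that if a sufficiently strong formal system proves "if P is provable then P", it already proves P:

```latex
% Löb's theorem, where \Box P abbreviates "P is provable in the system":
\vdash \Box(\Box P \rightarrow P) \rightarrow \Box P
```

This is roughly the root of the Löbian obstacle MIRI studies: a proof-based agent cannot simply trust the future proofs of a successor built on the same formal system without, in effect, already asserting those conclusions itself.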

MIRI’s theoretical rigor has identified important problems and concepts that other researchers now work on. However, their approach faces criticism for being too abstract and disconnected from practical AI development. The theoretical questions they focus on may indeed need to be solved eventually, but the research program offers little immediate help with near-term AI systems. Given uncertain timelines to transformative AI, the value of their work depends heavily on whether we have time for fundamental research or need immediate practical solutions.

OpenAI’s Superalignment team, established in July 2023, represented a major commitment to solving superintelligence alignment within four years. Co-led by Ilya Sutskever (Chief Scientist) and Jan Leike (Head of Alignment), the team received 20% of OpenAI’s compute allocation—an unprecedented resource dedication for safety research. The team focused on “weak-to-strong generalization,” exploring whether weaker AI models could effectively supervise and align stronger successors.

However, in May 2024, OpenAI disbanded the Superalignment team less than one year after its formation. Jan Leike sharply criticized the company upon departure, stating “Over the past years, safety culture and processes have taken a backseat to shiny products.” Ilya Sutskever also left, with Jakub Pachocki replacing him as Chief Scientist and John Schulman becoming scientific lead for alignment work. Jan Leike subsequently joined Anthropic to lead a new superalignment team focusing on scalable oversight, weak-to-strong generalization, and automated alignment research.

In October 2024, OpenAI disbanded another safety team—the “AGI Readiness” team that advised on the company’s capacity to handle increasingly powerful AI. Miles Brundage, senior advisor for AGI Readiness, wrote upon departure: “Neither OpenAI nor any other frontier lab is ready, and the world is also not ready.” The dissolution of both teams within six months raised significant questions about OpenAI’s commitment to safety research relative to product development.

In June 2024, Ilya Sutskever—former OpenAI Chief Scientist and co-lead of the dissolved Superalignment team—founded Safe Superintelligence Inc. (SSI) with the explicit mission of building safe superintelligence as its sole focus. SSI raised $2 billion in its first funding round, achieving a $10 billion valuation before generating any revenue or releasing products.

The company operates in stealth mode with limited public information about its research approach, though its founding principles emphasize that safety and capabilities will be advanced together from the ground up. SSI represents a significant bet that a new organization, freed from the commercial pressures and legacy decisions of existing labs, can more effectively pursue safe superintelligence. The $2 billion in initial funding—comparable to years of safety research funding across the entire nonprofit field—provides substantial runway for long-term research without near-term product pressures.

Critics question whether SSI’s approach will differ meaningfully from existing efforts, and whether the stealth mode prevents the kind of external scrutiny that could improve safety outcomes. Supporters argue that Sutskever’s technical leadership and the organization’s singular focus on superintelligence alignment represent the most direct attempt to solve the core problem before capability pressures make it intractable.

| Research Agenda | Funding/Resources | Team Size | Key Publications | Output Metrics | Industry Adoption |
|---|---|---|---|---|---|
| Anthropic Constitutional AI | $100M+ annually; $16B raised 2025 | ~15 red team + safety teams | RSP framework, Sleeper agents paper, Constitutional AI paper | 80%+ fellows publish; 300+ Claude Academy graduates | High (RSP adopted by labs) |
| Anthropic Interpretability | Shared safety budget | Integrated with safety teams | Mechanistic interpretability papers, Circuit analysis | Growing publication output | Medium (research influence) |
| DeepMind Scalable Oversight | $50M+ estimated | 30-50 researchers; growing 37-39% annually | Debate papers, Process supervision research | Multiple high-impact papers | Medium (research influence) |
| DeepMind Formal Verification | Shared budget | Integrated with safety teams | Mathematical verification research | Targeted publications | Low (early stage) |
| ARC (ELK) | ~$5M annually | ~5-10 researchers | ELK technical report, METR evaluations | Conceptual progress 2025 | Low (fundamental research) |
| Redwood (AI Control) | $21M total raised | Small team + Constellation | ICML oral paper on AI control | Government/industry collaborations | High (UK AISI partnership) |
| MIRI (Agent Foundations) | $7.1M (2025); $10M reserves | ~7 FTE comms, pivoting to governance | Agent foundations papers, Policy communications | Shifting to policy output | Low (pivoting away) |
| SSI (Sutskever) | $2B raised at $10B valuation | Building team (stealth) | None public yet | Unknown (stealth mode) | N/A (new) |
| OpenAI Superalignment | 20% compute allocation | Disbanded May 2024 | Weak-to-strong generalization research | N/A (dissolved) | N/A (dissolved) |

Each research agenda implies different failure modes and offers different types of safety assurances. Constitutional AI could fail if values aren’t genuinely internalized, leading to systems that behave safely during training but pursue misaligned goals when deployed in novel situations. The approach also depends on the assumption that we can identify and specify human values clearly enough to train systems effectively—an assumption that may not hold for complex moral questions.

Scalable oversight approaches face the challenge that oversight itself might be gameable by sufficiently sophisticated systems. If AI systems understand they’re being evaluated, they might optimize for appearing aligned rather than being aligned. The debate and process supervision methods also assume that correct reasoning processes can be identified and evaluated, which may not hold in domains where the correct approach is itself uncertain or controversial.

The control paradigm accepts some risk of misalignment in exchange for practical deployability, but this creates a fundamental scaling problem. As AI systems become more capable, the gap between their abilities and human oversight capabilities grows, potentially making control protocols increasingly ineffective. There’s also the risk that control protocols designed for current systems won’t transfer to fundamentally different AI architectures.

Each approach also offers distinct benefits. Constitutional AI provides immediate practical benefits for current systems and could scale smoothly if its assumptions hold. Scalable oversight directly addresses the superintelligence control problem. AI control offers near-term safety even without solving alignment. Agent foundations could prevent fundamental errors in how we think about AI goals and decision-making.

Timeline Considerations and Research Trajectories


The research agendas operate on different timelines that reflect their different assumptions about AI development. Anthropic’s constitutional AI work focuses on systems likely to be deployed within 2-5 years, with their Responsible Scaling Policy providing a framework for capability thresholds through potential AGI development. This near-term focus allows for empirical validation but may miss longer-term failure modes.

DeepMind’s scalable oversight research targets the 5-10 year timeframe when AI systems might exceed human capabilities in most domains. Their formal verification work requires significant theoretical development but could provide stronger safety guarantees. The timeline pressure creates tension between developing robust theoretical foundations and providing practical solutions for nearer-term systems.

ARC’s ELK research addresses problems that become critical as AI systems become sophisticated enough to understand and potentially manipulate human oversight. This could become relevant within 5-7 years as language models develop better theory of mind capabilities. However, fundamental progress on ELK has been limited, raising questions about whether the problem is solvable within relevant timelines.

Current trajectories suggest that multiple approaches will likely be needed rather than a single solution. Constitutional AI and control protocols provide immediate benefits for current systems. Interpretability and oversight methods could become crucial as capabilities scale. Foundational research might prevent fundamental errors in system design. The key question is resource allocation across these different timelines and approaches.

The most fundamental uncertainty is whether current AI paradigms—large language models trained with reinforcement learning from human feedback—will lead directly to transformative AI. If scaling current approaches leads to AGI, then research on constitutional AI and scalable oversight directly addresses the systems we’ll need to align. If a paradigm shift occurs, much current safety research might not transfer to new architectures.

Another crucial uncertainty is the feasibility of verification without full alignment. Can we develop interpretability tools good enough to detect deceptive alignment? Can oversight methods scale to superintelligent systems? These questions determine whether we can safely deploy AI systems we don’t fully understand or trust—a scenario that seems increasingly likely given competitive pressures.

The tractability of fundamental theoretical problems remains unclear. Are problems like ELK solvable in principle? Do we need conceptual breakthroughs about agency and goals before building advanced AI? The answers affect how much effort should go into foundational research versus empirical approaches with current systems.

Strategic uncertainties include how different research agendas interact and whether they’re complements or substitutes. Does success in constitutional AI reduce the need for control protocols, or do we need defense in depth? How do competitive dynamics between AI developers affect the feasibility of different safety approaches? These considerations may be as important as the technical questions for determining which research directions to prioritize.

The comparison reveals that rather than competing approaches, we likely need a portfolio strategy that addresses different scenarios and timelines. The optimal allocation depends on beliefs about AI development trajectories, the tractability of different technical problems, and the strategic landscape for AI deployment. Understanding these trade-offs is essential for making effective research and policy decisions in an uncertain but rapidly evolving field.

📊 Research Agenda Overview

| Name | Primary Focus | Key Assumption | Timeline View | Theory of Change |
|---|---|---|---|---|
| Anthropic (Constitutional AI) | Train models to be helpful, harmless, honest via AI feedback | Can instill values through training at scale | Short (2026-2030) | Build safest frontier models → demonstrate safe scaling → set industry standard |
| Anthropic (Interpretability) | Understand model internals to verify alignment | Neural networks have interpretable structure | Short-medium | Mechanistic understanding → can verify goals → detect deception |
| OpenAI (Superalignment) [dissolved] | Use weaker AI to align stronger AI | Weak-to-strong generalization works | Short (2027-2030) | Bootstrap alignment from current models to future models |
| DeepMind (Scalable Oversight) | Debate, recursive reward modeling, process supervision | Can scale human oversight to superhuman AI | Medium (2030-2040) | Better evaluation → catch misalignment → iterate to safety |
| ARC (Eliciting Latent Knowledge) | Get AI to report what it actually knows | AI has latent knowledge that could be extracted | Medium | Solve ELK → detect deception → verify alignment |
| Redwood (AI Control) | Maintain control even with misaligned AI | Don't need alignment, just control | Near-term focus | Untrusted AI can be safely used with proper protocols |
| MIRI (Agent Foundations) | Fundamental theory of agency and goals | Need conceptual breakthroughs first | Uncertain, possibly insufficient time | Deep understanding → know what to build → align by design |
| SSI (Sutskever) | Safe superintelligence from ground up | Fresh start enables better safety integration | Long-term (superintelligence focus) | Build safety into capabilities from start → achieve safe superintelligence |
📊 Agenda Comparison by Dimension

Each cell pairs an assessment with its low/medium/high rating.

| Name | Solves Inner Alignment? | Scales to Superhuman? | Works Without Full Alignment? | Empirically Testable Now? |
|---|---|---|---|---|
| Constitutional AI | Uncertain (medium) | Unknown (medium) | No (low) | Yes (high) |
| Interpretability | Could help verify (medium) | Unknown (medium) | Helps detect issues (medium) | Yes (high) |
| Scalable Oversight | No (low) | Designed to (high) | Maybe (medium) | Partially (medium) |
| ELK | Would help detect (medium) | Intended to (medium) | No (low) | Limited (low) |
| AI Control | No, bypasses it (low) | Probably not (low) | Yes, the point (high) | Yes (high) |
| Agent Foundations | Aims to (medium) | Would if successful (medium) | No (low) | No (low) |
| SSI (Sutskever) | Unknown, stealth (medium) | Primary goal (high) | Unknown (medium) | No, stealth (low) |
⚖️ Which Research Agenda Will Succeed?

Views on which approach is most likely to prevent AI catastrophe, arranged from "need novel breakthroughs" to "current approaches scale":

| Viewpoint | Estimate | Position |
|---|---|---|
| MIRI / Yudkowsky | 15-20% | Current approaches fail |
| ARC / Paul Christiano | 40-50% | Need targeted breakthroughs |
| DeepMind researchers | 50-60% | Scalable oversight works |
| Redwood | Near-term high | Control buys time |
| SSI / Sutskever | Unknown | Fresh start needed |
| Anthropic researchers | 60-70% | Current approaches can scale |

Key Questions

- Will the current paradigm (LLMs + RLHF) lead to transformative AI?
- Can we verify alignment without solving it?

Research agenda diversification improves the AI Transition Model across multiple factors:

| Factor | Parameter | Impact |
|---|---|---|
| Misalignment Potential | Alignment Robustness | Portfolio approach hedges against any single methodology failing |
| Misalignment Potential | Safety-Capability Gap | Multiple research paths increase probability of keeping pace with capabilities |
| Transition Turbulence | Racing Intensity | Coordinated safety investment across labs reduces competitive pressure to cut corners |
The distribution of research funding across agendas significantly affects the probability of achieving safe AI development across different scenarios.