Research Agenda Comparison
Quick Assessment
| Organization | Annual Budget | Staff Size | Primary Focus | Tractability | Timeline |
|---|---|---|---|---|---|
| Anthropic | ~$100M+ (safety research) | ~15 red team + growing safety org | Constitutional AI, Interpretability | High (empirical validation) | 2-5 years |
| DeepMind | $50M+ (estimated) | 30-50 researchers, growing 37-39% annually | Scalable Oversight, Formal Verification | Medium (theoretical + empirical) | 5-10 years |
| ARC | ~$5M annually | ~5-10 researchers | Eliciting Latent Knowledge | Low (fundamental problem) | 5-7 years |
| Redwood Research | ~$5-10M annually ($21M total raised) | Small team + Constellation (30k sqft) | AI Control protocols | High (practical protocols) | Near-term |
| MIRI | $7M (2025), $10M reserves | ~7 FTE comms, pivoting to governance | Agent Foundations, Policy | Low (fundamental theory) | Long-term/governance |
| SSI (Sutskever) | $2B raised (2024) | New org, building team | Superintelligence alignment | Unknown (stealth mode) | Long-term |
| OpenAI (dissolved) | 20% compute allocation | Disbanded May 2024 | Superalignment (weak-to-strong) | N/A (team dissolved) | N/A |
Researcher Compensation Comparison
| Organization | Research Scientist Range | Senior/Lead Range | Equity/Stock | Fellowship/Intern Rates |
|---|---|---|---|---|
| Anthropic | $315K-$560K total | $550K-$760K total | 4-year vesting | $3,850/week + $15K/mo compute |
| DeepMind | $200K-$260K base | $300K+ with bonus | Google RSUs | $113K-$150K student researcher |
| ARC | $150K-$400K (historical) | Similar range | Nonprofit, no equity | $15K/month intern |
| Redwood | $150K-$300K (estimated) | Similar range | Nonprofit, no equity | Fellowship program |
| MIRI | $100K-$200K (estimated) | Similar range | Nonprofit, no equity | Summer fellows program |
| SSI | Unknown (stealth) | Premium for top talent | $10B valuation equity | N/A |
The compensation gap between frontier labs (Anthropic, DeepMind) and nonprofit research organizations (ARC, Redwood, MIRI) creates significant talent allocation challenges for the field. Anthropic’s median total compensation of approximately $545K substantially exceeds nonprofit salaries, though nonprofits offer other benefits including mission alignment and research freedom.
Overview
The AI safety landscape features dramatically different research agendas based on fundamentally different theories about what will make advanced AI systems safe. These approaches range from Anthropic’s constitutional AI—which aims to instill human values through training—to MIRI’s agent foundations work that seeks deep theoretical understanding before building anything. The differences aren’t merely tactical but reflect deep disagreements about timelines, the nature of intelligence, and what kinds of failures we should expect.
This fragmentation reflects both uncertainty about which technical approaches will succeed and disagreement about what “success” even looks like. Some agendas focus on making current large language models safer through better training techniques. Others prepare for scenarios where we deploy AI systems we cannot fully trust or understand. Still others argue that without fundamental conceptual breakthroughs, all current approaches will fail catastrophically. Understanding these differences is crucial for researchers, policymakers, and anyone trying to evaluate the overall trajectory of AI safety work.
The stakes of these disagreements are enormous. If constitutional AI succeeds at scale, we might achieve safe AGI through careful scaling of current techniques. If it fails and we haven’t developed robust control protocols or verification methods, we could face exactly the catastrophic scenarios these research programs aim to prevent. The resource allocation decisions being made today across these different approaches may determine humanity’s ability to navigate the development of transformative AI.
AI Safety Funding Landscape (2025)
The AI safety funding ecosystem has evolved dramatically, with total available funding estimated at $500M-$1B annually across all sources, though the vast majority concentrates at frontier labs:
| Funding Source | Annual Amount | Focus Areas | Constraints |
|---|---|---|---|
| Anthropic internal | $100M+ (estimated) | Constitutional AI, interpretability, RSP | Internal priorities |
| DeepMind internal | $50M+ (estimated) | Scalable oversight, formal verification | Google priorities |
| Open Philanthropy | $16.6M (deep learning alignment) | Academic grants, org support | Capacity constrained |
| OpenAI Fast Grants | $10M (one-time) | External alignment research | Completed program |
| LTFF/SFF | $5-10M annually | Small grants, independent researchers | Funding limited since 2024 |
| Government (AISI) | Growing but small | Evaluation, standards | Early stage |
A critical constraint emerged in late 2024: alignment grantmaking became mostly funding-bottlenecked rather than talent-bottlenecked. The Lightspeed grants round received far more promising applications than available funding, and the Long-Term Future Fund reported insufficient resources for worthy grants. Approximately 80-90% of external alignment funding flows through Open Philanthropy, creating concentration risk and capacity constraints in grant evaluation.
Research Agenda Risk Assessment
Each research agenda addresses different failure modes with varying degrees of coverage:
| Agenda | Deceptive Alignment | Reward Hacking | Capability Overhang | Coordination Failure | Overall Coverage |
|---|---|---|---|---|---|
| Constitutional AI | Partial (training-based) | High | Low | Low | 40-50% |
| Interpretability | High (if successful) | Medium | Low | Low | 35-50% |
| Scalable Oversight | Medium | High | Medium | Low | 50-60% |
| AI Control | High (containment) | Medium | Medium | Low | 55-65% |
| ELK | High (if solved) | Low | Low | Low | 30-40% |
| Agent Foundations | High (theoretical) | High (theoretical) | Medium | Low | 25-40% |
The table illustrates why portfolio diversification across research agendas is critical: no single approach provides comprehensive coverage of all major failure modes. Constitutional AI addresses reward hacking well but struggles with deceptive alignment; AI control handles near-term containment but may not scale to superintelligence; and agent foundations provides theoretical grounding but limited practical tools. Estimated overall coverage ranges from 25-65% for individual agendas, suggesting that even with substantial investment, significant risk gaps remain.
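To make the portfolio intuition concrete, the sketch below (an illustration added here, not an analysis from any of the organizations discussed) treats each agenda's midpoint coverage estimate from the table as an independent probability of catching a given failure and computes the combined coverage. Independence across agendas is a strong and almost certainly false assumption, so the output is intuition-building only.

```python
# Illustrative only: combine per-agenda "coverage" estimates under an
# (unrealistic) independence assumption to show why portfolios help.

coverage_midpoints = {
    "Constitutional AI": 0.45,   # midpoint of 40-50%
    "Interpretability": 0.425,   # midpoint of 35-50%
    "Scalable Oversight": 0.55,  # midpoint of 50-60%
    "AI Control": 0.60,          # midpoint of 55-65%
    "ELK": 0.35,                 # midpoint of 30-40%
    "Agent Foundations": 0.325,  # midpoint of 25-40%
}

def residual_gap(coverages):
    """Probability that no agenda in the portfolio covers a given failure,
    assuming each agenda catches it independently."""
    gap = 1.0
    for c in coverages:
        gap *= (1.0 - c)
    return gap

best_single = max(coverage_midpoints.values())
portfolio_gap = residual_gap(coverage_midpoints.values())

print(f"Best single agenda coverage: {best_single:.0%}")
print(f"Portfolio coverage (independence assumed): {1 - portfolio_gap:.0%}")
```

Even under this optimistic independence assumption, a nontrivial residual gap remains, which is the point the table is making in a rougher form.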
Core Research Agenda Analysis
Anthropic’s Constitutional AI Paradigm
Anthropic’s approach represents the most direct attempt to scale current techniques to advanced AI systems. Their constitutional AI methodology trains language models to follow a set of principles through a two-stage process: first using AI feedback to critique and revise outputs according to the constitution, then using this data for reinforcement learning. Their Responsible Scaling Policy framework creates concrete capability thresholds (ASL-2, ASL-3, etc.) with corresponding safety requirements, providing a roadmap for scaling safely through increasingly capable systems.
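A minimal sketch of the first, supervised stage of that two-stage process is shown below, with a stand-in `model()` function in place of a real language model; the constitution text, prompts, and function names are illustrative, not Anthropic's implementation.

```python
# Sketch of Constitutional AI stage 1 (critique -> revise), assuming a
# stand-in `model` callable instead of a real LLM API.

CONSTITUTION = [
    "Choose the response that is least likely to help with harmful activities.",
    "Choose the response that is most honest about uncertainty.",
]

def model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string so the sketch runs."""
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(user_prompt: str) -> dict:
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Critique the response below against this principle:\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    # The (prompt, final revision) pairs become training data; the second
    # stage then applies reinforcement learning using AI feedback, as
    # described in the paragraph above.
    return {"prompt": user_prompt, "revised": draft}

example = critique_and_revise("How should I respond to a request for dangerous instructions?")
print(example["revised"])
```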
With over $16 billion raised in 2025↗ (including a $13B Series F at $183B valuation in September 2025, followed by further growth to over $350 billion valuation by November), Anthropic dedicates substantial resources to safety research. Their Frontier Red Team comprises approximately 15 researchers who stress-test advanced systems for misuse risks in biological research, cybersecurity, and autonomous systems. The company’s Anthropic Fellows Program↗ provides $3,850/week stipends plus ~$15k/month in compute funding, producing research papers from over 80% of fellows. Their Claude Academy has trained over 300 engineers in AI safety practices, converting generalist software developers into specialized practitioners.
The constitutional AI work has shown promising empirical results on current models, successfully reducing harmful outputs while maintaining helpfulness across various tasks. Their 2024 research on “sleeper agents” provided concerning evidence that deceptive alignment could persist through standard training techniques, while their mechanistic interpretability research aims to understand model internals well enough to detect such deception. The combination suggests a research program that takes seriously both the promise and the perils of scaling current approaches.
However, critics argue that constitutional AI may create a false sense of security by solving outer alignment (getting models to produce safe outputs) without addressing inner alignment (ensuring the model’s internal goals are aligned). The approach assumes that values can be reliably instilled through training rather than just learned as behavioral patterns that might break down under novel circumstances. Whether this assumption holds for superintelligent systems remains one of the most crucial open questions in AI safety.
DeepMind’s Scalable Oversight Framework
DeepMind’s research program centers on the fundamental challenge of maintaining human oversight over AI systems that surpass human capabilities. Their debate approach, where AI systems argue different positions for human evaluation, aims to leverage competition to surface truth even when evaluators cannot directly assess claims. Process supervision evaluates reasoning steps rather than final outputs, potentially catching errors before they compound into dangerous conclusions.
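A toy rendering of the debate setup is sketched below (the argument generation, judge, and scoring are placeholders, not DeepMind's protocol): two copies of a model argue opposing answers over several rounds, and a weaker judge who cannot evaluate the question directly picks the answer whose defense held up better.

```python
# Toy debate protocol: two debaters argue opposing answers; a judge that
# cannot assess the question directly scores the transcript. All three
# roles are stubs so the sketch runs without a real model.

from typing import List, Tuple

def debater(stance: str, transcript: List[str], round_no: int) -> str:
    """Placeholder for an LLM asked to defend `stance` given the transcript so far."""
    return f"[round {round_no}] argument for '{stance}'"

def judge(transcript: List[str]) -> str:
    """Placeholder judge; a real judge would weigh which arguments survived rebuttal."""
    return "A" if len(transcript) % 2 == 0 else "B"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> Tuple[str, List[str]]:
    transcript = [f"Question: {question}", f"A claims: {answer_a}", f"B claims: {answer_b}"]
    for r in range(1, rounds + 1):
        transcript.append("A: " + debater(answer_a, transcript, r))
        transcript.append("B: " + debater(answer_b, transcript, r))
    return judge(transcript), transcript

winner, log = run_debate("Does this plan have hidden failure modes?", "Yes", "No")
print("Judge sides with:", winner)
```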
Google DeepMind’s AGI Safety & Alignment team↗ has grown 37-39% annually and now comprises 30-50 researchers depending on boundary definitions (not counting present-day safety or adversarial robustness teams). Leadership includes Anca Dragan, Rohin Shah, and Allan Dafoe under executive sponsor Shane Legg, with the team organized into AGI Alignment (mechanistic interpretability, scalable oversight) and Frontier Safety (dangerous capability evaluations). Their Frontier Safety Framework↗ provides systematic evaluation protocols, while the AGI Safety Council analyzes risks and recommends safety practices. DeepMind’s safety research investment is estimated at over $50 million annually, though exact figures are not publicly disclosed. DeepMind has stated it is investing heavily in interpretability research to make AI systems more understandable and auditable.
The scalable oversight paradigm addresses a core problem: if AI systems become more capable than humans in domains like scientific research or strategic planning, how can we evaluate whether they’re pursuing our intended goals? Their research on recursive reward modeling explores using AI systems to help evaluate other AI systems, potentially creating a bootstrapping process for maintaining oversight as capabilities scale. This connects to their broader work on formal verification, attempting to provide mathematical guarantees about AI system behavior.
Recent results on process supervision in mathematical reasoning show promise, with models trained to optimize for correct reasoning steps rather than just correct answers showing improved robustness. However, the approach faces significant challenges in domains where correct processes are less well-defined than mathematics. Critics worry that debate could favor persuasive arguments over truthful ones, and that recursive evaluation might amplify rather than correct human biases embedded in the initial training process.
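The contrast between the two supervision styles can be made concrete with a toy reward function (an illustration only; the step-level grader here is a stub, not the grading setup used in published process-supervision work): outcome supervision rewards only the final answer, so a solution that reaches the right answer through invalid steps is fully rewarded, while process supervision penalizes it.

```python
# Toy contrast between outcome supervision (final answer only) and
# process supervision (every reasoning step). The step grader is a stub.

from dataclasses import dataclass
from typing import List

@dataclass
class Solution:
    steps: List[str]         # chain-of-thought steps
    steps_valid: List[bool]  # stand-in for a step-level grader's verdicts
    final_answer: str

def outcome_reward(sol: Solution, correct_answer: str) -> float:
    return 1.0 if sol.final_answer == correct_answer else 0.0

def process_reward(sol: Solution, correct_answer: str) -> float:
    step_score = sum(sol.steps_valid) / len(sol.steps_valid)
    answer_score = 1.0 if sol.final_answer == correct_answer else 0.0
    return 0.5 * step_score + 0.5 * answer_score

lucky_guess = Solution(
    steps=["assume x = 2 without justification", "therefore the area is 12"],
    steps_valid=[False, False],
    final_answer="12",
)
print(outcome_reward(lucky_guess, "12"))  # 1.0 -- rewarded despite invalid reasoning
print(process_reward(lucky_guess, "12"))  # 0.5 -- penalized for the invalid steps
```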
ARC’s Eliciting Latent Knowledge Research
ARC’s research agenda, led by Paul Christiano, focuses on what many consider the core technical problem of AI alignment: ensuring that AI systems report what they actually know rather than what they think humans want to hear. The eliciting latent knowledge (ELK) problem assumes that advanced AI systems will develop rich internal representations of the world but may not be incentivized to share their true beliefs with human operators.
The Alignment Research Center↗ operates with a small team of approximately 5-10 researchers and an annual budget of around $5 million. Early staff included Paul Christiano and Mark Xu, with the organization receiving $265,000 from Open Philanthropy in March 2022↗ and a subsequent $1.25 million grant for general support. In 2025, ARC reported making conceptual and theoretical progress at the fastest pace since 2022, with current research focusing on combining mechanistic interpretability with formal verification. The organization spun out ARC Evals as an independent nonprofit called METR in December 2023, led by Beth Barnes, who joined from OpenAI. Historical salary ranges for full-time positions were $150k-400k annually, with intern positions at $15k/month. METR now partners with the AI Safety Institute, is part of the NIST AI Safety Institute Consortium, and conducts evaluations of frontier models including OpenAI’s o3 and o4-mini and Claude 3.7 Sonnet.
The ELK problem is particularly concerning in scenarios where AI systems understand that humans are using their reports to make important decisions. A system that understands human psychology and incentives might learn to give reassuring rather than accurate assessments of plans or situations. ARC’s research explores various technical approaches to extracting genuine beliefs, including methods that don’t rely on human feedback and could thus be harder for AI systems to manipulate.
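One concrete line of attack, shown below as an illustration of the general idea rather than ARC's proposed solution, is to train a lightweight probe on a model's internal activations to predict ground-truth facts directly, bypassing what the model chooses to say. The activations here are synthetic stand-ins so the sketch runs on its own.

```python
# Illustrative "latent knowledge" probe: fit a linear classifier on
# hidden-state vectors to predict a ground-truth label, independent of
# what the model *reports*. Activations are synthetic stand-ins.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 500, 64

truth = rng.integers(0, 2, size=n)                                # ground-truth fact labels
signal = rng.normal(size=d)                                       # direction encoding the fact
activations = rng.normal(size=(n, d)) + np.outer(truth, signal)   # stand-in hidden states
stated = np.where(rng.random(n) < 0.3, 1 - truth, truth)          # model's (sometimes false) reports

probe = LogisticRegression(max_iter=1000).fit(activations[:400], truth[:400])
print("probe accuracy on held-out facts:", probe.score(activations[400:], truth[400:]))
print("accuracy of trusting the model's reports:", (stated[400:] == truth[400:]).mean())
```

The hard part of ELK, which this toy setup assumes away, is knowing whether such a probe reads off the model's genuine world-model or merely another predictor of what humans believe.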
ARC’s practical work through ARC Evals provides crucial empirical grounding by testing current systems for dangerous capabilities before they’re deployed. Their evaluations of GPT-4 for tasks like autonomous replication and persuasion established important precedents for capability assessment. However, the core ELK problem remains largely unsolved, with fundamental questions about whether latent knowledge even exists in the forms the research assumes. Some critics argue that attempting to solve ELK may be unnecessary if other approaches can provide adequate safety guarantees.
Redwood Research’s AI Control Approach
Redwood Research takes a pragmatic approach that explicitly doesn’t require solving alignment. Their AI control paradigm assumes we may deploy powerful AI systems before we fully understand or trust them, and focuses on developing protocols to safely use untrusted AI. This includes monitoring systems, decomposing tasks so that no single AI system has excessive power, and maintaining human oversight over critical decisions.
Founded in August 2021, Redwood Research↗ has received approximately $21 million in total funding: $20 million from Open Philanthropy↗ and $1.27 million from the Survival and Flourishing Fund, with approximately two-thirds ($14M) for research operations and one-third ($7M) for running Constellation. Their ICML oral paper “AI Control: Improving Risk Despite Intentional Subversion”↗ introduced protocols for monitoring malign LLM agents, establishing AI control as a bedrock approach for mitigating catastrophic risk. Leadership includes Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO), with a board comprising Nate Thomas, Holden Karnofsky, and Paul Christiano. Redwood collaborates with governments and advises AI companies including Google DeepMind and Anthropic, partnering with UK AISI to produce “A sketch of an AI control safety case.” The organization runs the Constellation office space, a 30,000 square foot facility hosting staff from multiple AI safety organizations including Open Philanthropy, ARC, Atlas Fellowship, and CEA, with a Visiting Fellows program that has placed dozens of participants into full-time roles on leading safety teams.
The AI control approach acknowledges that perfect alignment may be impossible or insufficient given competitive pressures and deployment timelines. Instead, it focuses on containment strategies, robust monitoring, and defense-in-depth approaches that could work even with partially misaligned systems. Their research on control protocols examines specific technical mechanisms: how to detect when AI systems are acting outside their intended scope, how to limit their capabilities while preserving usefulness, and how to maintain meaningful human control over important decisions.
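A minimal sketch of a trusted-monitoring style protocol, in the spirit of the AI control paper (the suspicion scorer and thresholds here are placeholders, not Redwood's implementation): an untrusted model proposes an action, a trusted but weaker monitor scores how suspicious it looks, and sufficiently suspicious actions are escalated to a human auditor or replaced with a trusted model's output.

```python
# Sketch of a trusted-monitoring control protocol: route each proposed
# action based on a suspicion score from a trusted (weaker) monitor.
# Scores and thresholds are placeholders, not a real deployment policy.

AUDIT_THRESHOLD = 0.8    # escalate to human review
DEFER_THRESHOLD = 0.5    # fall back to a trusted model's (weaker) answer

def untrusted_model(task: str) -> str:
    return f"<untrusted solution to: {task}>"

def trusted_model(task: str) -> str:
    return f"<trusted but less capable solution to: {task}>"

def monitor_suspicion(action: str) -> float:
    """Placeholder for a trusted monitor scoring how suspicious an action is."""
    return 0.9 if "rm -rf" in action else 0.1

def control_protocol(task: str) -> str:
    proposal = untrusted_model(task)
    suspicion = monitor_suspicion(proposal)
    if suspicion >= AUDIT_THRESHOLD:
        return f"ESCALATED to human audit: {proposal!r}"
    if suspicion >= DEFER_THRESHOLD:
        return trusted_model(task)   # trade some usefulness for safety
    return proposal                  # low suspicion: use the untrusted output

print(control_protocol("clean up temporary files"))
```

The design question the research then tackles is how well such protocols hold up when the untrusted model is actively trying to slip harmful actions past the monitor.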
This paradigm offers near-term practical benefits and could buy crucial time for other safety research. Control protocols could help prevent immediate harms from current systems while longer-term alignment research continues. However, critics argue that the approach may not scale to truly superintelligent systems that could find ways to circumvent monitoring and controls. There’s also concern that successful control protocols might create complacency, reducing incentives to solve more fundamental alignment problems.
MIRI’s Agent Foundations Program
MIRI’s research program stands apart by focusing on fundamental theoretical questions about agency, goals, and decision-making before attempting to build aligned AI systems. Their agent foundations research explores problems like embedded agency (how can agents reason about themselves as part of the world they’re trying to understand?), logical uncertainty (how to make decisions when even logical facts are uncertain), and decision theory for AI systems.
The Machine Intelligence Research Institute↗ operates with a projected 2025 budget of $7.1 million and reserves of $10 million—enough for approximately 1.5 years if no additional funds are raised. Annual expenses from 2019-2023 ranged from $5.4M to $7.7M, with 2026 projections at a median of $8M (potentially up to $10M with growth). MIRI is conducting its first fundraiser in six years↗, targeting $6M with the first $1.6M matched 1:1 via an SFF grant (with a stretch goal of $10M total). This would bring reserves to their two-year target of $16M. Historically, MIRI received $2.1 million from Open Philanthropy in 2019↗ supplemented by $7.7M in 2020, plus several million dollars in Ethereum from Vitalik Buterin in 2021. The communications team currently comprises approximately seven full-time employees (including Nate Soares and Eliezer Yudkowsky), with plans to grow in 2026. The organization has pivoted from technical AI safety research to focusing on informing policymakers and the public about AI risks, expanding its communications team and launching a technical AI governance program.
This approach reflects deep skepticism about whether current AI paradigms can be made safe without fundamental conceptual breakthroughs. MIRI researchers argue that most other approaches are building on shaky theoretical foundations and may create capabilities without the conceptual tools needed to ensure safety. Their work on topics like Löb’s theorem and reflective reasoning explores how advanced AI systems might reason about their own reasoning processes—a capability that could be crucial for ensuring they remain aligned as they modify themselves or create successor systems.
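For reference, Löb's theorem, which motivates much of this work on reflective reasoning, states that for a theory $T$ extending Peano arithmetic with provability predicate $\Box$:

```latex
% Löb's theorem: if T proves "if P is provable then P", then T proves P outright.
\text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P.
% Internalized form, as a theorem of provability logic (GL):
\Box(\Box P \rightarrow P) \rightarrow \Box P
```

This is the source of the obstacle to an agent wholesale trusting its own or a successor's proofs: a system that asserts "whatever I (or my successor) can prove is true" thereby proves every statement, which is why naive schemes for delegating to self-modified successors break down.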
MIRI’s theoretical rigor has identified important problems and concepts that other researchers now work on. However, their approach faces criticism for being too abstract and disconnected from practical AI development. The theoretical questions they focus on may indeed need to be solved eventually, but the research program offers little immediate help with near-term AI systems. Given uncertain timelines to transformative AI, the value of their work depends heavily on whether we have time for fundamental research or need immediate practical solutions.
OpenAI’s Superalignment (Disbanded)
OpenAI’s Superalignment team, established in July 2023, represented a major commitment to solving superintelligence alignment within four years. Co-led by Ilya Sutskever (Chief Scientist) and Jan Leike (Head of Alignment), the team received 20% of OpenAI’s compute allocation↗—an unprecedented resource dedication for safety research. The team focused on “weak-to-strong generalization,” exploring whether weaker AI models could effectively supervise and align stronger successors.
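The weak-to-strong setup can be illustrated with ordinary classifiers (a loose analogy to the original experiments, not a reproduction): a small "weak supervisor" is trained on true labels, a larger "strong student" is then trained only on the weak model's labels, and the question is how much of the gap to a fully supervised strong model the student recovers.

```python
# Weak-to-strong generalization toy: does a strong student trained only on a
# weak supervisor's labels outperform the supervisor itself? (Loose analogy
# using sklearn models as stand-ins for weak and strong AI systems.)

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])      # weak supervisor
weak_labels = weak.predict(X_train)                                             # noisy supervision

student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)  # strong, weakly supervised
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)      # strong, fully supervised

w, s, c = (m.score(X_test, y_test) for m in (weak, student, ceiling))
print(f"weak supervisor: {w:.3f}  weak-to-strong student: {s:.3f}  strong ceiling: {c:.3f}")
print(f"fraction of the weak-to-ceiling gap recovered: {(s - w) / (c - w):.0%}")
```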
However, in May 2024, OpenAI disbanded the Superalignment team↗ less than one year after its formation. Jan Leike sharply criticized the company upon departure, stating “Over the past years, safety culture and processes have taken a backseat to shiny products.” Ilya Sutskever also left, with Jakub Pachocki replacing him as Chief Scientist and John Schulman becoming scientific lead for alignment work. Jan Leike subsequently joined Anthropic↗ to lead a new superalignment team focusing on scalable oversight, weak-to-strong generalization, and automated alignment research.
In October 2024, OpenAI disbanded another safety team↗—the “AGI Readiness” team that advised on the company’s capacity to handle increasingly powerful AI. Miles Brundage, senior advisor for AGI Readiness, wrote upon departure: “Neither OpenAI nor any other frontier lab is ready, and the world is also not ready.” The dissolution of both teams within six months raised significant questions about OpenAI’s commitment to safety research relative to product development.
Safe Superintelligence Inc. (SSI)
In June 2024, Ilya Sutskever—former OpenAI Chief Scientist and co-lead of the dissolved Superalignment team—founded Safe Superintelligence Inc. (SSI) with the explicit mission of building safe superintelligence as its sole focus. SSI raised $2 billion in its first funding round, achieving a $10 billion valuation before generating any revenue or releasing products.
The company operates in stealth mode with limited public information about its research approach, though its founding principles emphasize that safety and capabilities will be advanced together from the ground up. SSI represents a significant bet that a new organization, freed from the commercial pressures and legacy decisions of existing labs, can more effectively pursue safe superintelligence. The $2 billion in initial funding—comparable to years of safety research funding across the entire nonprofit field—provides substantial runway for long-term research without near-term product pressures.
Critics question whether SSI’s approach will differ meaningfully from existing efforts, and whether the stealth mode prevents the kind of external scrutiny that could improve safety outcomes. Supporters argue that Sutskever’s technical leadership and the organization’s singular focus on superintelligence alignment represent the most direct attempt to solve the core problem before capability pressures make it intractable.
Detailed Research Agenda Comparison
| Research Agenda | Funding/Resources | Team Size | Key Publications | Output Metrics | Industry Adoption |
|---|---|---|---|---|---|
| Anthropic Constitutional AI | $100M+ annually; $16B raised 2025 | ~15 red team + safety teams | RSP framework, Sleeper agents paper, Constitutional AI paper | 80%+ fellows publish; 300+ Claude Academy graduates | High (RSP adopted by labs) |
| Anthropic Interpretability | Shared safety budget | Integrated with safety teams | Mechanistic interpretability papers, Circuit analysis | Growing publication output | Medium (research influence) |
| DeepMind Scalable Oversight | $50M+ estimated | 30-50 researchers; growing 37-39% annually | Debate papers, Process supervision research | Multiple high-impact papers | Medium (research influence) |
| DeepMind Formal Verification | Shared budget | Integrated with safety teams | Mathematical verification research | Targeted publications | Low (early stage) |
| ARC (ELK) | ~$5M annually | ~5-10 researchers | ELK technical report, METR evaluations | Conceptual progress 2025 | Low (fundamental research) |
| Redwood (AI Control) | $21M total raised | Small team + Constellation | ICML oral paper on AI control | Government/industry collaborations | High (UK AISI partnership) |
| MIRI (Agent Foundations) | $7.1M (2025); $10M reserves | ~7 FTE comms, pivoting to governance | Agent foundations papers, Policy communications | Shifting to policy output | Low (pivoting away) |
| SSI (Sutskever) | $2B raised at $10B valuation | Building team (stealth) | None public yet | Unknown (stealth mode) | N/A (new) |
| OpenAI Superalignment | 20% compute allocation | Disbanded May 2024 | Weak-to-strong generalization research | N/A (dissolved) | N/A (dissolved) |
Safety Implications and Failure Modes
Each research agenda implies different failure modes and offers different types of safety assurances. Constitutional AI could fail if values aren’t genuinely internalized, leading to systems that behave safely during training but pursue misaligned goals when deployed in novel situations. The approach also depends on the assumption that we can identify and specify human values clearly enough to train systems effectively—an assumption that may not hold for complex moral questions.
Scalable oversight approaches face the challenge that oversight itself might be gameable by sufficiently sophisticated systems. If AI systems understand they’re being evaluated, they might optimize for appearing aligned rather than being aligned. The debate and process supervision methods also assume that correct reasoning processes can be identified and evaluated, which may not hold in domains where the correct approach is itself uncertain or controversial.
The control paradigm accepts some risk of misalignment in exchange for practical deployability, but this creates a fundamental scaling problem. As AI systems become more capable, the gap between their abilities and human oversight capabilities grows, potentially making control protocols increasingly ineffective. There’s also the risk that control protocols designed for current systems won’t transfer to fundamentally different AI architectures.
Each approach also offers distinct benefits. Constitutional AI provides immediate practical benefits for current systems and could scale smoothly if its assumptions hold. Scalable oversight directly addresses the superintelligence control problem. AI control offers near-term safety even without solving alignment. Agent foundations could prevent fundamental errors in how we think about AI goals and decision-making.
Timeline Considerations and Research Trajectories
The research agendas operate on different timelines that reflect their different assumptions about AI development. Anthropic’s constitutional AI work focuses on systems likely to be deployed within 2-5 years, with their Responsible Scaling Policy providing a framework for capability thresholds through potential AGI development. This near-term focus allows for empirical validation but may miss longer-term failure modes.
DeepMind’s scalable oversight research targets the 5-10 year timeframe when AI systems might exceed human capabilities in most domains. Their formal verification work requires significant theoretical development but could provide stronger safety guarantees. The timeline pressure creates tension between developing robust theoretical foundations and providing practical solutions for nearer-term systems.
ARC’s ELK research addresses problems that become critical as AI systems become sophisticated enough to understand and potentially manipulate human oversight. This could become relevant within 5-7 years as language models develop better theory of mind capabilities. However, fundamental progress on ELK has been limited, raising questions about whether the problem is solvable within relevant timelines.
Current trajectories suggest that multiple approaches will likely be needed rather than a single solution. Constitutional AI and control protocols provide immediate benefits for current systems. Interpretability and oversight methods could become crucial as capabilities scale. Foundational research might prevent fundamental errors in system design. The key question is resource allocation across these different timelines and approaches.
Key Technical and Strategic Uncertainties
The most fundamental uncertainty is whether current AI paradigms—large language models trained with reinforcement learning from human feedback—will lead directly to transformative AI. If scaling current approaches leads to AGI, then research on constitutional AI and scalable oversight directly addresses the systems we’ll need to align. If a paradigm shift occurs, much current safety research might not transfer to new architectures.
Another crucial uncertainty is the feasibility of verification without full alignment. Can we develop interpretability tools good enough to detect deceptive alignment? Can oversight methods scale to superintelligent systems? These questions determine whether we can safely deploy AI systems we don’t fully understand or trust—a scenario that seems increasingly likely given competitive pressures.
The tractability of fundamental theoretical problems remains unclear. Are problems like ELK solvable in principle? Do we need conceptual breakthroughs about agency and goals before building advanced AI? The answers affect how much effort should go into foundational research versus empirical approaches with current systems.
Strategic uncertainties include how different research agendas interact and whether they’re complements or substitutes. Does success in constitutional AI reduce the need for control protocols, or do we need defense in depth? How do competitive dynamics between AI developers affect the feasibility of different safety approaches? These considerations may be as important as the technical questions for determining which research directions to prioritize.
The comparison reveals that rather than competing approaches, we likely need a portfolio strategy that addresses different scenarios and timelines. The optimal allocation depends on beliefs about AI development trajectories, the tractability of different technical problems, and the strategic landscape for AI deployment. Understanding these trade-offs is essential for making effective research and policy decisions in an uncertain but rapidly evolving field.
Sources
Anthropic
- Anthropic’s $13 Billion Series F funding announcement↗ - December 2025
- Anthropic Fellows Program for 2026↗ - Applications open
- Introducing the Anthropic Fellows Program↗ - Program overview
DeepMind
- AGI Safety and Alignment at Google DeepMind↗ - Team overview
- Introducing the Frontier Safety Framework↗ - Safety evaluation protocols
OpenAI
- Introducing Superalignment↗ - July 2023
- OpenAI dissolves Superalignment AI safety team↗ - May 2024
- Anthropic hires former OpenAI safety lead Jan Leike↗ - May 2024
- OpenAI disbands AGI Readiness team↗ - October 2024
ARC
- Alignment Research Center website↗ - Organization overview
- ARC’s first technical report: Eliciting Latent Knowledge↗ - Technical paper
- Open Philanthropy grant to ARC↗ - March 2022
Redwood Research
- Redwood Research website↗ - Organization overview
- Open Philanthropy general support grant↗ - Funding information
- AI Control: Improving Risk Despite Intentional Subversion↗ - ICML oral paper
MIRI
- MIRI’s 2025 Fundraiser↗ - Current fundraising campaign
- MIRI’s 2024 End-of-Year Update↗ - Strategic direction
- Open Philanthropy grant to MIRI (2019)↗ - Funding history
Funding Landscape
- Open Philanthropy AI alignment grants↗ - $16.6M for deep learning alignment projects
- OpenAI Superalignment Fast Grants↗ - $10M grant program
- Alignment grantmaking funding constraints↗ - 2024 funding bottleneck analysis
- Anthropic salary data↗ - Compensation benchmarks
- DeepMind salary data↗ - Compensation benchmarks
| Name | Primary Focus | Key Assumption | Timeline View | Theory of Change |
|---|---|---|---|---|
| Anthropic (Constitutional AI) | Train models to be helpful, harmless, honest via AI feedback | Can instill values through training at scale | Short (2026-2030) | Build safest frontier models → demonstrate safe scaling → set industry standard |
| Anthropic (Interpretability) | Understand model internals to verify alignment | Neural networks have interpretable structure | Short-medium | Mechanistic understanding → can verify goals → detect deception |
| OpenAI (Superalignment) [dissolved] | Use weaker AI to align stronger AI | Weak-to-strong generalization works | Short (2027-2030) | Bootstrap alignment from current models to future models |
| DeepMind (Scalable Oversight) | Debate, recursive reward modeling, process supervision | Can scale human oversight to superhuman AI | Medium (2030-2040) | Better evaluation → catch misalignment → iterate to safety |
| ARC (Eliciting Latent Knowledge) | Get AI to report what it actually knows | AI has latent knowledge that could be extracted | Medium | Solve ELK → detect deception → verify alignment |
| Redwood (AI Control) | Maintain control even with misaligned AI | Don't need alignment, just control | Near-term focus | Untrusted AI can be safely used with proper protocols |
| MIRI (Agent Foundations) | Fundamental theory of agency and goals | Need conceptual breakthroughs first | Uncertain, possibly insufficient time | Deep understanding → know what to build → align by design |
| SSI (Sutskever) | Safe superintelligence from ground up | Fresh start enables better safety integration | Long-term (superintelligence focus) | Build safety into capabilities from start → achieve safe superintelligence |
| Name | Solves Inner Alignment? | Scales to Superhuman? | Works Without Full Alignment? | Empirically Testable Now? |
|---|---|---|---|---|
| Constitutional AI | Uncertain [medium] | Unknown [medium] | No [low] | Yes [high] |
| Interpretability | Could help verify [medium] | Unknown [medium] | Helps detect issues [medium] | Yes [high] |
| Scalable Oversight | No [low] | Designed to [high] | Maybe [medium] | Partially [medium] |
| ELK | Would help detect [medium] | Intended to [medium] | No [low] | Limited [low] |
| AI Control | No (bypasses it) [low] | Probably not [low] | Yes (the point) [high] | Yes [high] |
| Agent Foundations | Aims to [medium] | Would if successful [medium] | No [low] | No [low] |
| SSI (Sutskever) | Unknown (stealth) [medium] | Primary goal [high] | Unknown [medium] | No (stealth) [low] |
AI Transition Model Context
Research agenda diversification improves the AI Transition Model across multiple factors:
| Factor | Parameter | Impact |
|---|---|---|
| Misalignment Potential | Alignment Robustness | Portfolio approach hedges against any single methodology failing |
| Misalignment Potential | Safety-Capability Gap | Multiple research paths increase probability of keeping pace with capabilities |
| Transition Turbulence | Racing Intensity | Coordinated safety investment across labs reduces competitive pressure to cut corners |
The distribution of research funding across agendas significantly affects the probability of achieving safe AI development across different scenarios.