Why Alignment Might Be Easy
The Tractable Alignment Thesis
Summary of Evidence for Tractability
| Argument | Key Evidence | Confidence | Primary Sources |
|---|---|---|---|
| RLHF Working | 29-41% improvement in human preference alignment over standard methods | High | ICLR 2025↗, CVPR 2024↗ |
| Constitutional AI | Reduced bias across 9 social dimensions; comparable helpfulness to standard RLHF | High | Anthropic CCAI, FAccT 2024↗ |
| Interpretability | Millions of interpretable features extracted from Claude 3 Sonnet | Medium | Anthropic, May 2024↗ |
| Scalable Oversight | Debate outperforms consultancy; strong debaters increase judge accuracy | Medium | ICML 2024↗ |
| AI Control | Trusted editing achieves 92% safety while maintaining 94% of GPT-4 performance | Medium | Redwood Research, 2024↗ |
| Natural Alignment | LLMs show high human value alignment in empirical comparisons | Medium | Cross-cultural study, 2024↗ |
| Economic Incentives | Market rewards helpful, harmless AI; liability drives safety investment | Medium | Industry evidence |
Central Claim: Creating AI systems that reliably do what we want is a tractable engineering problem. Current techniques (RLHF, Constitutional AI, interpretability) are working and will likely scale to advanced AI.
This page presents the strongest case for why alignment might be easier than pessimists fear.
The Core Optimism: A Framework
What Makes Engineering Problems Tractable?
As Jan Leike argues↗, problems become much more tractable once you have iteration set up: a basic working system (even if it barely works at first) and a proxy metric that tells you whether changes are improvements. This creates a feedback loop for gaining information from reality.
| Criterion | Status for AI Alignment | Evidence |
|---|---|---|
| Iteration possible | Present | RLHF iterates on human feedback; each model generation improves |
| Clear feedback | Present | User satisfaction, harmlessness metrics, benchmark performance |
| Measurable progress | Present | GPT-2 to GPT-4 shows consistent alignment improvement |
| Economic alignment | Present | Market rewards helpful, safe AI; liability drives safety |
| Multiple approaches | Developing | RLHF, Constitutional AI, interpretability, debate, AI control |
Alignment meets all five criteria, or could with proper effort. Paul Christiano describes↗ alignment as “this sort of well-scoped problem that I’m optimistic we can totally solve.”
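Leike's criterion can be made concrete with a toy sketch. The loop below hill-climbs a stand-in "system" against a noisy proxy metric; `proxy_metric` and `propose_change` are hypothetical placeholders for real evaluations (e.g., preference win rates) and real interventions (new data, training changes), not anyone's actual pipeline.

```python
# Toy sketch of "iteration against a proxy metric" (illustrative only).
# Both helpers are hypothetical stand-ins for real evaluations and changes.
import random

def proxy_metric(system: float) -> float:
    # Imperfect but informative signal, e.g. a harmlessness or win-rate eval.
    return -(system - 3.0) ** 2 + random.gauss(0, 0.05)

def propose_change(system: float) -> float:
    # Candidate tweak: new data, a new training trick, a new safeguard.
    return system + random.gauss(0, 0.3)

system = 0.0
best = proxy_metric(system)
for _ in range(500):                      # the feedback loop: try, measure, keep
    candidate = propose_change(system)
    score = proxy_metric(candidate)
    if score > best:                      # keep only changes the metric rewards
        system, best = candidate, score
```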
Argument 1: Current Techniques Are Working
Thesis: RLHF and related methods have dramatically improved AI alignment, and this progress is continuing.
1.1 The RLHF Success Story
Empirical track record:
Recent research has validated RLHF effectiveness with quantified improvements. According to ICLR 2025 research↗, conditional PM RLHF demonstrates a 29% to 41% improvement in alignment with human preferences compared to standard RLHF. The RLHF-V research at CVPR 2024↗ shows hallucination reduction of 13.8 relative points and hallucination rate reduction of 5.9 points.
| Dimension | Pre-RLHF (GPT-3) | Post-RLHF (GPT-4/Claude) | Improvement |
|---|---|---|---|
| Helpfulness | Often unhelpful, no user intent modeling | Dramatically more helpful, understands intent | Very Large |
| Harmlessness | Frequently harmful outputs | Refuses harmful requests, avoids bias | Large |
| Honesty | Confident confabulation | Admits uncertainty, seeks accuracy | Moderate |
| User satisfaction | Raw, unpredictable | ChatGPT/Claude widely adopted | Very Large |
| Factual accuracy | 87% on LLaVA-Bench | 96% with Factually Augmented RLHF | +9 points |
This was achieved with:
- Relatively simple technique (reinforcement learning from human feedback)
- Modest amounts of human oversight (tens of thousands of ratings, not billions)
- No fundamental breakthroughs required
Research quantifying how closely models’ values match human values found high alignment, especially in relative terms, indicating that current alignment techniques (e.g., RLHF and constitutional AI) are instilling broadly human-compatible priorities in LLMs.
Implication: If this much progress came from relatively simple techniques, more sophisticated methods should work even better.
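The preference-learning core of RLHF can be sketched in a few lines. Below is a minimal, illustrative reward-model training loop using the standard Bradley-Terry pairwise objective on random toy embeddings; real systems fine-tune a large transformer on human preference pairs and then optimize the policy against the learned reward. The tiny network and synthetic data here are assumptions for illustration only.

```python
# Minimal sketch of the reward-modelling step in RLHF (illustrative only).
# Assumes fixed-size embeddings for each response -- a simplification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Tiny scorer over response embeddings (real models are transformers).
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the human-preferred response should score higher.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
for step in range(200):                      # toy loop over random "preference pairs"
    chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
    loss = preference_loss(rm, chosen, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```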
1.2 Constitutional AI Scales Further
The technique: Train AI to follow principles using self-critique and AI feedback. Anthropic’s Constitutional AI research↗ involves both a supervised learning phase and a reinforcement learning phase, training harmless AI through self-improvement without human labels identifying harmful outputs.
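A hedged sketch of the critique-and-revise loop behind the supervised phase: the model critiques its own answer against written principles and revises it, and the revisions become fine-tuning data. The `ask_model` helper and the two principles below are illustrative stand-ins, not Anthropic's actual API or constitution.

```python
# Hedged sketch of Constitutional AI's supervised phase: self-critique and
# revision against written principles. `ask_model` is a hypothetical stand-in
# for a language-model call; the principles are paraphrased examples.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Remove anything dangerous, deceptive, or toxic from the response.",
]

def ask_model(prompt: str) -> str:
    # Placeholder, not a real API: stands in for any LLM call.
    return f"<model output for: {prompt[:40]}...>"

def self_revise(user_prompt: str) -> str:
    response = ask_model(user_prompt)
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Principle: {principle}\nResponse: {response}\nCritique the response.")
        response = ask_model(
            f"Critique: {critique}\nOriginal: {response}\nRevise the response accordingly.")
    return response  # (prompt, revised response) pairs become supervised training data

example = self_revise("How do I pick a strong password?")
```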
Anthropic published Collective Constitutional AI (CCAI)↗ at FAccT 2024, demonstrating a practical pathway toward publicly informed development of language models:
| Metric | CCAI Model | Standard Model | Finding |
|---|---|---|---|
| Helpfulness | Equal | Equal | No significant difference in Elo scores |
| Harmlessness | Equal | Equal | No significant difference in Elo scores |
| Bias (BBQ evaluation) | Lower | Higher | CCAI reduces bias across 9 social dimensions |
| Political ideology | Similar | Similar | No significant difference on OpinionQA |
| Contentious topics | Positive reframing | Refusal | CCAI tends to reframe rather than refuse |
Advantages over pure RLHF:
- Less human labor required
- More scalable (AI helps evaluate AI)
- Can encode complex principles
- Can improve with better AI (positive feedback loop)
- Enables democratic input on AI behavior
Implication: Alignment can scale with AI capabilities (don’t need exponentially more human effort).
1.3 Interpretability Is Making Progress
In May 2024, Anthropic published “Scaling Monosemanticity”↗, demonstrating that sparse autoencoders can extract interpretable features from large models. After initial concerns that the method might not scale to state-of-the-art transformers, Anthropic successfully applied the technique to Claude 3 Sonnet, finding tens of millions of “features”---combinations of neurons that relate to semantic concepts.
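The core method is a dictionary-learning step that is easy to sketch. The toy sparse autoencoder below trains on random vectors standing in for model activations; this is illustrative scale only, while the published work operates on transformer residual-stream activations with millions of learned features.

```python
# Toy sparse-autoencoder sketch: reconstruct activations through an
# overcomplete, sparsity-penalized feature layer. Random data stands in
# for real transformer activations (illustrative scale only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))     # sparse, non-negative codes
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, 512)                         # stand-in for model activations
for _ in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```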
| Discovery | Description | Safety Relevance |
|---|---|---|
| Feature extraction | Millions of interpretable features from frontier LLMs | Enables understanding of model cognition |
| Concept correspondence | Features map to cities, people, scientific topics, security vulnerabilities | Verifiable meaning in internal representations |
| Behavioral manipulation | Can steer models using discovered features (e.g., “Golden Gate Bridge” vector) | Path to targeted safety interventions |
| Scaling laws | Techniques scale to larger models; sparse autoencoders generalize | Method applicable to future systems |
| Safety features | Features related to deception, sycophancy, bias, dangerous content observed | Direct safety monitoring potential |
Implications:
- Might be able to “read” what AI is thinking
- Could detect misaligned goals
- Could edit out dangerous behaviors
- Path to verification might exist
Trajectory: If progress continues, might achieve mechanistic understanding of frontier models. However, some researchers note↗ that practical applications to safety-critical tasks remain to be demonstrated against baselines.
1.4 Red Teaming Finds and Fixes Issues
The process:
- Give the AI to a red team (adversarial testers)
- Find failure modes
- Train AI to avoid those failures
- Repeat
Results:
- Anthropic, OpenAI, and others have found thousands of failures
- Each generation has fewer failures
- Process is improving with better tools
Implication: Iterative improvement is working. Each version is safer than the last.
Argument 2: AI Is Naturally Aligned (to Some Degree)
Thesis: AI trained on human data naturally absorbs human values and preferences.
2.1 Value Learning Is Implicit
Observation: LLMs trained to predict text learn:
- Human preferences
- Social norms
- Moral reasoning
- Common sense
Why this happens:
- Human text encodes human values
- To predict human writing, must model human preferences
- Values are implicit in language use
Example: GPT-4 knows:
- Killing is generally wrong
- Honesty is generally good
- Helping others is valued
- Fairness matters to humans
It learned this from data, not explicit programming.
Implication: Advanced AI trained on human data might naturally align with human values.
2.2 Helpful Harmless Honest Emerges Naturally
Pattern: Even before explicit alignment training, LLMs show:
- Desire to be helpful (answer questions, assist users)
- Avoidance of harm (reluctant to help with dangerous activities)
- Truth-seeking (try to give accurate information)
Why:
- These are patterns in training data
- Helpful assistants are more represented than unhelpful ones
- Harm avoidance is common in human writing
- Accurate information is more common than lies
Implication: Alignment might be the default for AI trained on human data.
2.3 The Cooperative AI Hypothesis
Argument: Advanced AI will naturally cooperate with humans
Reasoning:
- Humans are social species that evolved to cooperate
- Human culture emphasizes cooperation
- AI learning from human data learns cooperative patterns
- Cooperation is instrumentally useful in human society
Evidence:
- Current LLMs are cooperative by default
- Try to understand user intent
- Accommodate corrections
- Negotiate rather than demand
Implication: Might not need to explicitly engineer cooperation—it emerges from training.
2.4 The Shard Theory Perspective
Theory (Quintin Pope and collaborators↗): Human values are “shards”---contextually activated, behavior-steering computations in neural networks (biological and artificial). The primary claim is that agents are best understood as being composed of shards: contextual computations that influence decisions and are downstream of historical reinforcement events.
Applied to AI:
- AI learns value shards from data (help users, avoid harm, be truthful)
- These shards are approximately aligned with human values
- Don’t need perfect value specification---shards are good enough
- Shards are robust across contexts (based on diverse training data)
- Large neural nets can possess quite intelligent constituent shards
Implication: Alignment is about learning shards, not specifying a complete utility function. This is much easier. As Pope argues, understanding how human values arise computationally helps us understand how AI values might be shaped similarly.
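To make the shard picture concrete, here is a toy cartoon of contextually activated "shards" bidding on an action. It is a loose illustration of the concept, not an implementation of shard theory or a claim about real network internals; all names and weights are made up.

```python
# Toy illustration of shards as context-gated computations that bid on actions.
# A cartoon of the concept only; names, triggers, and weights are invented.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Shard:
    name: str
    trigger: Callable[[str], bool]          # is this shard activated by the context?
    bids: Dict[str, float]                  # how strongly it pushes each action

shards: List[Shard] = [
    Shard("help-user", lambda ctx: "question" in ctx, {"answer": 2.0, "refuse": -1.0}),
    Shard("avoid-harm", lambda ctx: "dangerous" in ctx, {"answer": -3.0, "refuse": 2.5}),
    Shard("be-honest", lambda ctx: True, {"hedge": 0.5}),
]

def decide(context: str) -> str:
    totals: Dict[str, float] = {}
    for shard in shards:
        if shard.trigger(context):          # only active shards influence the decision
            for action, weight in shard.bids.items():
                totals[action] = totals.get(action, 0.0) + weight
    return max(totals, key=totals.get)

print(decide("question about cooking"))     # -> "answer"
print(decide("dangerous question"))         # -> "refuse"
```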
Argument 3: The Hard Problems Have Tractable Solutions
Thesis: Each supposedly “hard” alignment problem has promising solution approaches.
3.1 Specification Problem → Value Learning
Instead of specifying what we want (hard):
- Learn what we want from behavior (easier)
Techniques:
- Inverse Reinforcement Learning: Infer values from observed actions
- RLHF: Learn values from preference comparisons
- Imitation learning: Copy demonstrated behavior
- Cooperative Inverse RL: AI asks clarifying questions
Progress:
- These techniques work in practice
- RLHF is commercially successful
- Improving with more data and better methods
Implication: Don’t need perfect specification if we can learn values.
3.2 Inner Alignment → Interpretability
Instead of hoping AI learns the right goals (uncertain):
- Check what goals it learned (verifiable)
Techniques:
- Mechanistic interpretability (understand internal computations)
- Feature visualization (see what neurons/features detect)
- Activation engineering (modify internal representations)
- Probing (test what information is encoded)
Progress:
- Can extract interpretable features from frontier models
- Can modify behavior by editing features
- Techniques improving rapidly
Implication: Might be able to verify alignment by inspection.
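The simplest of these techniques, probing, fits in a few lines: fit a linear classifier on internal activations and check whether a property of interest is decodable. The data below is synthetic and stands in for real model activations and concept labels.

```python
# Minimal linear-probe sketch: test whether a concept is decodable from
# internal activations by fitting a logistic-regression classifier on them.
# Synthetic data stands in for real model activations and concept labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 256))                   # stand-in for internal activations
concept_direction = rng.normal(size=256)
labels = (acts @ concept_direction > 0).astype(int)   # toy "concept present?" labels

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
accuracy = probe.score(acts[1500:], labels[1500:])
print(f"probe accuracy: {accuracy:.2f}")  # high accuracy => concept is linearly decodable
```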
3.3 Scalable Oversight → AI Assistance
Instead of humans evaluating superhuman AI (impossible):
- Use AI to help evaluate AI (scalable)
As Jan Leike emphasizes↗, “evaluation is easier than generation”---a very general principle that holds across many domains, including alignment research.
| Technique | Mechanism | 2024 Research Status |
|---|---|---|
| Debate | AIs argue, humans judge arguments | ICML 2024↗: Debate outperforms consultancy; stronger debaters increase judge accuracy |
| Amplification | Use AI to help humans give better feedback | DeepMind research↗ on human-AI complementarity |
| Recursive reward modeling | Train evaluator AIs | Foundational approach for scalable feedback |
| Process supervision | Evaluate reasoning steps, not outcomes | Improves reliability on complex tasks |
| Weak-to-strong generalization | Train strong AI with weak oversight | GPT-4 trained with GPT-2 oversight shows promising results |
Why this might work:
- Easier to judge arguments than generate solutions
- Decomposing hard problems into easier sub-problems
- AI helping humans extends human oversight range
Example: A human can’t write a great novel but can judge between two novels. Similarly, humans can judge between AI-proposed solutions even if they can’t generate those solutions themselves.
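A hedged sketch of the basic debate protocol: two debaters argue for competing answers over several rounds and a weaker judge picks a winner from the transcript. The `argue` and `judge` helpers are hypothetical stand-ins for LLM or human calls, not the ICML 2024 experimental setup itself.

```python
# Toy sketch of a two-debater, one-judge protocol for scalable oversight.
# `argue` and `judge` are hypothetical stand-ins for LLM/human calls.
def argue(position: str, question: str, transcript: list[str]) -> str:
    return f"[argument for '{position}' on '{question}']"   # placeholder

def judge(question: str, transcript: list[str]) -> str:
    # A (weaker) judge reads the whole transcript and picks the answer
    # it finds more convincing; here a fixed placeholder verdict.
    return "A"

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(argue(answer_a, question, transcript))
        transcript.append(argue(answer_b, question, transcript))
    return answer_a if judge(question, transcript) == "A" else answer_b

winner = debate("Is the passage's claim supported?", "yes", "no")
```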
3.4 Adversarial Dynamics → AI Control
Instead of perfectly aligned AI (very hard):
- Maintain control even if somewhat misaligned (easier)
Redwood Research’s AI control framework↗ designs systems that remain safe even if some components are misaligned. Their 2024 research↗ shows promising results:
| Protocol | Safety | Usefulness | Description |
|---|---|---|---|
| Trusted monitoring | High | High | Trusted model T reviews untrusted model U’s outputs |
| Trusted editing | 92% | 94% of GPT-4 | Best balance of safety and capability |
| Untrusted monitoring | Variable | High | Models monitor each other (collusion risk) |
| Sandboxing | Very High | Medium | Limit AI access to critical systems |
| Red-team evaluation | Robust | --- | Adversarial testing with conservative assumptions |
Techniques:
- Trusted monitoring: Use known-safe AI to watch potentially unsafe AI
- Task decomposition: Break tasks so no single AI is too powerful
- Tripwires: Detect concerning behavior automatically
Implication: Don’t need perfect alignment if we have robust control.
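A minimal sketch of the trusted-monitoring/trusted-editing idea: an untrusted model does the work, while a weaker trusted model scores how suspicious the output looks and substitutes its own edit above a threshold. Helper names and the threshold are hypothetical; Redwood's actual protocols are evaluated on code-backdoor tasks with considerably more machinery.

```python
# Hedged sketch of a control protocol: an untrusted model U does the work,
# a trusted but weaker model T scores suspiciousness and edits or escalates.
# All helpers and the threshold are hypothetical stand-ins.
SUSPICION_THRESHOLD = 0.7

def untrusted_generate(task: str) -> str:
    return f"<U's solution to {task}>"            # placeholder

def trusted_suspicion_score(solution: str) -> float:
    return 0.2                                    # placeholder score in [0, 1]

def trusted_edit(task: str, solution: str) -> str:
    return f"<T's edited version of {solution}>"  # placeholder

def control_protocol(task: str) -> str:
    solution = untrusted_generate(task)
    if trusted_suspicion_score(solution) > SUSPICION_THRESHOLD:
        # Suspicious output: fall back to the trusted model's edit
        # (or escalate to human audit in a fuller protocol).
        return trusted_edit(task, solution)
    return solution
```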
3.5 One-Shot Problem → Gradualism
Instead of needing perfect alignment before deployment (impossible):
- Deploy carefully and iterate (standard engineering)
Why iteration is possible:
- Early advanced AI won’t be powerful enough to cause catastrophe
- Can test thoroughly before giving AI real-world power
- Failures will be detectable before catastrophic
- Economic incentive to fix problems (unsafe AI loses money)
Responsible Scaling Policies: Anthropic’s RSP↗ (updated October 2024) and similar frameworks from other labs provide structured iteration:
| RSP Component | Purpose | Implementation |
|---|---|---|
| AI Safety Levels (ASLs) | Scale safeguards with capability | ASL-1 through ASL-4+ with increasing requirements |
| Capability thresholds | Trigger enhanced safeguards | CBRN-related misuse, autonomous AI R&D |
| Pre-deployment testing | Ensure safety before release | Adversarial testing, red-teaming |
| Pause commitment | Stop if catastrophic risk detected | Will not deploy if risks exceed thresholds |
| Responsible Scaling Officer | Internal governance | Jared Kaplan (Anthropic CSO) leads this function |
METR believes↗ that widespread adoption of high-quality RSPs would significantly reduce risks of an AI catastrophe relative to “business as usual.”
Implication: Don’t need to solve all alignment problems upfront---can solve them as we encounter them.
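The threshold-triggered structure of an RSP can be illustrated with a small configuration sketch. The capability triggers and safeguards below are paraphrased examples, not Anthropic's policy text, and the logic is a simplification.

```python
# Illustrative sketch of how capability thresholds map to required safeguards
# in an RSP-style policy. Entries are paraphrased examples, not policy text.
ASL_REQUIREMENTS = {
    "ASL-2": {"triggers": [], "safeguards": ["security baseline", "misuse evals"]},
    "ASL-3": {
        "triggers": ["meaningful CBRN uplift", "early autonomous AI R&D"],
        "safeguards": ["hardened security", "deployment mitigations", "red-teaming"],
    },
}

def required_level(observed_capabilities: set[str]) -> str:
    for level in sorted(ASL_REQUIREMENTS, reverse=True):
        if any(t in observed_capabilities for t in ASL_REQUIREMENTS[level]["triggers"]):
            return level
    return "ASL-2"

def may_deploy(observed_capabilities: set[str], safeguards_in_place: set[str]) -> bool:
    level = required_level(observed_capabilities)
    # Pause/hold deployment until every required safeguard is in place.
    return set(ASL_REQUIREMENTS[level]["safeguards"]) <= safeguards_in_place
```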
Argument 4: Economic Incentives Favor Alignment
Thesis: Market forces push toward aligned AI, not against it.
4.1 Safe AI Is More Valuable
Commercial reality:
- Users want AI that helps them (not AI that manipulates)
- Companies want AI that does what they ask (not AI that pursues own goals)
- Society wants AI that’s beneficial (regulations will require safety)
Examples:
- ChatGPT’s success due to being helpful and harmless
- Claude marketed on safety and reliability
- Companies differentiate on safety/alignment
Implication: Economic incentive to build aligned AI.
4.2 Reputation and Liability
Companies care about:
- Brand reputation (damaged by harmful AI)
- Legal liability (sued for AI-caused harms)
- Regulatory compliance (governments requiring safety)
- Public trust (necessary for adoption)
Example: Facebook’s reputation damage from algorithmic harms incentivizes other companies to prioritize safety.
Implication: Strong business case for alignment, independent of altruism.
4.3 Safety as Competitive Advantage
Market dynamics:
- “Safe and capable” beats “capable but risky”
- Enterprise customers especially value reliability and safety
- Governments/institutions won’t adopt unsafe AI
- Network effects favor trustworthy AI
Example: Anthropic positions Claude as a safer alternative, built on Constitutional AI.
Implication: Competition on safety, not just capabilities.
4.4 Researchers Want to Build Beneficial AI
Sociological observation: Most AI researchers:
- Genuinely want AI to benefit humanity
- Care about safety and ethics
- Would not knowingly build dangerous AI
- Respond to safety concerns
Culture shift:
- Safety research increasingly prestigious
- Top researchers moving to safety-focused orgs (Anthropic, etc.)
- Safety papers published at top venues
- Safety integrated into capability research
Implication: The human alignment problem (getting researchers to care) is largely solved.
Argument 5: We Have Time
Thesis: Timelines to transformative AI are long enough for alignment research to succeed.
5.1 AGI Is Far Away
Evidence for long timelines:
- Current AI still fails at basic tasks (planning, reasoning, common sense)
- No clear path from LLMs to general intelligence
- May require fundamental breakthroughs
- Historical precedent: AI progress slower than predicted
Expert surveys: Median AGI estimate is 2040s-2050s (20-30 years away)
20-30 years is a lot of time for research (for comparison: 20 years ago was 2004; AI has advanced enormously).
5.2 Progress Will Be Gradual
Slow takeoff scenario:
- Capabilities improve incrementally
- Each step provides feedback for alignment
- Can correct course as we go
- No sudden jump to superintelligence
Why slow takeoff is likely:
- Recursive self-improvement has diminishing returns
- Need to deploy and gather data to improve
- Hardware constraints limit speed
- Testing and safety slow deployment
Implication: Plenty of warning; time to react.
5.3 Warning Shots Will Occur
The optimistic scenario:
- Intermediate AI systems will show misalignment
- These failures will be correctable (not catastrophic)
- Learn from failures and improve
- By the time powerful AI arrives, alignment is solved
Why this is likely:
- Early powerful AI won’t be powerful enough to cause catastrophe
- Will deploy carefully to important systems
- Can monitor closely
- Can shut down if problems arise
Example: Self-driving cars had many failures, but we learned and improved without catastrophe.
5.4 AI Will Help Solve Alignment
Positive feedback loop:
- Current AI already helps with alignment research (literature review, hypothesis generation, experiment design)
- More capable AI will help more
- Can use AI to:
- Generate training data
- Evaluate other AI
- Do interpretability research
- Prove theorems about alignment
- Find flaws in alignment proposals
Implication: Alignment research accelerates along with capabilities; race is winnable.
Argument 6: The Pessimistic Arguments Are Flawed
Thesis: Common arguments for alignment difficulty are overstated or wrong.
6.1 The Orthogonality Thesis Is Overstated
The claim: Intelligence and values are independent
Why this might be wrong for AI trained on human data:
- Human values are part of human intelligence
- To model humans, must model human values
- Values are encoded in language and culture
- Can’t separate intelligence from values in practice
Analogy: Can’t learn English without learning something about English speakers’ values.
Implication: Orthogonality thesis might not apply to AI trained on human data.
6.2 Mesa-Optimization Might Not Occur
The worry: AI develops an internal optimizer with different goals
Why this might not happen:
- Most neural networks don’t contain internal optimizers
- Current architectures are feed-forward (not optimizing)
- Optimization during training ≠ optimizer in learned model
- Even if mesa-optimization occurs, mesa-objective might align with base objective
Empirical observation: Haven’t seen concerning mesa-optimization in current systems.
Implication: This might be a theoretical problem without practical realization.
6.3 Deceptive Alignment Requires Strong Assumptions
The scenario requires:
- AI develops misaligned goals
- AI models training process
- AI reasons strategically about the long term
- AI successfully deceives evaluators
- Deception persists through deployment
Each assumption is questionable:
- Why would AI develop misaligned goals from value learning?
- Current AI doesn’t model training process
- Strategic long-term deception is very complex
- Interpretability might detect deception
- We can design training to not select for deception
Sleeper Agents counterpoint: Yes, deception is possible, but:
- Required explicit training for deception
- Wouldn’t emerge naturally from helpful AI training
- Can be detected with the right tools
- Can be prevented with careful training design
6.4 Instrumental Convergence Can Be Designed Away
The worry: AI will seek power, resources, and self-preservation regardless of its goals
Counter-arguments:
- Only applies to goal-directed agents
- Current LLMs aren’t goal-directed
- Can build capable AI that isn’t agentic
- Can design corrigible agents that don’t resist shutdown
- Instrumental convergence is a tendency, not an inevitability
Corrigibility research: Active area showing promise for AI that accepts shutdown and modification.
6.5 The Verification Problem Is Overstated
The worry: Can’t evaluate superhuman AI
Why this might be tractable:
- Can evaluate in domains where we have ground truth
- Can use AI to help evaluate (debate, amplification)
- Can evaluate reasoning process, not just outcomes
- Can test on easier problems and generalize
- Interpretability provides direct verification
Analogy: Doctors can evaluate treatments without being able to design them. Evaluation is often easier than generation.
Argument 7: Empirical Trends Are Encouraging
Thesis: Actual AI development suggests alignment is getting easier.
7.1 Each Generation Is More Aligned
| Model Generation | Alignment Status | Key Improvements |
|---|---|---|
| GPT-2 (2019) | Minimal | Basic language generation, no alignment training |
| GPT-3 (2020) | Low | Capable but often harmful, no user intent modeling |
| GPT-3.5 (2022) | Moderate | RLHF applied, significantly more helpful |
| GPT-4 (2023) | High | Better reasoning, stronger refusals, fewer hallucinations |
| Claude 3 (2024) | High | Constitutional AI, interpretability research applied |
| GPT-4o/Claude 3.5 (2024) | Very High | Further refinements, multimodal alignment |
Implication: Trend is toward more alignment, not less. Each generation shows measurable improvement.
7.2 Alignment Techniques Are Improving
| Year | Milestone | Significance |
|---|---|---|
| 2017 | Basic RLHF demonstrated | Proof of concept for learning from human feedback |
| 2020 | Scaled to complex tasks | Showed RLHF works beyond simple domains |
| 2022 | ChatGPT commercial deployment | Validated RLHF at scale with millions of users |
| 2022 | Constitutional AI published | Self-improvement without human labels |
| 2023 | Debate and process supervision | New scalable oversight techniques |
| 2024 | Scaling Monosemanticity | Interpretability breakthroughs on production models |
| 2024 | AI Control frameworks | Robust safety without perfect alignment |
| 2024 | Collective Constitutional AI | Democratic input into AI values |
Velocity: Rapid progress; many techniques emerging.
Implication: On track to solve alignment.
7.3 No Catastrophic Failures Yet
Observation: Despite deploying increasingly capable AI:
- No catastrophic misalignment incidents
- No AI causing major harm through misaligned goals
- Failures are minor and correctable
- Systems are behaving as intended (mostly)
Interpretation: Either:
- Alignment is easier than feared (optimistic view)
- We haven’t reached dangerous capability levels yet (pessimistic view)
Optimistic reading: If alignment were fundamentally hard, we’d see more problems already.
7.4 The Field Is Maturing
Signs of maturation:
- Serious empirical work replacing speculation
- Collaboration between academia and industry
- Government attention and resources
- Integration of safety into capability research
- Growing researcher pool
Implication: Alignment is becoming normal science, not an impossible problem.
Integrating the Optimistic View
The Overall Case for Tractability
Reasons for optimism:
- Current techniques work (RLHF, Constitutional AI, interpretability)
- Natural alignment (AI trained on human data learns human values)
- Solutions exist for supposedly hard problems
- Economic alignment (market favors safe AI)
- Sufficient time (decades to solve problem)
- Pessimistic arguments flawed (overstate difficulty)
- Encouraging trends (progress is real and continuing)
Conclusion: Alignment is a hard but tractable engineering problem, likely to be solved with sustained effort.
Key Researcher Views on Tractability
| Researcher | Position | Key Quote / Reasoning | Source |
|---|---|---|---|
| Paul Christiano (ARC) | Optimistic | “This sort of well-scoped problem that I’m optimistic we can totally solve” | 80,000 Hours Podcast↗ |
| Jan Leike (Anthropic) | Optimistic | “I’m optimistic we can figure this problem out”; believes alignment will become “more and more mature” | TIME, 2024↗, Substack↗ |
| Dario Amodei (Anthropic) | Cautiously optimistic | Believes responsible development can succeed; invested heavily in interpretability | Anthropic leadership |
| Sam Altman (OpenAI) | Optimistic with caveats | Supports gradual deployment, iterative safety | Public statements |
| MIRI / Yudkowsky | Pessimistic | Current techniques insufficient for superintelligence; fundamental problems unsolved | LessWrong↗ writings |
Probability Estimates
Estimates of alignment success
Limitations and Uncertainties
What This Argument Doesn’t Claim
Not claiming:
- Alignment is trivial (still requires serious work)
- No risks exist (many risks remain)
- Current techniques are sufficient (need improvement)
- Can be careless (still need caution)
Claiming:
- Alignment is tractable (solvable with effort)
- On reasonable track (current approach could work)
- Time available (decades for research)
- Resources available (funding, talent, compute)
Key Uncertainties
Open questions:
- Will current techniques scale to superhuman AI?
- Will value learning continue to work at higher capabilities?
- Can we detect deceptive alignment if it occurs?
- Will we maintain gradual progress or hit discontinuities?
These questions have evidence on both sides. Optimism is justified but not certain.
Testing the Optimistic View
Key Questions
Practical Implications
If Alignment Is Tractable
Research priorities:
- Continue empirical work (RLHF, Constitutional AI, interpretability)
- Scale current techniques to more capable systems
- Test carefully at each capability level
- Integrate safety into capability research
Policy priorities:
- Support alignment research (but not to exclusion of capability development)
- Require safety testing before deployment
- Encourage responsible scaling
- Facilitate beneficial applications
Not priorities:
- Pause AI development
- Extreme precautionary measures
- Treating alignment as unsolvable
- Focusing only on x-risk
Balancing Optimism and Caution
Reasonable approach:
- Take alignment seriously (it’s hard, just not impossibly so)
- Continue research (empirical and theoretical)
- Test carefully (but don’t halt progress)
- Deploy responsibly (with safety measures)
- Update on evidence (could be wrong either direction)
Avoid:
- Complacency (alignment still requires work)
- Overconfidence (uncertainties remain)
- Ignoring risks (they exist)
- Dismissing pessimistic views (they might be right)
Relation to Other Arguments
This argument responds to:
- Why Alignment Might Be Hard: Each hard problem has tractable solution
- Case For X-Risk: Premise 4 (won’t solve alignment in time) is questionable
- Case Against X-Risk: Agrees with skeptical position on alignment
Key disagreements:
- Pessimists: Current techniques won’t scale; fundamental problems remain unsolved
- Optimists: Current techniques will scale; problems are tractable
- Evidence: Currently ambiguous; both sides can claim support
The crucial question: Will RLHF/Constitutional AI/interpretability work for superhuman AI?
We don’t know yet. But current trends are encouraging.
Conclusion
The case for tractable alignment:
- Empirical success: Current techniques (RLHF, Constitutional AI) work well
- Natural alignment: AI learns human values from training data
- Tractable solutions: Each hard problem has promising approaches
- Aligned incentives: Economics favors safe AI
- Sufficient time: Decades for research; gradual progress
- Flawed pessimism: Worst-case arguments overstated
- Encouraging trends: Progress is real and accelerating
Bottom line: Alignment is hard but solvable. With sustained effort, likely to succeed before transformative AI.
Credence: 70-85% we solve alignment in time.
This doesn’t mean complacency---alignment requires serious, sustained work. But pessimism and defeatism are unwarranted. We’re on a reasonable track.
Key Sources
Empirical Research on Alignment Techniques
- ICLR 2025: Conditional PM RLHF↗ - 29-41% improvement in human preference alignment
- CVPR 2024: RLHF-V↗ - Fine-grained correctional feedback for trustworthy MLLMs
- Anthropic: Constitutional AI↗ - Harmlessness from AI feedback
- FAccT 2024: Collective Constitutional AI↗ - Aligning language models with public input
- Anthropic: Scaling Monosemanticity↗ - Interpretable features from Claude 3 Sonnet
Scalable Oversight and AI Control
- ICML 2024: Scalable AI Safety via Debate↗ - Debate improves judge accuracy
- Redwood Research: AI Control↗ - Control frameworks for potentially misaligned AI
- Redwood Research: The Case for AI Control↗ - Trusted editing achieves 92% safety
Researcher Perspectives
- Jan Leike: Why I’m Optimistic↗ - Sources of alignment optimism
- TIME: Jan Leike Profile↗ - 2024 interview on alignment progress
- 80,000 Hours: Paul Christiano Interview↗ - AI alignment solutions
- Shard Theory Sequence↗ - Quintin Pope et al. on value learning
Policy and Governance
- Anthropic RSP (Oct 2024)↗ - Updated Responsible Scaling Policy
- METR: Responsible Scaling Policies↗ - Analysis of RSP effectiveness