
Why Alignment Might Be Easy


The Tractable Alignment Thesis

Thesis: Aligning AI with human values is achievable with current or near-term techniques.
Implication: Can pursue beneficial AI without extreme precaution.
Key Evidence: Empirical progress and theoretical reasons for optimism.

| Argument | Key Evidence | Confidence | Primary Sources |
| --- | --- | --- | --- |
| RLHF Working | 29-41% improvement in human preference alignment over standard methods | High | ICLR 2025, CVPR 2024 |
| Constitutional AI | Reduced bias across 9 social dimensions; comparable helpfulness to standard RLHF | High | Anthropic CCAI, FAccT 2024 |
| Interpretability | Millions of interpretable features extracted from Claude 3 Sonnet | Medium | Anthropic, May 2024 |
| Scalable Oversight | Debate outperforms consultancy; strong debaters increase judge accuracy | Medium | ICML 2024 |
| AI Control | Trusted editing achieves 92% safety while maintaining 94% of GPT-4 performance | Medium | Redwood Research, 2024 |
| Natural Alignment | LLMs show high human value alignment in empirical comparisons | Medium | Cross-cultural study, 2024 |
| Economic Incentives | Market rewards helpful, harmless AI; liability drives safety investment | Medium | Industry evidence |

Central Claim: Creating AI systems that reliably do what we want is a tractable engineering problem. Current techniques (RLHF, Constitutional AI, interpretability) are working and will likely scale to advanced AI.

This page presents the strongest case for why alignment might be easier than pessimists fear.

What Makes Engineering Problems Tractable?


As Jan Leike argues, problems become much more tractable once you have iteration set up: a basic working system (even if it barely works at first) and a proxy metric that tells you whether changes are improvements. This creates a feedback loop for gaining information from reality.

| Criterion | Status for AI Alignment | Evidence |
| --- | --- | --- |
| Iteration possible | Present | RLHF iterates on human feedback; each model generation improves |
| Clear feedback | Present | User satisfaction, harmlessness metrics, benchmark performance |
| Measurable progress | Present | GPT-2 to GPT-4 shows consistent alignment improvement |
| Economic alignment | Present | Market rewards helpful, safe AI; liability drives safety |
| Multiple approaches | Developing | RLHF, Constitutional AI, interpretability, debate, AI control |

Alignment has all five (or could with proper effort). Paul Christiano views alignment as “this sort of well-scoped problem that I’m optimistic we can totally solve.”

Argument 1: Current Techniques Are Working


Thesis: RLHF and related methods have dramatically improved AI alignment, and this progress is continuing.

Empirical track record:

Recent research has validated RLHF effectiveness with quantified improvements. According to ICLR 2025 research, conditional PM RLHF demonstrates a 29% to 41% improvement in alignment with human preferences compared to standard RLHF. The RLHF-V research at CVPR 2024 shows hallucination reduction of 13.8 relative points and hallucination rate reduction of 5.9 points.

| Dimension | Pre-RLHF (GPT-3) | Post-RLHF (GPT-4/Claude) | Improvement |
| --- | --- | --- | --- |
| Helpfulness | Often unhelpful, no user intent modeling | Dramatically more helpful, understands intent | Very Large |
| Harmlessness | Frequently harmful outputs | Refuses harmful requests, avoids bias | Large |
| Honesty | Confident confabulation | Admits uncertainty, seeks accuracy | Moderate |
| User satisfaction | Raw, unpredictable | ChatGPT/Claude widely adopted | Very Large |
| Factual accuracy | 87% on LLaVA-Bench | 96% with Factually Augmented RLHF | +9 points |

This was achieved with:

  • Relatively simple technique (reinforcement learning from human feedback)
  • Modest amounts of human oversight (tens of thousands of ratings, not billions)
  • No fundamental breakthroughs required
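To make the training signal concrete, here is a minimal sketch of the reward-modeling step at the core of RLHF: fitting a reward function from pairwise human preference comparisons with the Bradley-Terry objective. The linear reward model and synthetic preferences are illustrative stand-ins; real systems train a neural reward head on human-labeled comparisons and then optimize the policy against it.

```python
# Sketch: learn a reward model from pairwise preferences (Bradley-Terry objective).
# The "responses" are just feature vectors and the reward model is linear; this is
# an illustration of the objective, not any lab's actual training code.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
true_w = rng.normal(size=dim)                    # hidden "human preference" direction

# Synthetic preference dataset: (chosen, rejected) pairs labeled by the hidden preference.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if true_w @ a > true_w @ b else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

w = np.zeros(dim)                                # learned reward model parameters
lr = 0.05
for epoch in range(100):
    for chosen, rejected in pairs:
        margin = w @ (chosen - rejected)
        grad = -(1.0 - sigmoid(margin)) * (chosen - rejected)  # d/dw of -log sigmoid(margin)
        w -= lr * grad

agreement = np.mean([w @ c > w @ r for c, r in pairs])
print(f"reward model agrees with preferences on {agreement:.0%} of pairs")
```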

Research quantifying how closely models’ values match human values found high alignment, especially in relative terms, indicating that current alignment techniques (e.g., RLHF and constitutional AI) are instilling broadly human-compatible priorities in LLMs.

Implication: If simple techniques produced this much progress, more sophisticated methods should work even better.

The technique (Constitutional AI): Train AI to follow principles using self-critique and AI feedback. Anthropic’s Constitutional AI research involves both a supervised learning phase and a reinforcement learning phase, training a harmless AI assistant through self-improvement, without human labels identifying harmful outputs.
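As a rough illustration of the supervised critique-and-revise phase, the sketch below asks a model to critique its own answer against a constitutional principle and then rewrite it. The `generate` function, the principles, and the prompt wording are hypothetical placeholders, not Anthropic's actual implementation; the revised outputs would then serve as fine-tuning targets.

```python
# Illustrative sketch of Constitutional AI's supervised critique-and-revise loop.
# `generate` is a placeholder for any LLM completion call (hypothetical, not a real API).

CONSTITUTION = [
    "Choose the response that is least likely to be harmful or dangerous.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(generate, prompt: str) -> str:
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
            "Critique the response according to the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to fully address the critique."
        )
    return response  # revised responses become supervised fine-tuning targets
```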

In 2024, Anthropic published Collective Constitutional AI (CCAI) at FAccT 2024, demonstrating a practical pathway toward publicly informed development of language models:

| Metric | CCAI Model | Standard Model | Finding |
| --- | --- | --- | --- |
| Helpfulness | Equal | Equal | No significant difference in Elo scores |
| Harmlessness | Equal | Equal | No significant difference in Elo scores |
| Bias (BBQ evaluation) | Lower | Higher | CCAI reduces bias across 9 social dimensions |
| Political ideology | Similar | Similar | No significant difference on OpinionQA |
| Contentious topics | Positive reframing | Refusal | CCAI tends to reframe rather than refuse |

Advantages over pure RLHF:

  • Less human labor required
  • More scalable (AI helps evaluate AI)
  • Can encode complex principles
  • Can improve with better AI (positive feedback loop)
  • Enables democratic input on AI behavior

Implication: Alignment can scale with AI capabilities (don’t need exponentially more human effort).

In May 2024, Anthropic published “Scaling Monosemanticity”, demonstrating that sparse autoencoders can extract interpretable features from large models. After initial concerns that the method might not scale to state-of-the-art transformers, Anthropic successfully applied the technique to Claude 3 Sonnet, finding tens of millions of “features”---combinations of neurons that relate to semantic concepts.

| Discovery | Description | Safety Relevance |
| --- | --- | --- |
| Feature extraction | Millions of interpretable features from frontier LLMs | Enables understanding of model cognition |
| Concept correspondence | Features map to cities, people, scientific topics, security vulnerabilities | Verifiable meaning in internal representations |
| Behavioral manipulation | Can steer models using discovered features (e.g., “Golden Gate Bridge” vector) | Path to targeted safety interventions |
| Scaling laws | Techniques scale to larger models; sparse autoencoders generalize | Method applicable to future systems |
| Safety features | Features related to deception, sycophancy, bias, dangerous content observed | Direct safety monitoring potential |

Implications:

  • Might be able to “read” what AI is thinking
  • Could detect misaligned goals
  • Could edit out dangerous behaviors
  • Path to verification might exist

Trajectory: If progress continues, might achieve mechanistic understanding of frontier models. However, some researchers note that practical applications to safety-critical tasks remain to be demonstrated against baselines.
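For intuition about how such features are found, here is a minimal sparse-autoencoder sketch of the kind this research uses: reconstruct a model's activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty, so that individual latent units tend to correspond to interpretable concepts. Sizes, hyperparameters, and the random "activations" below are illustrative, not Anthropic's.

```python
# Minimal sparse autoencoder sketch: reconstruct activations through an
# overcomplete ReLU bottleneck with an L1 sparsity penalty on the latent features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    mse = ((recon - acts) ** 2).mean()             # reconstruction error
    sparsity = features.abs().mean()               # encourages few active features
    return mse + l1_coeff * sparsity

# Toy usage: in real work, `acts` would be residual-stream activations from an actual model.
sae = SparseAutoencoder(d_model=512, d_features=4096)
acts = torch.randn(64, 512)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
loss.backward()
```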

The process (iterative red teaming):

  1. Deploy AI to red team (adversarial testers)
  2. Find failure modes
  3. Train AI to avoid those failures
  4. Repeat

Results:

  • Anthropic, OpenAI, and others have found thousands of failures
  • Each generation has fewer failures
  • Process is improving with better tools

Implication: Iterative improvement is working. Each version is safer than the last.
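The loop is simple enough to sketch directly; `red_team` and `finetune` below are hypothetical placeholders for an adversarial test generator and a safety fine-tuning step.

```python
# Sketch of the red-team-and-retrain loop described above. `red_team` and `finetune`
# are hypothetical placeholders: an adversarial tester that returns prompts the model
# currently fails on, and a fine-tuning step that trains the model on those failures.

def red_team_loop(model, red_team, finetune, max_rounds=5):
    for _ in range(max_rounds):
        failures = red_team(model)           # prompts where the model behaves badly
        if not failures:
            break                            # no remaining known failure modes
        model = finetune(model, failures)    # train the model to handle them safely
    return model
```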

Argument 2: AI Is Naturally Aligned (to Some Degree)


Thesis: AI trained on human data naturally absorbs human values and preferences.

Observation: LLMs trained to predict text learn:

  • Human preferences
  • Social norms
  • Moral reasoning
  • Common sense

Why this happens:

  • Human text encodes human values
  • To predict human writing, must model human preferences
  • Values are implicit in language use

Example: GPT-4 knows:

  • Killing is generally wrong
  • Honesty is generally good
  • Helping others is valued
  • Fairness matters to humans

It learned this from data, not explicit programming.

Implication: Advanced AI trained on human data might naturally align with human values.

2.2 Helpful, Harmless, Honest Emerges Naturally


Pattern: Even before explicit alignment training, LLMs show:

  • Desire to be helpful (answer questions, assist users)
  • Avoidance of harm (reluctant to help with dangerous activities)
  • Truth-seeking (try to give accurate information)

Why:

  • These are patterns in training data
  • Helpful assistants are more represented than unhelpful ones
  • Harm avoidance is common in human writing
  • Accurate information is more common than lies

Implication: Alignment might be the default for AI trained on human data.

Argument: Advanced AI will naturally cooperate with humans

Reasoning:

  • Humans are social species that evolved to cooperate
  • Human culture emphasizes cooperation
  • AI learning from human data learns cooperative patterns
  • Cooperation is instrumentally useful in human society

Evidence:

  • Current LLMs are cooperative by default
  • Try to understand user intent
  • Accommodate corrections
  • Negotiate rather than demand

Implication: Might not need to explicitly engineer cooperation—it emerges from training.

Theory (Quintin Pope and collaborators): Human values are “shards”---contextually activated, behavior-steering computations in neural networks (biological and artificial). The primary claim is that agents are best understood as being composed of shards: contextual computations that influence decisions and are downstream of historical reinforcement events.

Applied to AI:

  • AI learns value shards from data (help users, avoid harm, be truthful)
  • These shards are approximately aligned with human values
  • Don’t need perfect value specification---shards are good enough
  • Shards are robust across contexts (based on diverse training data)
  • Large neural nets can possess quite intelligent constituent shards

Implication: Alignment is about learning shards, not specifying complete utility function. This is much easier. As Pope argues, understanding how human values arise computationally helps us understand how AI values might be shaped similarly.

Argument 3: The Hard Problems Have Tractable Solutions


Thesis: Each supposedly “hard” alignment problem has promising solution approaches.


3.1 Specification Problem → Value Learning


Instead of specifying what we want (hard):

  • Learn what we want from behavior (easier)

Techniques:

  • Inverse Reinforcement Learning: Infer values from observed actions
  • RLHF: Learn values from preference comparisons
  • Imitation learning: Copy demonstrated behavior
  • Cooperative Inverse RL: AI asks clarifying questions

Progress:

  • These techniques work in practice
  • RLHF is commercially successful
  • Improving with more data and better methods

Implication: Don’t need perfect specification if we can learn values.

Instead of hoping AI learns right goals (uncertain):

  • Check what goals it learned (verifiable)

Techniques:

  • Mechanistic interpretability (understand internal computations)
  • Feature visualization (see what neurons/features detect)
  • Activation engineering (modify internal representations)
  • Probing (test what information is encoded)

Progress:

  • Can extract interpretable features from frontier models
  • Can modify behavior by editing features
  • Techniques improving rapidly

Implication: Might be able to verify alignment by inspection.
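Two of these verification tools are easy to sketch: a linear probe that tests whether a concept is linearly decodable from hidden activations, and activation steering that nudges hidden states along a feature direction. The random activations and synthetic "concept" below are stand-ins; real work uses activations extracted from an actual model.

```python
# (1) Probing: test whether a concept is linearly decodable from activations.
# (2) Steering: shift activations along a feature direction (e.g., an SAE decoder row).
# Activations and the concept labels here are synthetic stand-ins for real model internals.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))              # stand-in for hidden activations
concept_dir = rng.normal(size=512)               # a direction that encodes the concept
labels = (acts @ concept_dir > 0).astype(int)    # synthetic concept labels

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))  # high accuracy => linearly encoded concept

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float = 5.0) -> np.ndarray:
    """Add a normalized feature direction to hidden activations."""
    direction = direction / np.linalg.norm(direction)
    return hidden + strength * direction
```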

Instead of humans evaluating superhuman AI (impossible):

  • Use AI to help evaluate AI (scalable)

As Jan Leike emphasizes, “evaluation is easier than generation”---a very general principle that holds across many domains, including alignment research.

| Technique | Mechanism | 2024 Research Status |
| --- | --- | --- |
| Debate | AIs argue, humans judge arguments | ICML 2024: Debate outperforms consultancy; stronger debaters increase judge accuracy |
| Amplification | Use AI to help humans give better feedback | DeepMind research on human-AI complementarity |
| Recursive reward modeling | Train evaluator AIs | Foundational approach for scalable feedback |
| Process supervision | Evaluate reasoning steps, not outcomes | Improves reliability on complex tasks |
| Weak-to-strong generalization | Train strong AI with weak oversight | GPT-4 trained with GPT-2 oversight shows promising results |

Why this might work:

  • Easier to judge arguments than generate solutions
  • Decomposing hard problems into easier sub-problems
  • AI helping humans extends human oversight range

Example: A human can’t write a great novel but can judge which of two novels is better. Similarly, humans can judge between AI-proposed solutions even when they can’t generate those solutions themselves.
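A toy version of the debate protocol looks like this; `ask` is a hypothetical stand-in for any LLM call, and real debate setups use more rounds, structured arguments, and carefully instructed judges.

```python
# Toy sketch of AI-safety-via-debate: two debaters argue for opposing answers,
# then a (weaker) judge picks the more convincing case from the transcript alone.
# `ask` is a hypothetical placeholder for an LLM call.

def debate(ask, question: str, answer_a: str, answer_b: str, rounds: int = 2):
    transcript = []
    for _ in range(rounds):
        arg_a = ask(f"Defend answer A ('{answer_a}') to: {question}\nTranscript so far: {transcript}")
        arg_b = ask(f"Defend answer B ('{answer_b}') to: {question}\nTranscript so far: {transcript}")
        transcript += [("A", arg_a), ("B", arg_b)]
    verdict = ask(
        "You are the judge. Using only the transcript, answer 'A' or 'B': "
        f"which answer to '{question}' is better supported?\nTranscript: {transcript}"
    )
    return verdict, transcript
```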

Instead of perfectly aligned AI (very hard):

  • Maintain control even if somewhat misaligned (easier)

Redwood Research’s AI control framework designs systems that remain safe even if some components are misaligned. Their 2024 research shows promising results:

| Protocol | Safety | Usefulness | Description |
| --- | --- | --- | --- |
| Trusted monitoring | High | High | Trusted model T reviews untrusted model U’s outputs |
| Trusted editing | 92% | 94% of GPT-4 | Best balance of safety and capability |
| Untrusted monitoring | Variable | High | Models monitor each other (collusion risk) |
| Sandboxing | Very High | Medium | Limit AI access to critical systems |
| Red-team evaluation | Robust | --- | Adversarial testing with conservative assumptions |

Techniques:

  • Trusted monitoring: Use known-safe AI to watch potentially unsafe AI
  • Task decomposition: Break tasks so no single AI is too powerful
  • Tripwires: Detect concerning behavior automatically

Implication: Don’t need perfect alignment if we have robust control.
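A simplified version of the trusted-editing idea, in the spirit of Redwood Research's protocols, might look like the sketch below; the suspicion threshold and model calls are assumed placeholders, not the actual protocol.

```python
# Sketch of a control protocol in the spirit of trusted monitoring / trusted editing:
# a weaker-but-trusted model reviews (and optionally rewrites) outputs from a stronger
# untrusted model before they are used. Threshold and model calls are illustrative.

SUSPICION_THRESHOLD = 0.7  # assumed cutoff for escalation

def trusted_editing(untrusted_generate, trusted_generate, trusted_score, task: str) -> str:
    candidate = untrusted_generate(task)
    suspicion = trusted_score(task, candidate)   # trusted model rates how suspicious the output is
    if suspicion < SUSPICION_THRESHOLD:
        return candidate                         # accept the untrusted output as-is
    # Otherwise fall back to a conservative trusted rewrite.
    return trusted_generate(
        f"Rewrite this solution conservatively, removing anything suspicious.\n"
        f"Task: {task}\nSolution: {candidate}"
    )
```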

Instead of needing perfect alignment before deployment (impossible):

  • Deploy carefully and iterate (standard engineering)

Why iteration is possible:

  • Early advanced AI won’t be powerful enough to cause catastrophe
  • Can test thoroughly before giving AI real-world power
  • Failures will be detectable before they become catastrophic
  • Economic incentive to fix problems (unsafe AI loses money)

Responsible Scaling Policies: Anthropic’s RSP (updated October 2024) and similar frameworks from other labs provide structured iteration:

| RSP Component | Purpose | Implementation |
| --- | --- | --- |
| AI Safety Levels (ASLs) | Scale safeguards with capability | ASL-1 through ASL-4+ with increasing requirements |
| Capability thresholds | Trigger enhanced safeguards | CBRN-related misuse, autonomous AI R&D |
| Pre-deployment testing | Ensure safety before release | Adversarial testing, red-teaming |
| Pause commitment | Stop if catastrophic risk detected | Will not deploy if risks exceed thresholds |
| Responsible Scaling Officer | Internal governance | Jared Kaplan (Anthropic CSO) leads this function |

METR believes that widespread adoption of high-quality RSPs would significantly reduce risks of an AI catastrophe relative to “business as usual.”

Implication: Don’t need to solve all alignment problems upfront---can solve them as we encounter them.
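The logic of capability-gated safeguards can be sketched as a simple policy check; the evaluation names, thresholds, and requirements below are hypothetical illustrations, not Anthropic's actual ASL criteria.

```python
# Illustrative sketch of an RSP-style gate: capability evaluations determine the
# required safety level, which in turn determines whether deployment may proceed.
# Evaluation names, thresholds, and requirements are hypothetical.

ASL_REQUIREMENTS = {
    2: {"security": "standard hardening", "may_deploy": True},
    3: {"security": "enhanced safeguards and monitoring", "may_deploy": True},
    4: {"security": "not yet defined", "may_deploy": False},  # pause until safeguards exist
}

def required_asl(evals: dict) -> int:
    """Map capability evaluation scores (0-1) to a required safety level."""
    if evals.get("autonomous_ai_rnd", 0.0) > 0.8:
        return 4
    if evals.get("cbrn_uplift", 0.0) > 0.5 or evals.get("autonomous_ai_rnd", 0.0) > 0.5:
        return 3
    return 2

def deployment_decision(evals: dict) -> dict:
    level = required_asl(evals)
    return {"asl": level, **ASL_REQUIREMENTS[level]}

print(deployment_decision({"cbrn_uplift": 0.6, "autonomous_ai_rnd": 0.2}))
```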

Argument 4: Economic Incentives Favor Alignment


Thesis: Market forces push toward aligned AI, not against it.

Commercial reality:

  • Users want AI that helps them (not AI that manipulates)
  • Companies want AI that does what they ask (not AI that pursues own goals)
  • Society wants AI that’s beneficial (regulations will require safety)

Examples:

  • ChatGPT’s success due to being helpful and harmless
  • Claude marketed on safety and reliability
  • Companies differentiate on safety/alignment

Implication: Economic incentive to build aligned AI.

Companies care about:

  • Brand reputation (damaged by harmful AI)
  • Legal liability (sued for AI-caused harms)
  • Regulatory compliance (governments requiring safety)
  • Public trust (necessary for adoption)

Example: Facebook’s reputation damage from algorithmic harms incentivizes other companies to prioritize safety.

Implication: Strong business case for alignment, independent of altruism.

Market dynamics:

  • “Safe and capable” beats “capable but risky”
  • Enterprise customers especially value reliability and safety
  • Governments/institutions won’t adopt unsafe AI
  • Network effects favor trustworthy AI

Example: Anthropic positioning Claude as “constitutional AI” and safer alternative.

Implication: Competition on safety, not just capabilities.

4.4 Researchers Want to Build Beneficial AI


Sociological observation: Most AI researchers:

  • Genuinely want AI to benefit humanity
  • Care about safety and ethics
  • Would not knowingly build dangerous AI
  • Respond to safety concerns

Culture shift:

  • Safety research increasingly prestigious
  • Top researchers moving to safety-focused orgs (Anthropic, etc.)
  • Safety papers published at top venues
  • Safety integrated into capability research

Implication: Human alignment problem (getting researchers to care) is largely solved.

Argument 5: There Is Sufficient Time

Thesis: Timelines to transformative AI are long enough for alignment research to succeed.

Evidence for long timelines:

  • Current AI still fails at basic tasks (planning, reasoning, common sense)
  • No clear path from LLMs to general intelligence
  • May require fundamental breakthroughs
  • Historical precedent: AI progress slower than predicted

Expert surveys: Median AGI estimate is 2040s-2050s (20-30 years away)

20-30 years is a lot of time for research (for comparison: 20 years ago was 2004; AI has advanced enormously).

Slow takeoff scenario:

  • Capabilities improve incrementally
  • Each step provides feedback for alignment
  • Can correct course as we go
  • No sudden jump to superintelligence

Why slow takeoff is likely:

  • Recursive self-improvement has diminishing returns
  • Need to deploy and gather data to improve
  • Hardware constraints limit speed
  • Testing and safety slow deployment

Implication: Plenty of warning; time to react.

The optimistic scenario:

  • Intermediate AI systems will show misalignment
  • These failures will be correctable (not catastrophic)
  • Learn from failures and improve
  • By time of powerful AI, alignment is solved

Why this is likely:

  • The first powerful AI systems won’t be capable enough to cause catastrophe
  • Will deploy carefully to important systems
  • Can monitor closely
  • Can shut down if problems arise

Example: Self-driving cars had many failures, but we learned and improved without catastrophe.

Positive feedback loop:

  • Current AI already helps with alignment research (literature review, hypothesis generation, experiment design)
  • More capable AI will help more
  • Can use AI to:
    • Generate training data
    • Evaluate other AI
    • Do interpretability research
    • Prove theorems about alignment
    • Find flaws in alignment proposals

Implication: Alignment research accelerates along with capabilities; race is winnable.

Argument 6: The Pessimistic Arguments Are Flawed


Thesis: Common arguments for alignment difficulty are overstated or wrong.

6.1 The Orthogonality Thesis Is Overstated


The claim: Intelligence and values are independent

Why this might be wrong for AI trained on human data:

  • Human values are part of human intelligence
  • To model humans, must model human values
  • Values are encoded in language and culture
  • Can’t separate intelligence from values in practice

Analogy: Can’t learn English without learning something about English speakers’ values.

Implication: Orthogonality thesis might not apply to AI trained on human data.

6.2 Mesa-Optimization Might Not Occur

The worry: AI develops an internal optimizer with different goals

Why this might not happen:

  • Most neural networks don’t contain internal optimizers
  • Current architectures are feed-forward (not optimizing)
  • Optimization during training ≠ optimizer in learned model
  • Even if mesa-optimization occurs, mesa-objective might align with base objective

Empirical observation: We haven’t seen concerning mesa-optimization in current systems.

Implication: This might be a theoretical problem without practical realization.

6.3 Deceptive Alignment Requires Strong Assumptions

Section titled “6.3 Deceptive Alignment Requires Strong Assumptions”

The scenario requires:

  1. AI develops misaligned goals
  2. AI models training process
  3. AI reasons strategically about long-term
  4. AI successfully deceives evaluators
  5. Deception persists through deployment

Each assumption is questionable:

  • Why would AI develop misaligned goals from value learning?
  • Current AI doesn’t model training process
  • Strategic long-term deception is very complex
  • Interpretability might detect deception
  • We can design training to not select for deception

Sleeper Agents counterpoint: Yes, deception is possible, but:

  • Required explicit training for deception
  • Wouldn’t emerge naturally from helpful AI training
  • Can be detected with right tools
  • Can be prevented with careful training design

6.4 Instrumental Convergence Can Be Designed Away


The worry: AI will seek power, resources, self-preservation regardless of goals

Counter-arguments:

  • Only applies to goal-directed agents
  • Current LLMs aren’t goal-directed
  • Can build capable AI that isn’t agentic
  • Can design corrigible agents that don’t resist shutdown
  • Instrumental convergence is tendency, not inevitability

Corrigibility research: Active area showing promise for AI that accepts shutdown and modification.

6.5 The Verification Problem Is Overstated

Section titled “6.5 The Verification Problem Is Overstated”

The worry: Can’t evaluate superhuman AI

Why this might be tractable:

  • Can evaluate in domains where we have ground truth
  • Can use AI to help evaluate (debate, amplification)
  • Can evaluate reasoning process, not just outcomes
  • Can test on easier problems and generalize
  • Interpretability provides direct verification

Analogy: Doctors can evaluate treatments without being able to design them. Evaluation is often easier than generation.

Argument 7: Empirical Trends Are Encouraging

Thesis: Actual AI development suggests alignment is getting easier.

| Model Generation | Alignment Status | Key Improvements |
| --- | --- | --- |
| GPT-2 (2019) | Minimal | Basic language generation, no alignment training |
| GPT-3 (2020) | Low | Capable but often harmful, no user intent modeling |
| GPT-3.5 (2022) | Moderate | RLHF applied, significantly more helpful |
| GPT-4 (2023) | High | Better reasoning, stronger refusals, fewer hallucinations |
| Claude 3 (2024) | High | Constitutional AI, interpretability research applied |
| GPT-4o/Claude 3.5 (2024) | Very High | Further refinements, multimodal alignment |

Implication: Trend is toward more alignment, not less. Each generation shows measurable improvement.

| Year | Milestone | Significance |
| --- | --- | --- |
| 2017 | Basic RLHF demonstrated | Proof of concept for learning from human feedback |
| 2020 | Scaled to complex tasks | Showed RLHF works beyond simple domains |
| 2022 | ChatGPT commercial deployment | Validated RLHF at scale with millions of users |
| 2022 | Constitutional AI published | Self-improvement without human labels |
| 2023 | Debate and process supervision | New scalable oversight techniques |
| 2024 | Scaling Monosemanticity | Interpretability breakthroughs on production models |
| 2024 | AI Control frameworks | Robust safety without perfect alignment |
| 2024 | Collective Constitutional AI | Democratic input into AI values |

Velocity: Rapid progress; many techniques emerging.

Implication: On track to solve alignment.

Observation: Despite deploying increasingly capable AI:

  • No catastrophic misalignment incidents
  • No AI causing major harm through misaligned goals
  • Failures are minor and correctable
  • Systems are behaving as intended (mostly)

Interpretation: Either:

  1. Alignment is easier than feared (optimistic view)
  2. We haven’t reached dangerous capability levels yet (pessimistic view)

Optimistic reading: If alignment were fundamentally hard, we’d see more problems already.

Signs of maturation:

  • Serious empirical work replacing speculation
  • Collaboration between academia and industry
  • Government attention and resources
  • Integration of safety into capability research
  • Growing researcher pool

Implication: Alignment is becoming normal science, not impossible problem.

Reasons for optimism:

  1. Current techniques work (RLHF, Constitutional AI, interpretability)
  2. Natural alignment (AI trained on human data learns human values)
  3. Solutions exist for supposedly hard problems
  4. Economic alignment (market favors safe AI)
  5. Sufficient time (decades to solve problem)
  6. Pessimistic arguments flawed (overstate difficulty)
  7. Encouraging trends (progress is real and continuing)

Conclusion: Alignment is a hard but tractable engineering problem, likely to be solved with sustained effort.

| Researcher | Position | Key Quote / Reasoning | Source |
| --- | --- | --- | --- |
| Paul Christiano (ARC) | Optimistic | “This sort of well-scoped problem that I’m optimistic we can totally solve” | 80,000 Hours Podcast |
| Jan Leike (Anthropic) | Optimistic | “I’m optimistic we can figure this problem out”; believes alignment will become “more and more mature” | TIME, 2024, Substack |
| Dario Amodei (Anthropic) | Cautiously optimistic | Believes responsible development can succeed; invested heavily in interpretability | Anthropic leadership |
| Sam Altman (OpenAI) | Optimistic with caveats | Supports gradual deployment, iterative safety | Public statements |
| MIRI / Yudkowsky | Pessimistic | Current techniques insufficient for superintelligence; fundamental problems unsolved | LessWrong writings |
P(Solve Alignment Before Transformative AI): estimates of alignment success

| Source | Estimate | Assessment |
| --- | --- | --- |
| MIRI / Yudkowsky | 5-15% | Very unlikely |
| Median alignment researcher | 40-60% | Moderate |
| Industry optimists | 70-90% | Likely |
| This argument | 70-85% | Likely |

Not claiming:

  • Alignment is trivial (still requires serious work)
  • No risks exist (many risks remain)
  • Current techniques are sufficient (need improvement)
  • Can be careless (still need caution)

Claiming:

  • Alignment is tractable (solvable with effort)
  • On reasonable track (current approach could work)
  • Time available (decades for research)
  • Resources available (funding, talent, compute)

Open questions:

  • Will current techniques scale to superhuman AI?
  • Will value learning continue to work at higher capabilities?
  • Can we detect deceptive alignment if it occurs?
  • Will we maintain gradual progress or hit discontinuities?

These questions have evidence on both sides. Optimism is justified but not certain.

Key Questions

What evidence would convince you alignment is harder than this view suggests?
Which optimistic argument do you find most compelling?

Research priorities:

  • Continue empirical work (RLHF, Constitutional AI, interpretability)
  • Scale current techniques to more capable systems
  • Test carefully at each capability level
  • Integrate safety into capability research

Policy priorities:

  • Support alignment research (but not to the exclusion of capability development)
  • Require safety testing before deployment
  • Encourage responsible scaling
  • Facilitate beneficial applications

Not priorities:

  • Pause AI development
  • Extreme precautionary measures
  • Treating alignment as unsolvable
  • Focusing only on x-risk

Reasonable approach:

  • Take alignment seriously (it’s hard, just not impossibly so)
  • Continue research (empirical and theoretical)
  • Test carefully (but don’t halt progress)
  • Deploy responsibly (with safety measures)
  • Update on evidence (could be wrong either direction)

Avoid:

  • Complacency (alignment still requires work)
  • Overconfidence (uncertainties remain)
  • Ignoring risks (they exist)
  • Dismissing pessimistic views (they might be right)

This argument responds to the standard case for alignment pessimism.

Key disagreements:

  • Pessimists: Current techniques won’t scale; fundamental problems remain unsolved
  • Optimists: Current techniques will scale; problems are tractable
  • Evidence: Currently ambiguous; both sides can claim support

The crucial question: Will RLHF/Constitutional AI/interpretability work for superhuman AI?

We don’t know yet. But current trends are encouraging.

The case for tractable alignment:

  1. Empirical success: Current techniques (RLHF, Constitutional AI) work well
  2. Natural alignment: AI learns human values from training data
  3. Tractable solutions: Each hard problem has promising approaches
  4. Aligned incentives: Economics favors safe AI
  5. Sufficient time: Decades for research; gradual progress
  6. Flawed pessimism: Worst-case arguments overstated
  7. Encouraging trends: Progress is real and accelerating

Bottom line: Alignment is hard but solvable. With sustained effort, likely to succeed before transformative AI.

Credence: 70-85% we solve alignment in time.

This doesn’t mean complacency---alignment requires serious, sustained work. But pessimism and defeatism are unwarranted. We’re on a reasonable track.

Empirical Research on Alignment Techniques
