
Carlsmith's Six-Premise Argument

LLM Summary:Carlsmith's 2022 report decomposes AI x-risk into six conditional premises: (1) advanced AI developed by 2070 (65%), (2) strong deployment incentives (80%), (3) alignment harder than misalignment (40%), (4) misaligned AI seeks power (65%), (5) power-seeking scales to disempowerment (40%), (6) disempowerment is catastrophic (95%). Combined estimate: >10% existential catastrophe by 2070. Superforecaster comparison showed P3 and P4 as key cruxes.
Model: Carlsmith’s Six-Premise Argument

| Attribute | Value |
|---|---|
| Importance | 90 |
| Model Type | Probability Decomposition |
| Target Risk | Power-Seeking AI X-Risk |
| Combined Estimate | >10% by 2070 |
| Model Quality: Novelty | 3 |
| Model Quality: Rigor | 5 |
| Model Quality: Actionability | 4 |
| Model Quality: Completeness | 5 |

Joe Carlsmith’s 2022 report “Is Power-Seeking AI an Existential Risk?” provides the most rigorous public framework for estimating AI existential risk. Rather than offering a single probability, Carlsmith decomposes the argument into six conditional premises, each with its own credence. This enables structured disagreement—critics can identify which premises they reject rather than disputing a black-box estimate.

The framework focuses on APS systems (Advanced capabilities, agentic Planning, Strategic awareness) and asks: what’s the probability that building such systems leads to existential catastrophe through power-seeking behavior?

Bottom line: Carlsmith originally estimated ~5% risk of existential catastrophe from power-seeking AI by 2070. He has since updated to >10% based on faster-than-expected capability progress.


| Premise | Question | Carlsmith’s Credence | Uncertainty |
|---|---|---|---|
| P1: Timelines | Will we develop advanced, agentic, strategically aware AI by 2070? | 65% | Medium |
| P2: Incentives | Will there be strong incentives to build and deploy such systems? | 80% | Low |
| P3: Alignment Difficulty | Is it substantially harder to build aligned systems than misaligned ones? | 40% | High |
| P4: Power-Seeking | Will some misaligned APS systems seek power in ways that significantly harm humans? | 65% | High |
| P5: Disempowerment | Will this scale to full human disempowerment? | 40% | Very High |
| P6: Catastrophe | Would such disempowerment constitute existential catastrophe? | 95% | Low |

Combined estimate: 0.65 × 0.80 × 0.40 × 0.65 × 0.40 × 0.95 ≈ 5.1%

Carlsmith notes this is a rough calculation—the premises aren’t fully independent, and there are additional considerations. His all-things-considered estimate is >10% as of 2023.
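As a sanity check on that arithmetic, here is a minimal Python sketch (the premise labels and variable names are just illustrative) that multiplies the six credences:

```python
import math

# Carlsmith's credences for the six conditional premises, as listed above.
credences = {
    "P1: Timelines": 0.65,
    "P2: Incentives": 0.80,
    "P3: Alignment difficulty": 0.40,
    "P4: Power-seeking": 0.65,
    "P5: Disempowerment": 0.40,
    "P6: Catastrophe": 0.95,
}

# Naive combined estimate: treat the premises as independent and multiply.
combined = math.prod(credences.values())
print(f"Combined estimate: {combined:.1%}")  # -> 5.1%
```

The product is only a rough summary; as noted above, the premises are not fully independent and Carlsmith’s all-things-considered number is higher.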


P1: Timelines (65%)

The claim: By 2070, it will be possible and financially feasible to build AI systems that are:

  • (A)dvanced: Outperform humans at most cognitive tasks
  • (P)lanning: Capable of sophisticated multi-step planning toward goals
  • (S)trategically aware: Understand themselves, their situation, and human society

Why 65%?

  • Rapid progress in deep learning suggests continued advancement
  • Economic incentives are enormous
  • No fundamental barriers identified (though uncertainty remains)
  • 2070 allows ~45 years of development

Key considerations:

  • Timeline estimates have shortened significantly since 2022
  • Some researchers now expect APS-level systems by 2030-2040
  • Carlsmith’s estimate may be conservative by current standards

P2: Incentives (80%)

The claim: Conditional on P1, there will be strong incentives to actually build and deploy APS systems (not just have the capability).

Why 80%?

  • Massive economic value from advanced AI
  • Competitive pressure between companies and nations
  • Difficult to coordinate global restraint
  • Potential military and strategic advantages

Key considerations:

  • Racing dynamics increase this probability
  • Voluntary restraint has limited historical success
  • Even safety-conscious actors face pressure to deploy

P3: Alignment Harder Than Misalignment (40%)


The claim: Conditional on P1-P2, it’s substantially harder to develop APS systems that don’t pursue misaligned goals than ones that do.

Why 40%? (High uncertainty)

  • Current techniques (RLHF, Constitutional AI) show promise but unproven at scale
  • Goal misgeneralization is a real phenomenon
  • Value specification is genuinely hard
  • But: we’re not starting from scratch; we choose training objectives

This is a key crux: Optimists about AI safety often reject P3—they believe alignment will be tractable with sufficient effort. Pessimists believe the problem is fundamentally hard.

Superforecaster data: This premise showed the highest variance in the 2023 superforecaster comparison study (see below).

P4: Power-Seeking (65%)

The claim: Conditional on P1-P3, some deployed misaligned APS systems will seek to gain and maintain power in ways that significantly harm humans.

Why 65%?

  • Instrumental convergence arguments suggest power-seeking is useful for most goals
  • Resource acquisition helps achieve almost any objective
  • Self-preservation is instrumentally useful
  • But: power-seeking requires sophisticated planning; some misaligned systems might be harmlessly misaligned

Key considerations:

  • The Turner et al. (2021) formal results support instrumental convergence
  • Power-seeking doesn’t require malice—just optimization pressure
  • Detection might be possible before catastrophic power is gained

P5: Disempowerment (40%)

The claim: Conditional on P1-P4, this power-seeking will scale to the point of fully disempowering humanity.

Why 40%? (Very high uncertainty)

  • Requires AI systems to be capable enough to actually seize control
  • Humans might detect and respond before full disempowerment
  • Multiple AI systems might compete rather than cooperate against humans
  • But: sufficiently capable AI might be very difficult to stop

This premise captures “how bad does it get?”

  • Partial harm vs. full disempowerment
  • Recoverable setback vs. permanent loss of control

P6: Catastrophe (95%)

The claim: Conditional on P1-P5, full human disempowerment constitutes existential catastrophe.

Why 95%?

  • Disempowered humans can’t ensure good outcomes
  • AI goals, even if not actively hostile, likely don’t include human flourishing
  • Loss of control over the long-term future is effectively extinction-equivalent

Key considerations:

  • Some argue AI might coincidentally produce good outcomes
  • “Benevolent dictator AI” scenario seems unlikely but not impossible
  • Most value at stake is in the long-term future

Carlsmith focuses specifically on APS systems—not all AI:

| Property | Definition | Why It Matters |
|---|---|---|
| Advanced | Outperforms humans at most cognitive tasks | Necessary for AI to pose existential threat |
| Planning | Pursues goals through multi-step strategies | Enables instrumental power-seeking |
| Strategic | Understands itself, humans, and the situation | Enables sophisticated deception and manipulation |

Current systems: GPT-4 and Claude have some APS properties but likely don’t fully qualify. They show:

  • Advanced performance on many tasks (A: partial)
  • Limited genuine planning (P: minimal)
  • Some situational awareness (S: emerging)

Why this framing matters: The argument doesn’t apply to narrow AI, tool AI, or systems without these specific properties. Critics can argue that future AI won’t have these properties (rejecting P1) rather than disputing the consequences.


In 2023, Carlsmith worked with superforecasters to test his estimates. Key findings from the comparison study:

| Premise | Carlsmith | Superforecasters (Median) | Difference |
|---|---|---|---|
| P1 | 65% | 55% | -10pp |
| P2 | 80% | 78% | -2pp |
| P3 | 40% | 25% | -15pp |
| P4 | 65% | 35% | -30pp |
| P5 | 40% | 25% | -15pp |
| P6 | 95% | 85% | -10pp |
| Combined | ~5-10% | ~0.4% | ~10x difference |
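For comparison, a short sketch (illustrative only) that multiplies each column of median credences from the table:

```python
import math

carlsmith       = [0.65, 0.80, 0.40, 0.65, 0.40, 0.95]
superforecaster = [0.55, 0.78, 0.25, 0.35, 0.25, 0.85]  # medians from the table

print(f"Carlsmith product:       {math.prod(carlsmith):.1%}")        # ~5.1%
print(f"Superforecaster product: {math.prod(superforecaster):.1%}")  # ~0.8%
```

Note that the product of the superforecaster medians (~0.8%) does not equal their reported combined estimate (~0.4%): medians of individual premises need not multiply to the group’s aggregate, and all-things-considered estimates differ from the naive product, just as Carlsmith’s do.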

P3 (Alignment Difficulty): A major source of disagreement. Superforecasters were more optimistic about alignment tractability.

P4 (Power-Seeking): The largest single gap (-30pp). Superforecasters doubted that misaligned systems would actually pursue power-seeking strategies.

Implications:

  • If you’re skeptical of AI x-risk, these are likely the premises you reject
  • If you’re concerned, P3 and P4 are where safety work has highest leverage
  • Resolving disagreement requires evidence about alignment difficulty and power-seeking likelihood

Different interventions target different premises:

| Intervention | Primary Premise(s) | Mechanism |
|---|---|---|
| Compute governance | P1, P2 | Slow capability development, reduce deployment incentives |
| International coordination | P2 | Reduce racing pressure |
| Alignment research | P3 | Make aligned systems easier to build |
| Interpretability | P3, P4 | Detect misalignment before deployment |
| AI evaluations | P4, P5 | Identify dangerous capabilities |
| AI control | P5 | Contain power-seeking before full disempowerment |
| RSPs (Responsible Scaling Policies) | P2, P4, P5 | Gate deployment on safety |

Mapping Carlsmith’s argument onto our framework:

| Carlsmith Argument | Our Framework |
|---|---|
| Full argument (P1-P6) | Rapid AI Takeover |
| P3 focus | Alignment Robustness parameter |
| P4 focus | Power-Seeking Conditions model |
| P2 dynamics | Racing Intensity parameter |

| Premise | Most Relevant Aggregate |
|---|---|
| P1 (Timelines) | External factor (not a parameter we influence much) |
| P2 (Incentives) | Misuse Potential |
| P3 (Alignment) | Misalignment Potential |
| P4 (Power-Seeking) | Misalignment Potential |
| P5 (Scaling) | Governance Capacity |
| P6 (Catastrophe) | Definition (not a parameter) |

What drove Carlsmith’s update from ~5% to >10%:

| Factor | Direction | Magnitude |
|---|---|---|
| Faster capability progress | ↑ Risk | Significant |
| Shorter timelines | ↑ P1 | ~+10-15pp |
| Observed emergent behaviors | ↑ P4 | Moderate |
| Better alignment techniques | ↓ P3 | Unclear |
| Overall | ↑ Risk | ~5% → >10% |

Supporting higher risk:

  • Faster-than-expected capability progress since the 2022 report
  • Significantly shortened timeline estimates
  • Observed emergent behaviors

Supporting lower risk:

  • RLHF and Constitutional AI show some effectiveness
  • No catastrophic failures from deployed systems yet
  • Safety community growing and becoming more sophisticated

Common objections and responses:

| Objection | Response |
|---|---|
| “Premises aren’t independent” | True; Carlsmith acknowledges this. The multiplication is illustrative, not rigorous. |
| “APS systems might not be built” | Possible, but would require rejecting P1, which seems increasingly implausible. |
| “Power-seeking is anthropomorphic” | Instrumental convergence arguments are about optimization, not psychology. |
| “We’ll see warning signs” | Captured in P5; the question is whether we can respond effectively. |
| “AI systems will be tools, not agents” | APS specifically describes agentic systems; tools are out of scope. |
Limitations of the framework:

  1. Doesn’t cover all risks: Focuses on power-seeking; doesn’t address catastrophic misuse or gradual disempowerment
  2. Binary framing: Treats each premise as yes/no; reality may be continuous
  3. Sensitive to framing: Different decompositions might yield different estimates
  4. Relies on speculation: All estimates are fundamentally about unprecedented situations

To apply the framework yourself:

  1. Go through each premise and assign your own credence
  2. Identify which premises you’re most uncertain about
  3. Consider what evidence would update your estimates
  4. Multiply (roughly) to get your overall estimate (see the sketch below)
  5. Compare to Carlsmith’s and superforecasters’ to understand where you differ
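A hypothetical helper for steps 4-5; the placeholder credences are purely illustrative and meant to be replaced with your own:

```python
import math

def combined_estimate(p1, p2, p3, p4, p5, p6):
    """Naive combined estimate: the product of the six premise credences."""
    return math.prod([p1, p2, p3, p4, p5, p6])

# Replace these placeholder values with your own credences for P1-P6.
mine = combined_estimate(0.50, 0.80, 0.30, 0.50, 0.30, 0.90)

carlsmith        = combined_estimate(0.65, 0.80, 0.40, 0.65, 0.40, 0.95)
superforecasters = combined_estimate(0.55, 0.78, 0.25, 0.35, 0.25, 0.85)

print(f"Yours: {mine:.1%}   Carlsmith: {carlsmith:.1%}   Superforecasters: {superforecasters:.1%}")
```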

Focus on the premises that:

  • Have highest uncertainty (P3, P4, P5)
  • You personally can influence
  • Would most change the overall estimate if resolved (see the sensitivity sketch below)
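Because the naive estimate is a product, raising a given premise by a fixed number of percentage points moves the result most when the product of the remaining premises is largest, i.e., when that premise’s own credence is smallest. A rough sensitivity sketch under that simplification, using Carlsmith’s numbers:

```python
import math

credences = {"P1": 0.65, "P2": 0.80, "P3": 0.40, "P4": 0.65, "P5": 0.40, "P6": 0.95}
baseline = math.prod(credences.values())  # ~5.1%

# Combined estimate if each premise, in turn, were 10 percentage points higher.
for name, p in credences.items():
    bumped = baseline / p * min(p + 0.10, 1.0)
    print(f"{name}: {baseline:.1%} -> {bumped:.1%}")
```

On this crude measure, the lowest-credence premises (P3 and P5) shift the combined estimate the most per percentage point of change.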

The framework enables productive disagreement:

  • “I think P3 is too high because…” is more useful than “I think AI risk is overblown”
  • Identifies specific empirical questions that could resolve debates
  • Maps interventions to the premises they address