Defense in Depth Model

| Attribute | Value |
|---|---|
| Importance | 83 |
| Model Type | Defense Framework |
| Scope | Layered Safety Architecture |
| Key Insight | Multiple independent safety layers provide robustness against single-point failures |
| Model Quality | Novelty 4 · Rigor 5 · Actionability 5 · Completeness 5 |

Defense in depth applies the security principle of layered protection to AI safety: deploy multiple independent safety measures so that if one fails, others still provide protection. This model provides a mathematical framework for analyzing how safety interventions combine, when multiple weak defenses outperform single strong ones, and how to identify correlated failure modes.

Key finding: Independent layers with 20-60% individual failure rates can achieve combined failure rates of 1-3%, but deceptive alignment creates dangerous correlations that increase combined failure to 12%+. No single AI safety intervention is reliable enough to trust alone - layered defenses with diverse failure modes provide more robust protection.

| Factor | Level | Evidence | Timeline |
|---|---|---|---|
| Severity | Critical | Single-layer failures: 20-60%; independent 5-layer: 1-3%; correlated 5-layer: 12%+ | Current |
| Likelihood | High | All current safety interventions have significant failure rates | 2024-2027 |
| Trend | Improving | Growing recognition of need for layered approaches | Next 3-5 years |
| Tractability | Medium | Implementation straightforward; reducing correlation difficult | Ongoing |

AI safety operates through five defensive layers, each protecting against different failure modes:

| Layer | Primary Function | Key Interventions | Failure Rate Range |
|---|---|---|---|
| Training Safety | Build aligned goals during development | RLHF, Constitutional AI, data curation | 20-40% |
| Evaluation Safety | Detect problems before deployment | Red-teaming, interpretability, capability evals | 25-35% |
| Runtime Safety | Monitor deployed systems | Output filtering, monitoring, sandboxing | 30-50% |
| Institutional Safety | Governance and oversight | Responsible scaling, audits, regulation | 40-60% |
| Recovery Safety | Respond to failures | Incident response, shutdown, rollback | 20-40% |

The effectiveness of layered defenses depends critically on independence. When layers fail independently, protection compounds multiplicatively. When correlated, layers fail together.

| Layer Pair | Correlation (ρ) | Primary Correlation Source | Impact |
|---|---|---|---|
| Training-Evaluation | 0.4 | Deceptive alignment affects both | High correlation reduces redundancy |
| Training-Runtime | 0.5 | Deception evades monitoring | Highest correlation pair |
| Training-Institutional | 0.2 | Mostly separate domains | Good independence |
| Evaluation-Runtime | 0.3 | Both rely on behavioral signals | Moderate correlation |
| Institutional-Technical | 0.1-0.3 | Different failure mechanisms | Best independence |

When layers fail independently, combined failure probability follows:

P(\text{all layers fail}) = \prod_{i=1}^{n} P(\text{Layer } i \text{ fails})

Example: Three layers with 20%, 30%, 40% failure rates:

  • Combined failure: 0.20 × 0.30 × 0.40 = 2.4%
  • Improvement over best single layer: roughly 8x reduction (20% → 2.4%)
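
A minimal sketch of the independence calculation (Python, using the illustrative 20/30/40% rates above):

```python
import math

# Illustrative failure rates for three independent safety layers (from the example above)
layer_failure_rates = [0.20, 0.30, 0.40]

# Under independence, the combined failure rate is the product of the individual rates
p_all_fail = math.prod(layer_failure_rates)

print(f"Combined failure (independent): {p_all_fail:.1%}")                              # 2.4%
print(f"Reduction vs best single layer: {min(layer_failure_rates) / p_all_fail:.1f}x")  # ~8.3x
```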

Real layers exhibit correlation, requiring conditional probability:

P(\text{all layers fail}) = P(L_1) \times P(L_2 \mid L_1) \times P(L_3 \mid L_1 \cap L_2) \times \cdots

Deception scenario:

  • Training fails (produces deceptive model): 20%
  • Evaluation fails given deceptive model: 70%
  • Runtime fails given both failed: 90%
  • Combined: 0.20 × 0.70 × 0.90 = 12.6% (5x worse than independent)
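
The same arithmetic via the chain rule, using the illustrative conditional probabilities from the deception scenario (assumed values from the text, not measured rates):

```python
# Chain rule for correlated failures: P(all fail) = P(L1) x P(L2|L1) x P(L3|L1 and L2)
p_training_fails = 0.20              # training produces a deceptive model
p_eval_fails_given_deceptive = 0.70  # evaluation misses the deception
p_runtime_fails_given_prior = 0.90   # runtime monitoring also misses it

p_correlated = p_training_fails * p_eval_fails_given_deceptive * p_runtime_fails_given_prior
p_independent = 0.20 * 0.30 * 0.40   # baseline from the independent example

print(f"Correlated combined failure: {p_correlated:.1%}")                   # 12.6%
print(f"Penalty vs independent case: {p_correlated / p_independent:.1f}x")  # ~5.3x
```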

Most layers show partial correlation. Using correlation coefficient ρ ∈ [0,1]:

P_{\text{eff}} = P_{\text{ind}}^{(1-\rho)} \times P_{\text{corr}}^{\rho}

| Correlation Level | ρ | Effective Failure Rate | Performance vs Independent |
|---|---|---|---|
| Independent | 0.0 | 2.4% | Baseline |
| Low correlation | 0.3 | 4.1% | 1.7x worse |
| Moderate correlation | 0.5 | 5.5% | 2.3x worse |
| High correlation | 0.7 | 7.4% | 3.1x worse |
| Perfect correlation | 1.0 | 12.6% | 5.3x worse |
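
A sketch of this interpolation heuristic; the endpoints (2.4% and 12.6%) come from the two examples above, and the outputs approximately reproduce the table:

```python
def effective_failure(p_ind: float, p_corr: float, rho: float) -> float:
    """Geometric interpolation between independent and fully correlated failure rates."""
    return (p_ind ** (1 - rho)) * (p_corr ** rho)

p_ind, p_corr = 0.024, 0.126  # independent and fully correlated endpoints from above

for rho in (0.0, 0.3, 0.5, 0.7, 1.0):
    p_eff = effective_failure(p_ind, p_corr, rho)
    print(f"rho = {rho:.1f}: effective failure {p_eff:.1%} ({p_eff / p_ind:.1f}x baseline)")
```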

Layer 1: Training Safety

Function: Build aligned objectives during model development

Key Interventions:

| Method | Mechanism | Failure Mode | Effectiveness |
|---|---|---|---|
| RLHF | Human feedback shapes behavior | Reward hacking, humans deceived | 60-80% success rate |
| Constitutional AI | Principles guide responses | Principles incomplete/misinterpreted | 65-75% success rate |
| Data curation | Remove harmful training examples | Emergent harmful patterns | 70-85% success rate |

Correlation drivers: Training methods often share assumptions about alignment definition and measurement, creating correlated vulnerabilities to goal misgeneralization and deceptive alignment.

Layer 2: Evaluation Safety

Function: Detect problems before deployment

Current gaps according to METR:

  • Cannot test for unknown capabilities
  • Red teams often less capable than models
  • Interpretability incomplete at scale

| Method | Coverage | Blind Spots | Independence Level |
|---|---|---|---|
| Capability evaluations | Known dangerous capabilities | Unknown/emergent capabilities | Medium |
| Red-teaming | Adversarial robustness | Capabilities beyond red team | High |
| Interpretability | Internal model states | Complex/distributed representations | High |
| Model organisms | Controlled failure analysis | Generalization to real deployment | Medium |

Layer 3: Runtime Safety

Function: Monitor and constrain deployed systems

Implementation challenges:

  • Advanced AI may evade monitoring
  • Human oversight becomes bottleneck
  • Sandbox escapes possible with sufficient capability

| Intervention | Protection Level | Failure Scenarios | Correlation Risk |
|---|---|---|---|
| Output filtering | Surface-level | Steganographic communication | High with evaluation |
| Behavioral monitoring | Behavioral anomalies | Novel attack patterns | High with training |
| Sandboxing | Capability limitation | Sandbox escapes, social engineering | Medium |
| Human oversight | Final approval | Human manipulation, overload | Low |

Layer 4: Institutional Safety

Function: Organizational and regulatory oversight

Advantage: Largely independent of technical failure modes

| Mechanism | Independence Level | Current Status | Effectiveness |
|---|---|---|---|
| Responsible scaling policies | Medium | Anthropic, OpenAI implementing | Unproven at scale |
| External audits | High | AISI developing frameworks | Early stage |
| Regulatory oversight | High | EU AI Act, US executive order | Implementation gaps |

Layer 5: Recovery Safety

Function: Respond when prevention fails

Critical limitation: Recovery may be impossible with sufficiently advanced systems

| Capability | Feasibility | Current Readiness | Key Dependencies |
|---|---|---|---|
| Shutdown procedures | High for current systems | Basic protocols exist | System cooperation |
| Incident response | Medium | Limited AI-specific experience | Detection speed |
| Damage containment | Low for advanced AI | Minimal preparation | Technical capabilities |
| System rollback | Medium | Standard software practices | Backup integrity |

Deceptive Alignment: The Primary Correlation Driver

Deceptive alignment creates the most dangerous correlation because it systematically defeats Layers 1-3:

Breaking deception correlation:

| Approach | Mechanism | Promise | Status |
|---|---|---|---|
| Interpretability | Direct cognition analysis | Could detect deceptive reasoning | Research ongoing |
| AI Control | Assume deception, limit capabilities | Reduces impact regardless | Promising results |
| Diverse evaluation | Multiple testing approaches | Different deception vulnerabilities | Early development |
| Adversarial training | Train against deception | May teach better deception | Mixed results |

Sudden capability jumps can defeat multiple layers calibrated for weaker systems:

| Scenario | Affected Layers | Failure Mechanism | Mitigation |
|---|---|---|---|
| Training surprise | 1-3 | Safety measures insufficient | Conservative thresholds |
| Evaluation gap | 2-3 | Tests miss new capabilities | Broad capability assessment |
| Runtime escape | 3-5 | Monitoring inadequate | Staged deployment |

Mathematical threshold: Multiple weak defenses outperform a single strong defense when:

\prod_{i=1}^{n} P_i^{\text{weak}} < P_{\text{single}}^{\text{strong}}

Example: Three 30% defenses vs one 10% defense

  • Independent: 0.30³ = 2.7% (better than 10%)
  • Correlated (ρ=0.5): ~8% (worse than 10%)

Key insight: Independence is more important than individual strength.
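
A minimal check of this threshold under the independence assumption, using the three-30%-versus-one-10% numbers from the example:

```python
import math

weak_rates = [0.30, 0.30, 0.30]  # three weaker, independent defenses
strong_rate = 0.10               # one stronger single defense

p_weak_stack = math.prod(weak_rates)  # 2.7% when failures are truly independent

print(f"Combined failure of weak stack: {p_weak_stack:.1%}")
print(f"Weak stack beats single strong defense: {p_weak_stack < strong_rate}")
# With correlation (rho around 0.5), the effective rate rises toward ~8-9%,
# flipping the comparison in favor of the single strong defense.
```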

| Priority | Investment Category | Rationale | Funding Range |
|---|---|---|---|
| Tier 1 | Reduce correlation between Layers 1-3 | Biggest risk reduction | $100-250M annually |
| Tier 2 | Strengthen institutional oversight | Independent backup | $50-120M annually |
| Tier 3 | Improve individual layer effectiveness | Diminishing returns | $75-150M annually |
| Tier 4 | Recovery capability development | Last resort preparation | $25-75M annually |

For AI Labs:

  • Conduct correlation audits between safety layers
  • Use independent teams for different layers
  • Invest heavily in deception-robust evaluation
  • Develop shutdown and rollback capabilities

For Policymakers:

  • Require demonstration of layer independence
  • Mandate multiple independent safety evaluations
  • Establish government monitoring capabilities
  • Create emergency response frameworks

For Safety Researchers:

  • Prioritize research that breaks correlation (especially interpretability)
  • Develop evaluation methods immune to deception
  • Study institutional safety mechanisms
  • Build recovery and containment capabilities

| Layer | Implementation Quality | Key Gaps | Trend |
|---|---|---|---|
| Training | Moderate | Deception detection, robustness | Improving |
| Evaluation | Weak | Comprehensive testing, unknown capabilities | Slow progress |
| Runtime | Basic | Monitoring sophistication, human oversight | Early development |
| Institutional | Minimal | Regulatory frameworks, enforcement | Accelerating |
| Recovery | Very weak | Shutdown capabilities, incident response | Neglected |

Likely developments:

  • Training layer: Better RLHF, constitutional approaches reach maturity
  • Evaluation layer: Standardized testing suites, some interpretability progress
  • Runtime layer: Improved monitoring, basic AI control implementation
  • Institutional layer: Regulatory frameworks implemented, auditing standards
  • Recovery layer: Basic protocols developed but untested at scale

Key uncertainties:

  • Will interpretability break deception correlation?
  • Can institutional oversight remain independent as AI capabilities grow?
  • Are recovery mechanisms possible for advanced AI systems?

“The key insight is that we need multiple diverse approaches, not just better versions of the same approach.” - Paul Christiano on alignment strategy

“Defense in depth is essential, but we must be realistic about correlation. Deceptive alignment could defeat multiple technical layers simultaneously.” - Evan Hubinger on correlated failures

“Institutional oversight may be our most important defense because it operates independently of technical capabilities.” - Allan Dafoe on governance importance

Key Questions

What are the true correlation coefficients between current safety interventions?
Can interpretability research make sufficient progress to detect deceptive alignment?
Will institutional oversight remain effective as AI systems become more capable?
Is recovery possible once systems exceed certain capability thresholds?
How many layers are optimal given implementation costs and diminishing returns?

Strengths:

  • Provides quantitative framework for analyzing safety combinations
  • Identifies correlation as the critical factor in defense effectiveness
  • Offers actionable guidance for resource allocation and implementation

Limitations:

  • True correlation coefficients are unknown and may vary significantly
  • Assumes static failure probabilities but capabilities and threats evolve
  • May not apply to superintelligent systems that understand all defensive layers
  • Treats adversarial threats as random events rather than strategic optimization
  • Does not account for complex dynamic interactions between layers

Critical assumption: The model assumes that multiple layers can remain meaningfully independent even as AI systems become more capable at strategic deception and manipulation.

| Paper | Key Contribution | Link |
|---|---|---|
| Greenblatt et al. (2024) | AI Control framework assuming potential deception | arXiv:2312.06942 |
| Shevlane et al. (2023) | Model evaluation for extreme risks | arXiv:2305.15324 |
| Ouyang et al. (2022) | Training language models to follow instructions with human feedback | arXiv:2203.02155 |
| Hubinger et al. (2024) | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | arXiv:2401.05566 |

| Organization | Report | Focus | Link |
|---|---|---|---|
| Anthropic | Responsible Scaling Policy | Layer implementation framework | anthropic.com |
| METR | Model Evaluation Research | Evaluation layer gaps | metr.org |
| MIRI | Security Mindset and AI Alignment | Adversarial perspective | intelligence.org |
| RAND | Defense in Depth for AI Systems | Military security applications | rand.org |

| Document | Jurisdiction | Relevance | Link |
|---|---|---|---|
| EU AI Act | European Union | Regulatory requirements for layered oversight | digital-strategy.ec.europa.eu |
| Executive Order on AI | United States | Federal approach to AI safety requirements | whitehouse.gov |
| UK AI Safety Summit | United Kingdom | International coordination on safety measures | gov.uk |