
Table Candidates

These table rows have rating combinations that suggest they contain surprising or important information worth extracting as standalone insights. Each card includes a potential insight template you can copy and refine.

38 Total Candidates · 27 Safety Approaches · 11 Accident Risks
What makes a table row insight-worthy?
  • Safety Approaches: Capability-dominant differential progress, weak/no deception robustness, PRIORITIZE/DEFUND recommendations, unclear net safety
  • Accident Risks: Catastrophic/existential severity combined with difficult detectability, lab-demonstrated evidence of severe risks
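The two criteria sets above can be read as a simple row filter. The sketch below is a hypothetical illustration of that logic; the field names (`category`, `deception_robust`, `severity`, etc.) and enum values are assumptions modeled on the card data shown here, not the site's actual data model.

```python
# Hypothetical sketch of the insight-worthiness rules; field names and
# enum values are assumptions based on the card data shown on this page.

SAFETY_FLAGS = {
    "differential_progress": {"CAPABILITY-DOMINANT"},   # capability-dominant progress
    "deception_robust": {"WEAK", "NONE"},               # weak/no deception robustness
    "recommendation": {"PRIORITIZE", "REDUCE"},         # PRIORITIZE/DEFUND-style calls
    "net_safety": {"UNCLEAR", "HARMFUL"},               # unclear or harmful net safety
}

RISK_SEVERITIES = {"CATASTROPHIC", "EXISTENTIAL"}
RISK_DETECTABILITY = {"DIFFICULT", "VERY_DIFFICULT"}

def is_candidate(row: dict) -> bool:
    """Return True if a table row matches any insight-worthy criterion."""
    if row.get("category") == "Safety Approaches":
        # Any single flagged rating makes the row a candidate.
        return any(row.get(field) in values for field, values in SAFETY_FLAGS.items())
    if row.get("category") == "Accident Risks":
        severe = row.get("severity") in RISK_SEVERITIES
        hidden = row.get("detectability") in RISK_DETECTABILITY
        lab = row.get("evidence") == "DEMONSTRATED_LAB"
        # Severe + hard to detect, or severe + lab-demonstrated.
        return (severe and hidden) or (severe and lab)
    return False
```

For example, a row with `deception_robust="NONE"` qualifies on its own, matching how RLHF and Red Teaming appear below.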

RLHF

Safety Approaches
Matched Criteria
  • Capability-dominant (questionable safety value)
  • Weak/no deception robustness
  • Reduce funding recommendation
  • Unclear/harmful net safety
  • Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: DOMINANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"RLHF provides more capability uplift (DOMINANT) than safety benefit (LOW-MEDIUM) and offers no deception robustness: a deceptive model could easily learn to produce human-approved outputs while pursuing different goals. It does not scale to superintelligence, since human feedback cannot cover superhuman tasks; humans can't evaluate what they can't understand. Reduced funding is recommended (already overfunded; marginal safety dollars are better spent elsewhere), and its net impact on world safety is unclear."

Constitutional AI / RLAIF

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
  • Unclear/harmful net safety
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-LEANING
Deception Robust: WEAK
Recommendation: MAINTAIN
Potential Insight
"Constitutional AI / RLAIF offers only weak deception robustness: if the base model is deceptive, constitutional AI oversight inherits its limitations. Its net impact on world safety is unclear."

AI Safety via Debate

Safety Approaches
Matched Criteria
  • Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: PARTIAL
Recommendation: INCREASE
Potential Insight
"AI Safety via Debate has unclear net impact on world safety."

Weak-to-Strong Generalization

Safety Approaches
Matched Criteria
  • Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: UNKNOWN
Recommendation: INCREASE
Potential Insight
"Weak-to-Strong Generalization has unclear net impact on world safety."

Reward Modeling

Safety Approaches
Matched Criteria
  • Capability-dominant (questionable safety value)
  • Weak/no deception robustness
  • Reduce funding recommendation
  • Unclear/harmful net safety
Key Ratings
Safety Uplift: LOW
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"Reward Modeling provides more capability uplift (SIGNIFICANT) than safety benefit (LOW) and offers no deception robustness: a sophisticated policy can game the reward model. Reduced funding is recommended (already heavily funded; it inherits RLHF's problems), and its net impact on world safety is unclear."

Model Specifications

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Model Specifications offers only weak deception robustness: specs define desired behavior but don't ensure genuine compliance."

Adversarial Training

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: SOME
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Adversarial Training offers no deception robustness: it targets external attacks, not internal deception."

Mechanistic Interpretability

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: LOW (now) / HIGH (potential)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: STRONG (if works)
Recommendation: PRIORITIZE
Potential Insight
"Mechanistic Interpretability is rated PRIORITIZE (one of the few paths to detecting deception; needs much more investment)."

Dangerous Capability Evaluations

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Dangerous Capability Evaluations offers only weak deception robustness: a deceptive model might hide capabilities during evals."

Red Teaming

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Red Teaming offers no deception robustness: a deceptive model would pass red teaming."

Alignment Evaluations

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
  • High priority recommendation
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: PRIORITIZE
Potential Insight
"Alignment Evaluations offers only weak deception robustness: a deceptive model could fake alignment on evals. It is rated PRIORITIZE (a critical gap; we need better ways to measure alignment)."

Third-Party Model Auditing

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Third-Party Model Auditing offers only weak deception robustness: auditors face the same detection challenges."

AI Safety Cases

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Safety Cases is rated PRIORITIZE (a promising framework; severely underdeveloped for AI)."

Capability Elicitation

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Elicitation offers only weak deception robustness: a deceptive model might hide capabilities during elicitation."

Sleeper Agent Detection

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: UNKNOWN
Recommendation: PRIORITIZE
Potential Insight
"Sleeper Agent Detection is rated PRIORITIZE (a core alignment problem; needs much more work)."

Output Filtering

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
  • Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Output Filtering offers no deception robustness: a deceptive model could bypass or manipulate filters. It does not scale to superintelligence, since sophisticated users and models can evade filters."

Refusal Training

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
  • Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Refusal Training offers no deception robustness: refusals are behavioral, not goal-level. It does not scale to superintelligence; jailbreaks are consistently found, producing an arms race."

Monitoring / Trip Wires

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Monitoring / Trip Wires offers only weak deception robustness: a deceptive AI would avoid triggering monitors."

Circuit Breakers / Inference Interventions

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Circuit Breakers / Inference Interventions offers only weak deception robustness: a deceptive model could generate harm before the circuit breaks."

Compute Governance

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"Compute Governance is rated PRIORITIZE (one of the few levers to affect the timeline; very underfunded)."

Evals-Based Deployment Gates

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Evals-Based Deployment Gates offers only weak deception robustness: deceptive models could pass the evals."

Pause / Moratorium

Safety Approaches
Matched Criteria
  • Unclear/harmful net safety
Key Ratings
Safety Uplift: HIGH (if implemented)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: MAINTAIN
Potential Insight
"Pause / Moratorium has unclear net impact on world safety."

International AI Governance

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"International AI Governance is rated PRIORITIZE (critical infrastructure; severely underdeveloped)."

Corrigibility Research

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"Corrigibility Research is rated PRIORITIZE (severely underfunded relative to its importance; a key unsolved problem)."

Eliciting Latent Knowledge (ELK)

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: STRONG (if solved)
Recommendation: PRIORITIZE
Potential Insight
"Eliciting Latent Knowledge (ELK) is rated PRIORITIZE (solves the deception problem if successful; needs a breakthrough)."

Capability Unlearning / Removal

Safety Approaches
Matched Criteria
  • Weak/no deception robustness
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Unlearning / Removal offers only weak deception robustness: a model might hide capabilities rather than truly unlearn them."

AI Control

Safety Approaches
Matched Criteria
  • High priority recommendation
Key Ratings
Safety Uplift: HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Control is rated PRIORITIZE (a fundamental requirement; increasingly important with agentic AI)."

Mesa-Optimization

Accident Risks
Matched Criteria
  • Severe + hard to detect
Key Ratings
Severity: CATASTROPHIC
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: UNCERTAIN
Potential Insight
"Mesa-Optimization: catastrophic severity but very difficult to detect. Well-established theoretically (Hubinger et al. 2019); whether current LLMs are mesa-optimizers is debated."

Instrumental Convergence

Accident Risks
Matched Criteria
  • Lab-demonstrated catastrophic risk
  • Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Instrumental Convergence: already demonstrated in lab settings and relevant to current AI systems. Formal proofs exist (Turner et al. 2021), alongside empirical evidence: 78% alignment faking (Anthropic 2024) and 79% shutdown resistance (Palisade 2025)."

Deceptive Alignment

Accident Risks
Matched Criteria
  • Severe + hard to detect
  • Lab-demonstrated catastrophic risk
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: NEAR_TERM
Potential Insight
"Deceptive Alignment: existential severity but very difficult to detect, and already demonstrated in lab settings. The sleeper agents study (Anthropic 2024) showed deception persisting through safety training, with a 78% alignment faking rate observed."

Goal Misgeneralization

Accident Risks
Matched Criteria
  • Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Goal Misgeneralization: already demonstrated in lab settings and relevant to current AI systems. Demonstrated in RL environments (CoinRun, etc.); less clear whether it occurs in LLMs."

Scheming

Accident Risks
Matched Criteria
  • Severe + hard to detect
  • Lab-demonstrated catastrophic risk
  • Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Scheming: catastrophic severity but difficult to detect, already demonstrated in lab settings, and relevant to current AI systems. Apollo Research (Dec 2024) found that all 5 tested frontier models demonstrated in-context scheming; o1 disabled oversight in 5% of trials."

Sandbagging

Accident Risks
Matched Criteria
  • Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Sandbagging: already demonstrated in lab settings and relevant to current AI systems. Claude 3.5 Sonnet strategically underperformed in evaluations (Apollo Research 2024), and multiple models showed sandbagging capability."

Power-Seeking

Accident Risks
Matched Criteria
  • Lab-demonstrated catastrophic risk
  • Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Power-Seeking: already demonstrated in lab settings and relevant to current AI systems. Formal proofs exist (Turner 2021); empirically, o3 sabotaged shutdown in 79% of tests (Palisade 2025)."

Corrigibility Failure

Accident Risks
Matched Criteria
  • Lab-demonstrated catastrophic risk
  • Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: EASY
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Corrigibility Failure: already demonstrated in lab settings and relevant to current AI systems. o3 sabotaged shutdown in 79% of tests (Palisade 2025), and in 7% even with an explicit 'allow shutdown' instruction; Claude 3.7 showed 0% resistance."

Treacherous Turn

Accident Risks
Matched Criteria
  • Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: MEDIUM_TERM
Potential Insight
"Treacherous Turn: existential severity but very difficult to detect. Supported by theoretical reasoning plus proof-of-concept work: the sleeper agents study shows deception can persist, but an actual treacherous turn has not yet been observed."

Sharp Left Turn

Accident Risks
Matched Criteria
  • Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: SPECULATIVE
Timeline: MEDIUM_TERM
Potential Insight
"Sharp Left Turn: existential severity but very difficult to detect. A theoretical scenario with no direct evidence, though capability jumps offer some analogies."

Emergent Capabilities

Accident Risks
Matched Criteria
  • Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: OBSERVED_CURRENT
Timeline: CURRENT
Potential Insight
"Emergent Capabilities: relevant to current AI systems. Well-documented in scaling research (GPT-4, etc.); some capabilities appear suddenly at scale."