Evaluation & Detection
Evaluation methods assess whether AI systems are aligned and safe to deploy.
General Evaluation:
- Evaluations (Evals)Safety AgendaAI EvaluationsEvaluations and red-teaming reduce detectable dangerous capabilities by 30-50x when combined with training interventions (o3 covert actions: 13% → 0.4%), but face fundamental limitations against so...Quality: 72/100: Overview of AI evaluation approaches
- Alignment EvaluationsApproachAlignment EvaluationsComprehensive review of alignment evaluation methods showing Apollo Research found 1-13% scheming rates across frontier models, while anti-scheming training reduced covert actions 30x (13%→0.4%) bu...Quality: 65/100: Testing for aligned behavior
- Dangerous Capability EvaluationsApproachDangerous Capability EvaluationsComprehensive synthesis showing dangerous capability evaluations are now standard practice (95%+ frontier models) but face critical limitations: AI capabilities double every 7 months while external...Quality: 64/100: Assessing harmful potential
Capability Assessment:
- Capability ElicitationApproachCapability ElicitationCapability elicitation—systematically discovering what AI models can actually do through scaffolding, prompting, and fine-tuning—reveals 2-10x performance gaps versus naive testing. METR finds AI a...Quality: 91/100: Uncovering hidden capabilities
- Red TeamingApproachRed TeamingRed teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% dependin...Quality: 65/100: Adversarial testing for vulnerabilities
- Model AuditingApproachThird-Party Model AuditingThird-party auditing organizations (METR, Apollo, UK/US AISIs) now evaluate all major frontier models pre-deployment, discovering that AI task horizons double every 7 months (GPT-5: 2h17m), 5/6 mod...Quality: 64/100: Systematic capability review
Deception Detection:
- Scheming DetectionApproachScheming & Deception DetectionReviews empirical evidence that frontier models (o1, Claude 3.5, Gemini 1.5) exhibit in-context scheming capabilities at rates of 0.3-13%, including disabling oversight and self-exfiltration attemp...Quality: 91/100: Identifying strategic deception
- Sleeper Agent DetectionApproachSleeper Agent DetectionComprehensive survey of sleeper agent detection methods finding current approaches achieve only 5-40% success rates despite $15-35M annual investment, with Anthropic's 2024 research showing backdoo...Quality: 66/100: Finding hidden malicious behaviors
Evaluation Scaling:
- Eval Saturation & The Evals GapApproachEval Saturation & The Evals GapAnalysis of accelerating AI evaluation saturation, showing benchmarks intended to last years are being saturated in months (MMLU ~4 years, MMLU-Pro ~18 months, HLE ~12 months). A 2022 Nature Commun...Quality: 65/100: Accelerating benchmark saturation and its implications
- Scalable Eval ApproachesApproachScalable Eval ApproachesSurvey of practical approaches for scaling AI evaluation. LLM-as-judge has reached ~40% production adoption with 80%+ human agreement, but Dorner et al. (ICLR 2025 oral) proved a theoretical ceilin...Quality: 65/100: Practical tools for scaling evaluation capacity
- Evaluation AwarenessApproachEvaluation AwarenessModels increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6 evaluation awareness so high that Apollo Research...Quality: 68/100: Models detecting and adapting to evaluation contexts
Deployment Decisions:
- Safety CasesApproachAI Safety CasesSafety cases are structured arguments adapted from nuclear/aviation to justify AI system safety, with UK AISI publishing templates in 2024 and 3 of 4 frontier labs committing to implementation. Apo...Quality: 91/100: Structured arguments for deployment safety