
Evaluation & Detection

Evaluation methods assess whether AI systems are aligned and safe to deploy.

General Evaluation:

  • Evaluations (Evals): Overview of AI evaluation approaches
  • Alignment Evaluations: Testing for aligned behavior
  • Dangerous Capability Evaluations: Assessing harmful potential

Capability Assessment:

  • Capability Elicitation: Uncovering hidden capabilities
  • Red Teaming: Adversarial testing for vulnerabilities
  • Model Auditing: Systematic capability review

Deception Detection:

  • Scheming Detection: Identifying strategic deception
  • Sleeper Agent Detection: Finding hidden malicious behaviors

Evaluation Scaling:

  • Eval Saturation & The Evals Gap: The accelerating saturation of benchmarks and its implications
  • Scalable Eval Approaches: Practical tools for scaling evaluation capacity
  • Evaluation Awareness: Models detecting and adapting to evaluation contexts

Deployment Decisions:

  • Safety Cases: Structured arguments for deployment safety