
Interpretability

Interpretability research aims to understand what AI systems are “thinking” and why they behave as they do.

Overview:

  • Interpretability: The field and its importance for safety

Mechanistic Approaches:

  • Mechanistic Interpretability: Reverse-engineering neural networks
  • Sparse Autoencoders: Learning interpretable features
  • Probing: Training simple classifiers to test whether specific knowledge or concepts are decodable from activations (a minimal sketch follows this list)
  • Circuit Breakers: Interrupting the internal representations linked to harmful outputs
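
As a concrete illustration of probing, the sketch below trains a linear classifier on frozen activations. The activations are synthetic stand-ins: in practice you would cache hidden states from one layer of a model for inputs labeled with the concept of interest. The dimensions, data, and choice of a scikit-learn logistic-regression probe are illustrative assumptions, not any particular paper's setup.

```python
# Minimal linear-probing sketch. Synthetic activations stand in for
# cached hidden states (e.g. one layer's residual stream) on inputs
# labeled with the concept being tested for.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Simulate a concept being linearly encoded: inputs with label 1 have
# their activations shifted along a fixed "concept direction".
labels = rng.integers(0, 2, size=n_examples)
concept_direction = rng.normal(size=d_model)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The probe: a simple linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy is evidence the concept is linearly decodable
# at this layer; it does not by itself show the model *uses* it.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```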

Representation-Based:

  • Representation Engineering: Controlling behavior by reading and editing a model's internal representations (see the sketch below)
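
A minimal sketch of the activation-steering flavor of representation engineering, using a toy PyTorch module in place of a transformer layer. The steering vector here is random purely for illustration; in practice it is usually derived from the model itself, for example as a difference of mean activations between contrasting prompt pairs. All names and values below are assumptions.

```python
# Activation-steering sketch: shift a layer's output along a chosen
# direction via a forward hook. A toy MLP stands in for a transformer
# layer; the steering vector is a random placeholder.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
layer = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                      nn.Linear(d_model, d_model))

steering_vector = torch.randn(d_model)  # placeholder for a derived direction
alpha = 4.0                             # steering strength, a tunable knob

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(steer)
x = torch.randn(1, d_model)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The hook moved the representation along the steering direction.
print((steered - unsteered).norm().item())
```

The appeal of this style of intervention is that it requires no weight updates: behavior is nudged at inference time by editing activations, which makes it cheap to apply and to remove.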