Interpretability

Interpretability research aims to understand what AI systems are “thinking” and why they behave as they do.

Overview:

Interpretability: The field and its importance for safety

Mechanistic Approaches:

Mechanistic Interpretability: Reverse-engineering neural networks
Sparse Autoencoders: Learning interpretable features
Probing: Testing for specific knowledge or concepts
Circuit Breakers: Identifying and modifying specific circuits

Representation-Based:

Representation Engineering: Controlling behavior via internal representations