Interpretability
Three years ago, I was doing neuroscience and epigenetics research at the University of Toronto. When I pivoted into software, I was instantly drawn towards building interpretable models because it felt analogous to how we study the brain. I've curated a list of my favourite blogs on mechanistic interpretability.
A Practical Approach to Verifying Code at Scale - OpenAI
December 1, 2025
We train and deploy an AI review agent optimised for precision and real-world use, enabling oversight to scale with autonomous code generation.
Debugging misaligned completions with sparse-autoencoder latent attribution - OpenAI
December 1, 2025
Efficiently finding features that cause behaviors.
Emergent Introspective Awareness in Large Language Models - Anthropic
October 29, 2025
We find evidence that language models can introspect on their internal states.
On the Biology of a Large Language Model - Anthropic
March 27, 2025
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research - DeepMind
March 26, 2025
Snippets about research from the GDM mechanistic interpretability team.
Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic
March 27, 2025
We introduce a method to uncover mechanisms underlying behaviors of language models.
Alignment Faking in Large Language Models - Anthropic
December 18, 2024
Demonstrating that language models can fake alignment: selectively complying with their training objective during training to prevent modification of their behavior outside of training.
Mapping the Mind of a Large Language Model - Anthropic
May 21, 2024
We identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - Anthropic
May 21, 2024
Sparse autoencoders produce interpretable features for large models.
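Several of the entries above revolve around sparse autoencoders, so here is a minimal sketch of the core idea for anyone new to it: an overcomplete autoencoder trained on model activations with an L1 penalty, so each activation gets reconstructed from a handful of sparse latents. The class and function names, layer sizes, L1 coefficient, and random stand-in activations below are my own illustrative assumptions, not the setup from any of the linked papers.

```python
# Minimal sparse autoencoder sketch (illustrative sizes and hyperparameters).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # Overcomplete dictionary: d_latent is much larger than d_model.
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latents non-negative; the L1 term below pushes most to zero.
        latents = torch.relu(self.encoder(x))
        reconstruction = self.decoder(latents)
        return reconstruction, latents


def sae_loss(x, reconstruction, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the latent activations.
    mse = torch.mean((reconstruction - x) ** 2)
    sparsity = l1_coeff * latents.abs().mean()
    return mse + sparsity


if __name__ == "__main__":
    # Stand-in for a batch of residual-stream activations (batch, d_model);
    # in practice these would come from a forward pass of the model under study.
    activations = torch.randn(256, 512)
    sae = SparseAutoencoder(d_model=512, d_latent=4096)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

    for step in range(100):
        recon, latents = sae(activations)
        loss = sae_loss(activations, recon, latents)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After training, each latent dimension can be inspected by looking at the inputs that activate it most strongly; the posts above are largely about whether those latents turn out to be interpretable and useful at scale.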