
Interpretability

Three years ago, I was doing neuroscience and epigenetics research at the University of Toronto. When I pivoted into software, I was instantly drawn towards building interpretable models because it was analogous to how we study the brain. I've curated a list of my favourite blogs on mechanistic interpretability.

A Practical Approach to Verifying Code at Scale - OpenAI

December 1, 2025

We train and deploy an AI review agent optimised for precision and real-world use, enabling oversight to scale with autonomous code generation.

Debugging misaligned completions with sparse-autoencoder latent attribution - OpenAI

December 1, 2025

Efficiently finding features that cause behaviors.

Emergent Introspective Awareness in Large Language Models - Anthropic

October 29, 2025

We find evidence that language models can introspect on their internal states.

On the Biology of a Large Language Model - Anthropic

March 27, 2025

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research - DeepMind

March 26, 2025

Snippets about research from the GDM mechanistic interpretability team.

Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic

March 27, 2025

We introduce a method to uncover mechanisms underlying behaviors of language models.

Alignment Faking in Large Language Models - Anthropic

December 18, 2024

Demonstrating a large language model selectively complying with its training objective during training in order to avoid having its behavior modified.

Mapping the Mind of a Large Language Model - Anthropic

May 21, 2024

We identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - Anthropic

May 21, 2024

Sparse autoencoders produce interpretable features for large models.
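
Since sparse autoencoders come up in several of the entries above, here is a minimal sketch of the core idea for anyone new to it: an overcomplete encoder with ReLU latents, a linear decoder that reconstructs the original activation, and an L1 penalty that pushes most latents to zero. The dimensions, expansion factor, and penalty coefficient below are arbitrary choices of mine, not values taken from any of these posts.

```python
# Toy sparse autoencoder in the spirit of the SAE work linked above.
# Assumptions (mine): 512-dim activations, 8x expansion, ReLU latents,
# and a plain L1 sparsity penalty. Real setups differ in scale,
# initialization, and loss details.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, expansion: int = 8, l1_coeff: float = 1e-3):
        super().__init__()
        d_hidden = d_model * expansion
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # Encode into an overcomplete, non-negative latent code.
        latents = torch.relu(self.encoder(x))
        # Reconstruct the original activation from the sparse code.
        recon = self.decoder(latents)
        # Loss = reconstruction error + L1 penalty encouraging sparse latents.
        loss = ((recon - x) ** 2).mean() + self.l1_coeff * latents.abs().mean()
        return recon, latents, loss


if __name__ == "__main__":
    sae = SparseAutoencoder()
    acts = torch.randn(32, 512)  # stand-in for a batch of model activations
    recon, latents, loss = sae(acts)
    print(loss.item(), (latents > 0).float().mean().item())  # loss, fraction of active latents
```

The interpretability payoff comes after training: each latent dimension tends to fire on a narrower, more human-describable pattern than raw neurons do, which is what the monosemanticity and circuit-tracing posts build on.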