Why I’ve Been Paying Attention to Interpretability

In April of last year, Dario Amodei published an essay highlighting interpretability. I had been on my own version of that arc for more than a year by then. Recent conversations reminded me of the timeline.

The receipts

January 2024. My very first post on LLMs was my visceral reaction to their abilities: Why are they so good? I was inspired by the Bezos line that they are “discoveries, not inventions.” That unease was the root of everything that followed.
November 2024. I used RAG on two thousand pages of medical literature and saw the model produce analogies that were not in the source material. As a domain expert, I could verify they were actually good. The model was doing something very different and very real on the inside. Maybe it was ‘understanding’ in its own way. Around the same time I heard the 80,000 Hours podcast episode “What the hell is going on inside Neural Networks?”. It was a three-hour dive into something I had never thought or heard of before: Mechanistic Interpretability. I searched for and read the Distill.pub thread on circuits, and found everything by Chris Olah over the next few days. Perhaps that tuned my YouTube feed, because I kept finding more such material after that.
Mid 2025. I heard Josh Batson’s Stanford CS25 lecture on the “Biology of an LLM”, which described ‘growing’ models as organisms: data as nutrients, architecture as scaffold, growing towards the sun of a loss function. Josh is especially good with analogies! For example, in another talk he said that looking at the matrix values inside the model is like opening a black box and finding that there are billions of different lights in there… and having no idea which light or area does what. It remains a great visual.
Late 2025. I stumbled across Goodfire’s Paint With Ember demo. Steering a model by its internal features looked unmistakably like the neuroanatomy I had studied in medical school. I reached out and met their co-founder in December. I also came across Neel Nanda’s “Concrete Steps to Get Started in Mechanistic Interpretability” and it went on my wish list.
Early 2026. I went through the Blue Dot AGI Strategy course, and when the final assignment asked ‘where would you apply this?’, my mind immediately went to interpretability in high-stakes deployment scenarios.

What I think about now

Interpretability as a science feels tractable. Humans can reverse-engineer these systems, reconstruct the abstractions models have actually learned, and build defensive infrastructure on top of that understanding. The work is progressing faster than outside observers realize, and the field already has debates of its own (e.g. top-down interpretability).

The open question is timing: whether the discipline matures fast enough to meet the contexts where it will be needed. The way to improve the odds on timing is more specific and serious work by people whose backgrounds let them close particular gaps. I think of healthcare and life sciences as a specific deployment theater within the broader intervention, not the whole of it.