Notes
Evening thoughts, unedited.
Sparse matrices
May 19, 2025
Reading the weights of neural connections yields dense matrices, with values filling nearly every entry. Pinpointing whic...
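A quick sketch of the contrast (numpy/scipy assumed; the 2.0 cutoff and matrix size are made up for illustration): pruning small-magnitude weights turns a fully dense matrix into a sparse one whose surviving connections are far easier to point at.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))               # dense: every entry carries a value

W_pruned = np.where(np.abs(W) > 2.0, W, 0.0)  # keep only large-magnitude weights
W_sparse = sparse.csr_matrix(W_pruned)        # store the few survivors sparsely

print(f"dense entries: {W.size}, nonzero after pruning: {W_sparse.nnz}")
```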
Linear probing
May 18, 2025
Another method for understanding concepts within token embeddings is linear probing, which involves analyzing th...
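A minimal sketch of a linear probe (sklearn assumed; the embeddings, labels, and 64-dimensional concept direction are synthetic stand-ins): fit a linear classifier on frozen embeddings and check whether a concept is linearly decodable from them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
concept_dir = rng.normal(size=64)        # hypothetical concept direction
X = rng.normal(size=(1000, 64))          # stand-in token embeddings
y = (X @ concept_dir > 0).astype(int)    # labels correlated with that direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # high => linearly decodable
```

The usual caveat applies: a probe can succeed even when the model itself never uses that direction, so high accuracy is evidence of decodability, not of use.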
Mech interp on token embeddings
May 17, 2025
Features are thought to be directions in the embedding space. The superposition problem states that a single directio...
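A toy numerical illustration of the crowding (numpy assumed; the sizes are arbitrary): with more candidate feature directions than embedding dimensions, the directions cannot all be orthogonal, so some features must overlap, or superpose, in the same directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 64                                  # 64 "features" in a 16-dim space
F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit feature directions

overlaps = F @ F.T - np.eye(n)                 # pairwise cosine similarities
print(f"max |cos| between distinct features: {np.abs(overlaps).max():.2f}")
```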
Activation functions
May 14, 2025
How do activation functions affect the degree of model interpretability? Activation functions provide models with hig...
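One concrete angle on that question (numpy assumed; the GELU here is the standard tanh approximation): ReLU produces exact zeros, so a neuron is cleanly on or off, while a smooth activation leaves almost everything slightly active, which muddies attribution.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=10_000)
print(f"ReLU exact zeros: {(relu(pre_acts) == 0).mean():.0%}")  # about half
print(f"GELU exact zeros: {(gelu(pre_acts) == 0).mean():.0%}")  # essentially none
```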
Example setup
May 13, 2025
To ground these concepts in practice, we can simulate a trivial example to understand the general flow. We...
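Something like the following is one guess at such a setup (numpy assumed; the shapes are arbitrary): a one-hidden-layer network small enough that every weight and activation can be inspected by eye.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # toy input
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

h = np.maximum(0.0, W1 @ x + b1)               # hidden activations (ReLU)
y = W2 @ h + b2                                 # output logits
print("hidden:", h)
print("output:", y)                             # small enough to read directly
```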
From neurons to circuits
May 11, 2025
Each neuron in a neural network carries some piece of information. The weights corresponding to each neuron dictate t...
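In the purely linear case that flow can even be read off directly (numpy assumed; a hypothetical two-layer linear network): multiplying the weight matrices gives the end-to-end "virtual" connection from each input to each output, though any nonlinearity in between breaks this shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # layer 1: 4 inputs -> 3 neurons
W2 = rng.normal(size=(2, 3))   # layer 2: 3 neurons -> 2 outputs

virtual = W2 @ W1              # entry [j, i]: how input i drives output j
print(virtual)
```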
The rise of mechanistic interpretability
May 9, 2025
I recently read a Twitter thread about the growing interest of pre-PhD students in mechanistic interpretability. A lo...
Circuits
May 8, 2025
Features are the atomic, meaningful units in neural networks. Circuits are the connections between them. If we unders...
Why mechanistic?
May 7, 2025
There are lots of ways to interpret models. Previously, most of the focus was on interpreting behaviors of models by ...
Superposition
May 6, 2025
Like many machine learning subfields, mechanistic interpretability is filled with unnecessary jargon (assigning new n...
What on Mars is a feature?
May 5, 2025
Wouldn’t it have been beautiful if each neuron encoded one and only one concept? If we want to remove a specific lear...
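In that imagined one-neuron-one-concept world, deleting a concept would be a one-line ablation (numpy assumed; the network and the choice of neuron 3 are made up for illustration):

```python
import numpy as np

def forward(x, W1, W2, ablate=None):
    h = np.maximum(0.0, W1 @ x)   # hidden activations
    if ablate is not None:
        h[ablate] = 0.0           # knock out one neuron = one concept, if monosemantic
    return W2 @ h

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
x = rng.normal(size=4)
print(forward(x, W1, W2))               # original output
print(forward(x, W1, W2, ablate=3))     # output with the concept removed
```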
What is mechanistic interpretability?
May 4, 2025
If not for deep learning, using computers to perform tasks would require carefully crafting each feature and the inte...