Notes

Evening thoughts, unedited.

Sparse matrices

May 19, 2025

Reading weights from neural connections results in dense matrices, with values filling nearly every entry. Pinpointing whic...
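
A quick numpy sketch of what "dense" means here, using a random matrix as a stand-in for real trained weights (the 1e-3 "near zero" threshold is an arbitrary choice of mine):

import numpy as np

# Stand-in for weights read out of a trained layer; real weights are similarly dense.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))

# Almost nothing is "near zero" -- nearly every entry carries some value.
near_zero = np.abs(W) < 1e-3
print(f"fraction near zero: {near_zero.mean():.4f}")  # ~0.0008 for Gaussian entries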


Linear probing

May 18, 2025

Another method for understanding concepts within token embeddings is linear probing. This involves analyzing th...
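
A minimal sketch of a linear probe, with synthetic embeddings and labels standing in for real model activations and concept annotations:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: embeddings X and a binary concept label y. In practice X
# would come from a frozen model and y from human or automatic annotations.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
X = rng.normal(size=(1000, 64))
y = (X @ direction > 0).astype(int)  # concept is linearly encoded by construction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy suggests the concept is linearly decodable
# from the embeddings; near-chance accuracy suggests it isn't.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")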


Mech interp on token embeddings

May 17, 2025

Features are thought to be directions in the embedding space. The superposition hypothesis states that a single directio...
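
A tiny sketch of the "features are directions" idea: if a feature is a direction, the dot product with an embedding measures how strongly that embedding expresses it. The embeddings and the direction here are random stand-ins:

import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 64))           # stand-in token embeddings
feature_dir = rng.normal(size=64)
feature_dir /= np.linalg.norm(feature_dir)

# Projection onto the (unit) feature direction: one activation score per token.
scores = emb @ feature_dir
print(scores)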


Activation functions

May 14, 2025

How do activation functions affect the degree of model interpretability? Activation functions provide models with hig...
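
One concrete angle on this, as a sketch: ReLU zeroes out negative pre-activations, so "which neurons fired" is a crisp question, while a smoother activation like GELU leaves small nonzero values everywhere. Random pre-activations stand in for real ones:

import numpy as np

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=1000)  # stand-in pre-activations

relu = np.maximum(pre_acts, 0.0)
# tanh approximation of GELU
gelu = 0.5 * pre_acts * (1 + np.tanh(np.sqrt(2 / np.pi) * (pre_acts + 0.044715 * pre_acts**3)))

# ReLU drives roughly half the units to exactly zero; GELU essentially never does.
print(f"ReLU exact zeros: {(relu == 0).mean():.2f}")
print(f"GELU exact zeros: {(gelu == 0).mean():.2f}")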


Example setup

May 13, 2025

To put these concepts into practice, we can simulate a trivial example to understand the general flow. We...
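
The note's own example is cut off above, but one possible trivial setup looks like this: a tiny two-layer network on synthetic inputs, just to have concrete weights and activations to point at. All sizes here are made up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 samples, 8 input features
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 2))

h = np.maximum(x @ W1, 0.0)        # hidden activations (ReLU)
logits = h @ W2
print(logits.shape)                # (4, 2)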


From neurons to circuits

May 11, 2025

Each neuron in a neural network carries some piece of information. The weights corresponding to each neuron dictate t...
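
A sketch of one way to read this off the weights: ignoring the nonlinearity, multiplying weight matrices gives the effective path from inputs to outputs, and a column of the next layer's weights shows where a single neuron's information is routed. Random weights stand in for trained ones:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # layer 1: 8 inputs -> 16 neurons
W2 = rng.normal(size=(4, 16))   # layer 2: 16 neurons -> 4 outputs

# W2 @ W1 is the effective linear path from each input to each output;
# column j of W2 shows how neuron j feeds the outputs.
effective = W2 @ W1
print(effective.shape)  # (4, 8)
print(W2[:, 3])         # how hidden neuron 3 feeds each of the 4 outputs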


The rise of mechanistic interpretability

May 9, 2025

I recently read a Twitter thread about the growing interest of pre-PhD students in mechanistic interpretability. A lo...


Circuits

May 8, 2025

Features are the atomic, meaningful units in neural networks. Circuits are the connections between them. If we unders...


Why mechanistic?

May 7, 2025

There are lots of ways to interpret models. Previously, most of the focus was on interpreting behaviors of models by ...


Superposition

May 6, 2025

Like many machine learning subfields, mechanistic interpretability is filled with unnecessary jargon (assigning new n...
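
Jargon aside, the underlying idea is easy to demo: in high dimensions, far more random directions than dimensions can coexist with only modest pairwise overlap. A sketch with made-up sizes:

import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                    # 512 "features" packed into a 64-dim space
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
np.fill_diagonal(cos, 0.0)
# Typical overlap between random unit vectors scales like 1/sqrt(d), so many
# more features than dimensions can share the space with modest interference.
print(f"mean |cos|: {np.abs(cos).mean():.3f}, max |cos|: {np.abs(cos).max():.3f}")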


What on Mars is a feature?

May 5, 2025

Wouldn’t it have been beautiful if each neuron encoded one and only one concept? If we want to remove a specific lear...
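
If that dream were true, removing a concept would be a one-liner, as in this sketch (random activations, and neuron 7 picked arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))  # stand-in hidden activations for 4 inputs

# Under the one-neuron-one-concept dream, deleting a concept is just
# zeroing its neuron.
h_ablated = h.copy()
h_ablated[:, 7] = 0.0
print(h_ablated[:, 7])  # all zeros; in reality this would damage many concepts at once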


What is mechanistic interpretability?

May 4, 2025

If not for deep learning, using computers to perform tasks would require carefully crafting each feature and the inte...