Notes
Evening thoughts, unedited.
Sparse matrices
May 19, 2025
Reading the weights of neural connections yields dense matrices, with values filling nearly every entry. Pinpointing whic...
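A quick sketch of the contrast (numpy/scipy assumed; the 2.0 cutoff and matrix size are made up for illustration): pruning small-magnitude weights turns a fully dense matrix into a sparse one whose surviving connections are far easier to point at.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))               # dense: every entry carries a value

W_pruned = np.where(np.abs(W) > 2.0, W, 0.0)  # keep only large-magnitude weights
W_sparse = sparse.csr_matrix(W_pruned)        # store the few survivors sparsely

print(f"dense entries: {W.size}, nonzero after pruning: {W_sparse.nnz}")
```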
Linear probing
May 18, 2025
Another method for understanding concepts within token embeddings is linear probing, which involves analyzing th...
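A minimal sketch of a linear probe (sklearn assumed; the embeddings, labels, and 64-dimensional concept direction are synthetic stand-ins): fit a linear classifier on frozen embeddings and check whether a concept is linearly decodable from them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
concept_dir = rng.normal(size=64)        # hypothetical concept direction
X = rng.normal(size=(1000, 64))          # stand-in token embeddings
y = (X @ concept_dir > 0).astype(int)    # labels correlated with that direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # high => linearly decodable
```

The usual caveat applies: a probe can succeed even when the model itself never uses that direction, so high accuracy is evidence of decodability, not of use.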
Mech interp on token embeddings
May 17, 2025
Features are thought to be directions in the embedding space. The superposition problem states that a single directio...
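A toy numerical illustration of the crowding (numpy assumed; the sizes are arbitrary): with more candidate feature directions than embedding dimensions, the directions cannot all be orthogonal, so some features must overlap, or superpose, in the same directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 64                                  # 64 "features" in a 16-dim space
F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit feature directions

overlaps = F @ F.T - np.eye(n)                 # pairwise cosine similarities
print(f"max |cos| between distinct features: {np.abs(overlaps).max():.2f}")
```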
Activation functions
May 14, 2025
How do activation functions affect the degree of model interpretability? Activation functions provide models with hig...
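One concrete angle on that question (numpy assumed; the GELU here is the standard tanh approximation): ReLU produces exact zeros, so a neuron is cleanly on or off, while a smooth activation leaves almost everything slightly active, which muddies attribution.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=10_000)
print(f"ReLU exact zeros: {(relu(pre_acts) == 0).mean():.0%}")  # about half
print(f"GELU exact zeros: {(gelu(pre_acts) == 0).mean():.0%}")  # essentially none
```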
Example setup
May 13, 2025
To ground these concepts in practice, we can simulate a trivial example to understand the general flow. We...
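Something like the following is one guess at such a setup (numpy assumed; the shapes are arbitrary): a one-hidden-layer network small enough that every weight and activation can be inspected by eye.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # toy input
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

h = np.maximum(0.0, W1 @ x + b1)               # hidden activations (ReLU)
y = W2 @ h + b2                                 # output logits
print("hidden:", h)
print("output:", y)                             # small enough to read directly
```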
From neurons to circuits
May 11, 2025
Each neuron in a neural network carries some piece of information. The weights corresponding to each neuron dictate t...
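In the purely linear case that flow can even be read off directly (numpy assumed; a hypothetical two-layer linear network): multiplying the weight matrices gives the end-to-end "virtual" connection from each input to each output, though any nonlinearity in between breaks this shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # layer 1: 4 inputs -> 3 neurons
W2 = rng.normal(size=(2, 3))   # layer 2: 3 neurons -> 2 outputs

virtual = W2 @ W1              # entry [j, i]: how input i drives output j
print(virtual)
```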
The rise of mechanistic interpretability
May 9, 2025
I recently read a Twitter thread about the growing interest of pre-PhD students in mechanistic interpretability. A lo...
Circuits
May 8, 2025
Features are the atomic, meaningful units in neural networks. Circuits are the connections between them. If we unders...
Why mechanistic?
May 7, 2025
There are lots of ways to interpret models. Previously, most of the focus was on interpreting behaviors of models by ...
Superposition
May 6, 2025
Like many machine learning subfields, mechanistic interpretability is filled with unnecessary jargon (assigning new n...
What on Mars is a feature?
May 5, 2025
Wouldn’t it have been beautiful if each neuron encoded one and only one concept? If we want to remove a specific lear...
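In that imagined one-neuron-one-concept world, deleting a concept would be a one-line ablation (numpy assumed; the network and the choice of neuron 3 are made up for illustration):

```python
import numpy as np

def forward(x, W1, W2, ablate=None):
    h = np.maximum(0.0, W1 @ x)   # hidden activations
    if ablate is not None:
        h[ablate] = 0.0           # knock out one neuron = one concept, if monosemantic
    return W2 @ h

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
x = rng.normal(size=4)
print(forward(x, W1, W2))               # original output
print(forward(x, W1, W2, ablate=3))     # output with the concept removed
```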
What is mechanistic interpretability?
May 4, 2025
If not for deep learning, using computers to perform tasks would require carefully crafting each feature and the inte...