Superposition

Like many machine learning subfields, mechanistic interpretability is filled with unnecessary jargon (new names for things that already have names). Superposition is the idea that features cannot align with individual basis vectors because the model needs to represent more features than it has neurons. This leads to polysemanticity: a single neuron fires for multiple unrelated input concepts. Both phenomena arise because the loss function is designed to improve task performance and includes no interpretability regularization, so the model is free to organize its representations around how many features it can compress into the available parameters. Since this prevents us from interpreting model behavior by analyzing individual neurons, it becomes important to understand the geometric structure of features in high-dimensional space. A minimal sketch of this setup follows.
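The sketch below is a toy illustration of superposition, loosely in the spirit of toy-model experiments: a model is asked to reconstruct more sparse features than it has hidden dimensions, so the learned feature directions are forced to overlap. All names, dimensions, and hyperparameters here are illustrative assumptions, not a reference implementation.

```python
# Minimal superposition sketch: more sparse features than hidden dimensions.
# Hyperparameters and names are illustrative assumptions.
import torch

torch.manual_seed(0)

n_features, n_hidden = 20, 5     # more features than "neurons"
sparsity = 0.95                  # probability that a feature is inactive

W = torch.randn(n_hidden, n_features, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    # Sparse synthetic features in [0, 1]
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity)

    # Compress into n_hidden dimensions, then reconstruct
    h = x @ W.T                  # (batch, n_hidden)
    x_hat = torch.relu(h @ W + b)

    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the directions assigned to each feature. With
# n_features > n_hidden they cannot all be orthogonal; the off-diagonal
# entries of W^T W measure the interference between features, i.e.
# superposition.
with torch.no_grad():
    overlaps = (W.T @ W).fill_diagonal_(0)
    print("max |feature overlap|:", overlaps.abs().max().item())
```

If the inputs are made dense (sparsity near 0), the model tends to keep only as many features as it has hidden dimensions; with highly sparse inputs it packs extra features into overlapping, non-orthogonal directions, which is the geometric picture of superposition described above.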