Why mechanistic?

There are many ways to interpret models. Previously, most of the focus was on explaining model behavior by attributing it to specific parts of the input, which was largely the case in vision interpretability research. Interpretability for language models has been rebranded as ‘mechanistic’ interpretability, but similar lines of work were already underway under different names (and almost entirely within academia). Read the paper ‘Mechanistic?’ by Saphra and Wiegreffe.

Neural networks are massive, entangled and hard to decode, so playing cause and effect with only model inputs and outputs seems like a much more feasible and productive pursuit. The problem is that the resulting explanations are often unfaithful, incomplete and do not allow for correcting learned behavior. An even bigger problem is that the explanations themselves need to be explained: if a saliency method attributes a behavior to a particular patch of an input image, what features within that patch actually trigger the attribution?
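To make the limitation concrete, here is a minimal sketch of input attribution via vanilla gradient saliency. The model choice and input shape are illustrative assumptions, not anything prescribed by the text; the point is only that the output is a *where* map, not a *what* explanation.

```python
# Sketch: gradient-based saliency for an image classifier.
# Assumes torchvision is available; resnet18 and the 224x224 input are
# illustrative stand-ins, not a specific method from the text.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

# `image` stands in for a preprocessed input of shape (1, 3, 224, 224).
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)
top_class = logits.argmax(dim=1).item()

# Gradient of the top logit with respect to the input pixels.
logits[0, top_class].backward()

# Per-pixel importance: max absolute gradient across color channels.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)

# The map highlights *where* the model was sensitive, but says nothing about
# *which features* inside that region (texture? shape? color?) drove the
# prediction -- the attribution itself still needs explaining.
```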