Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly — each neuron in a neural network performs simple arithmetic — but we don't understand why those mathematical operations result in the behaviors we see.
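To make "simple arithmetic" concrete, here is an illustrative sketch (not code from any actual model): a single neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinearity. The weights, bias, and input below are made up.

```python
import torch

# Illustrative only: one neuron's computation is a weighted sum
# plus a bias, passed through a simple nonlinearity.
w = torch.tensor([0.5, -1.2, 0.3])   # made-up weights
b = 0.1                              # made-up bias
x = torch.tensor([1.0, 0.5, -2.0])   # made-up input

pre_activation = torch.dot(w, x) + b      # multiply and add
activation = torch.relu(pre_activation)   # simple nonlinearity
print(float(activation))
```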
Those of us trying to understand artificial neural networks have a remarkable degree of access: we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
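As a minimal sketch of what that access looks like in practice, the toy model below uses PyTorch forward hooks to record every activation in a layer and then to silence one neuron; the model and the choice of hooks are illustrative assumptions, not a description of any particular research tooling.

```python
import torch
import torch.nn as nn

# A tiny two-layer network standing in for any model under study.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

recorded = {}

def record_hook(module, inputs, output):
    # Record the activation of every neuron in this layer.
    recorded["acts"] = output.detach().clone()

def silence_hook(module, inputs, output):
    # Intervene: silence neuron 3 by zeroing its activation.
    output = output.clone()
    output[:, 3] = 0.0
    return output  # returning a tensor replaces the layer's output

x = torch.randn(1, 8)  # any input we want to test

handle = model[1].register_forward_hook(record_hook)
baseline = model(x)
handle.remove()

handle = model[1].register_forward_hook(silence_hook)
ablated = model(x)
handle.remove()

print(recorded["acts"])     # full view of the layer's neurons
print(baseline - ablated)   # causal effect of the intervention
```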
In our latest paper, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations.
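To illustrate what "patterns (linear combinations) of neuron activations" means: a feature corresponds to a direction in a layer's activation space, and its activation is the dot product of the neuron activations with that direction. The vectors below are invented for illustration; real feature directions are found by the machinery described in the paper, not written by hand.

```python
import torch

# Made-up activations of a 16-neuron layer on one input.
neuron_acts = torch.randn(16)

# A feature is a direction in activation space: one coefficient
# per neuron. This particular direction is purely illustrative.
feature_direction = torch.zeros(16)
feature_direction[2] = 0.6
feature_direction[7] = -0.3
feature_direction[11] = 0.9

# The feature's activation is the linear combination (dot product)
# of the neuron activations along that direction.
feature_activation = torch.dot(neuron_acts, feature_direction)
print(float(feature_activation))
```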