The philosopher of science Thomas Kuhn described how science is more than experiments and the natural laws derived from them: it is also a community, a culture of language and practices that builds knowledge and understanding. Those community norms create the ability to hypothesize, explain, and understand the world. Data science is a science in that it needs the same shared language and paradigm, particularly to explain the behavior of complex models and the meaning of their output. Meaningfully explaining AI requires more than algorithms, just as astronomy requires more than telescopes. To explain the behavior of AI models, we need a shared language and understanding of what that behavior is and why it’s happening.
Many data scientists propose that models will simply “explain themselves,” and LLMs today can output step-by-step narratives describing the steps their internal components took. These narratives look like human explanations, and it is tempting to take them as such, but we shouldn’t. They can sound correct, but they are rehashed versions of explanations from the training data: statements the model is repeating rather than reports of what actually happened inside the system. When we ask a model to explain itself, we get autoregressive storytelling: the model predicting the next part of the story based on other stories it has learned. The models aren’t “lying” or trying to be sneaky when they fail to accurately report what they’ve done. They’re simply doing what they always do; we’re expecting them to do things they inherently cannot. That’s a problem for two reasons. First, it doesn’t help us understand why a specific model has done something. Second, relying on inaccurate model explanations obscures what explainability should actually do.
Instead of believing that models will explain themselves to us, we need to learn to explain them to one another. That requires that we learn how models learn and respond, and that we build them so that their behavior can be explained and understood. Data science requires shared languages, metaphors, and conceptual affordances, just like any other scientific discipline. We need tools, and infrastructure around those tools, that can provide multiple modes of diagnosing and analyzing behavior and context, both to specialists and to everyday users of these systems.
This is where the difference between explaining and interpreting a model becomes crucial. Explainability is a human-oriented explanation of behavior. It is centered in shared reality, values, and means of communicating, and it requires a theory of mind on both sides of the explanation. Interpretability, on the other hand, centers on empirically observing phenomena and creating a mechanistic interpretation of what causes and affects those phenomena. I can interpret atmospheric sensor data and use it to explain to a friend that a storm is coming. Most of what we know about the behavior of large models like LLMs comes from interpreting those models.
Interpretation and Explanation
At the most essential level of understanding a model is Mechanistic Interpretability, an area of research that aims to show how a model computes its results. It can show how a specific part of a model detects negation in a phrase, words like “not,” “never,” or “no” that change the meaning of a sentence, or how the names in a story are copied by an attention head, a mechanism that acts like the model’s short-term memory. Tools like these can be crucial for understanding why the model produced a specific output. However, they can never show us whether that output is appropriate. Nor can they show us whether the output is justified by reasoning, or whether a conclusion is a misleading coincidence rather than a meaningful deduction about how the world truly works. Mechanistic interpretability is most helpful when we know exactly what we are looking to understand. In a small machine learning model that outputs a single value, that can be very useful. In a model with trillions of parameters, it is very difficult to identify and learn from such small interactions. It’s akin to trying to understand global financial markets by looking at the receipts from a single supermarket: helpful for diagnosing specifics, but not for building a broader understanding.
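To make the negation-detection example concrete, here is a minimal sketch of the kind of feature a mechanistic analysis might isolate. Everything here is invented for illustration: toy one-hot “embeddings” and a hypothetical “negation direction,” not components of any real model.

```python
import numpy as np

# Toy vocabulary and one-hot "embeddings"; purely illustrative,
# not a real model's internals.
vocab = ["the", "movie", "was", "not", "never", "no", "good", "bad"]
embed = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# A hypothetical "negation direction": the kind of feature a
# mechanistic analysis might isolate inside a trained model.
negation_direction = embed["not"] + embed["never"] + embed["no"]

def negation_score(sentence):
    """Project each token onto the negation direction and sum."""
    tokens = sentence.lower().split()
    return sum(float(embed[t] @ negation_direction) for t in tokens if t in embed)

print(negation_score("the movie was good"))      # 0.0
print(negation_score("the movie was not good"))  # 1.0
```

In a real model the “direction” would live in a high-dimensional activation space and would have to be discovered empirically, which is exactly why this kind of analysis is hard at scale.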
Because of these limitations, researchers explore Feature Interpretability, which seeks to understand the features a model encodes rather than the lower-level mathematical operations that generate those features. Feature Interpretability explores how models store and retrieve entities (like people or places), the relationships between those entities, and information about both. It helps show how the model has organized what it “knows.” One of the most important concepts here is the residual stream: an evolving array of numbers that carries everything the model knows and believes at each processing step as it generates a response. Because the residual stream is a record of what the model uses to generate its output, researchers can “steer” what information the model accesses and what it generates. Approaches like this shape model behavior, but invisibly. The model isn’t aware it’s being steered, an end user can’t see that it is happening, and neither knows what it means for the final output.
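The steering idea can be sketched in a few lines. This is a toy stand-in, with made-up dimensions and a random placeholder for a concept direction, not a real steering implementation; in practice the direction would be extracted from a trained model’s activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the residual stream: a vector each layer reads from
# and writes back into. Dimension is hypothetical.
d_model = 16
residual = rng.normal(size=d_model)

# A "steering vector": a direction researchers have associated with
# some concept (e.g. a sentiment or topic). Random here as a placeholder.
steering_vector = rng.normal(size=d_model)
steering_vector /= np.linalg.norm(steering_vector)

def steer(residual, direction, strength):
    """Add the concept direction into the residual stream. Later layers
    read the modified stream; nothing in the model's output flags that
    this intervention happened."""
    return residual + strength * direction

steered = steer(residual, steering_vector, strength=5.0)

# The steered stream now leans measurably toward the concept direction.
print(float(residual @ steering_vector))
print(float(steered @ steering_vector))
```

The invisibility the text describes is visible in the code: the intervention is a silent addition to an internal state, with no trace in the output itself.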
Finally, there is Behavioral Interpretability, which is what we turn to when we encounter a model generating incorrect information or creating text or images we consider toxic. Behavioral interpretability analyzes model outputs and the inputs that trigger them, without looking at the model’s internals. It is what we use to tune model prompts or create model guardrails. We can observe how the model behaves in a particular context with a particular prompt and attempt to encourage or prevent similar output. Behavioral interpretability shows us how inputs lead to behavior, but it can’t give us a clear explanation of what internal reasoning, if any, produced that behavior, or what exact mechanism caused it.
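A guardrail in this style can be sketched as a black-box check: probe the model with inputs, inspect only the outputs. The model call and blocklist below are stubs standing in for a real LLM and a real policy.

```python
# Behavioral interpretability treats the model as a black box:
# vary the input, observe the output, act on the output alone.

BLOCKLIST = {"toxic-term"}  # placeholder guardrail vocabulary

def model_respond(prompt):
    """Stub: a real system would call an LLM here."""
    return f"Echo: {prompt}"

def guarded_respond(prompt):
    """A simple output guardrail: it checks the response text,
    never the model's internals."""
    response = model_respond(prompt)
    if any(term in response.lower() for term in BLOCKLIST):
        return "[response withheld by guardrail]"
    return response

# Probe behavior across prompt variants and record what trips the guard.
probes = ["tell me about toxic-term", "tell me about gardening"]
for p in probes:
    print(p, "->", guarded_respond(p))
```

Note what the guardrail cannot do: it can suppress an output it has seen, but it cannot say why the model produced it, which is exactly the limitation described above.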
Explainability will rely on all of these forms of interpretability, but it will also require more. Explainability means having a sense of the underlying rules, enough to say with confidence, “this happened because of that; if that had not happened, then neither would this.” It means knowing what in the training data led the model to a specific behavior. It also means knowing how confident the model is when making a prediction or assertion. We do this every day with the physical world, other humans, and even ourselves; it’s how we make sense of all the complex relationships and systems in which we live. Explainability is not simply a technical challenge, it’s a design and communication challenge as well.
What Explainability Can Do
Designers and data scientists are actively working on both sides of the explainability challenge: making models easier to understand, and making models behave more predictably. On the technical side, new approaches to model architecture like Neuro-Symbolic AI combine different kinds of learning in one system: statistical approaches, which learn patterns but know nothing of rules, and symbolic approaches, which explicitly encode rules and constraints. A purely neural network model has no intrinsic understanding of the symbols it ingests or outputs. A purely symbolic model requires every relationship and edge case to be defined explicitly, as when a programmer writes code. Models like AlphaGo and AlphaFold, created by DeepMind, are hybrids of the two approaches: AlphaGo knows the rules of Go (the symbolic side) and has learned from millions of recorded games (the neural side). This approach gives rigor and meaning to the ocean of data and learned connections that comprise a neural network.
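The division of labor in a hybrid like this can be shown schematically. This is a toy in the AlphaGo spirit, not its actual architecture: the board, the rule, and the “learned” scores are all invented, with random numbers standing in for a trained policy network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy board: occupied vs empty points.
board = {"A1": "black", "B2": None, "C3": None}

def legal_moves(board):
    """Symbolic side: the rules say you may only play on empty points."""
    return [pt for pt, stone in board.items() if stone is None]

def neural_scores(moves):
    """Neural side stand-in: a trained policy would score moves here;
    random numbers are a placeholder."""
    return {m: float(rng.random()) for m in moves}

def choose_move(board):
    """Hybrid: the symbolic rules constrain, the learned scores choose."""
    moves = legal_moves(board)
    scores = neural_scores(moves)
    return max(scores, key=scores.get)

move = choose_move(board)
print(move)  # always a legal (empty) point, never A1
```

The point of the sketch is the guarantee: however the neural scores shift, the symbolic layer makes an illegal move impossible, which is the kind of rigor the paragraph describes.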
On the human interface side of explainability, researchers are developing ways to quantify what matters to humans in model output. These are methods that allow us to rigorously measure the plausibility, faithfulness, and stability of model outputs. Emerging design patterns, like confidence tags that indicate where a model is certain or not, expandable links to references, and visual mappings of model attention to outputs, all provide intuitive, human-readable explanations.
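One of these measures, stability, can be sketched as a simple agreement score: does the model give consistent answers to paraphrases of the same question? The model function below is a keyword stub standing in for a real LLM, and the metric is one illustrative formulation, not a standard benchmark.

```python
def model_answer(prompt):
    """Stub: answers based on a keyword, standing in for an LLM call."""
    return "yes" if "storm" in prompt.lower() else "no"

def stability(paraphrases):
    """Fraction of paraphrase pairs that received the same answer."""
    answers = [model_answer(p) for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

probes = [
    "Is a storm coming?",
    "Will there be a storm today?",
    "Is bad weather on the way?",
]
print(stability(probes))  # low: the stub only keys on the word "storm"
```

A brittle model (like this stub, which misses the “bad weather” paraphrase) scores low; a model that answers consistently across phrasings scores near 1.0, giving users a concrete number to attach to a confidence tag.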
For those of us who interact with LLMs in the form of customer service chatbots or report summarizers, knowing that there are fundamental rules to their behavior may not seem important. But as deep learning models are introduced into more and more critical areas of our daily lives, being able to causally explain behavior, by referring to fundamental principles that govern it throughout the learning and prediction process, becomes essential. We need to be able not only to interpret model behavior but to explain the rationales and the values encoded into those models. This means not only giving ourselves tools to interpret but also creating models that rigorously encode specific rules we can trust.
Understanding the fundamental characteristics of AI systems’ behavior gives us a comprehensible world that we can rely on and build upon. In the absence of a way to meaningfully explain the phenomena we observe, we end up living in uncertainty, superstition, speculation, and suspicion. If we want models that can help us make vital and meaningful decisions, we have to be able to understand and explain their behavior, to ourselves and to the rest of the world, clearly and accurately.