Over the years, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. But despite their capabilities, these models are still largely black boxes. Given an input, they accomplish the task, but we lack intuitive ways to understand how the task was actually accomplished.
LLMs are designed to predict the statistically best next word/token. But do they only focus on predicting the next token, or do they plan ahead? For instance, when we ask a model to write a poem, is it generating one word at a time, or is it anticipating rhyme patterns before outputting the word? Or take a basic reasoning question such as “What is the capital of the state where Dallas is located?” Models often produce output that looks like a chain of reasoning, but did the model actually use that reasoning? We lack visibility into the model’s internal thought process. To understand LLMs, we need to trace their underlying logic.
The study of LLMs’ internal computation falls under “Mechanistic Interpretability,” which aims to uncover the computational circuits of models. Anthropic is one of the leading AI companies working on interpretability. In March 2025, they published a paper titled “Circuit Tracing: Revealing Computational Graphs in Language Models,” which aims to tackle the problem of circuit tracing.
This post aims to explain the core ideas behind their work and build a foundation for understanding circuit tracing in LLMs.
What’s a circuit in LLMs?
Before we can define a “circuit” in language models, we first need to look inside the LLM. It is a neural network built on the transformer architecture, so it seems natural to treat neurons as the basic computational unit and interpret the patterns of their activations across layers as the model’s computational circuit.
However, the “Towards Monosemanticity” paper revealed that tracking neuron activations alone doesn’t provide a clear understanding of why those neurons are activated. This is because individual neurons are often polysemantic: they respond to a mixture of unrelated concepts.
The paper further showed that neurons are composed of more fundamental units called features, which capture more interpretable information. In fact, a neuron can be seen as a combination of features. So rather than tracing neuron activations, we aim to trace feature activations, the actual units of meaning driving the model’s outputs.
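To make the idea of a polysemantic neuron concrete, here is a minimal toy sketch in Python. The feature names, directions, and numbers are invented for illustration and do not come from either paper. It only shows how a layer with fewer neurons than features can end up with a neuron that responds to unrelated features.

```python
# Hypothetical toy: 3 features written into a layer with only 2 neurons.
# Names and numbers are made up for illustration.
FEATURE_DIRECTIONS = {
    "golden_gate": [1.0, 0.0],
    "dna":         [0.6, 0.8],
    "legal_text":  [0.0, 1.0],
}

def neuron_activations(feature_acts):
    """Sum each active feature's direction, scaled by its activation."""
    neurons = [0.0, 0.0]
    for name, act in feature_acts.items():
        for i, weight in enumerate(FEATURE_DIRECTIONS[name]):
            neurons[i] += act * weight
    return neurons

# Neuron 0 fires for two unrelated features, so looking at neuron 0 alone
# cannot tell us which concept is present.
only_bridge = neuron_activations({"golden_gate": 1.0})
only_dna = neuron_activations({"dna": 1.0})
```

Reading the neurons directly mixes the concepts together, while reading the feature activations keeps them apart.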
With that, we can define a circuit as a sequence of feature activations and connections used by the model to transform a given input into an output.
Now that we know what we’re looking for, let’s dive into the technical setup.
Technical Setup
We’ve established that we need to trace feature activations rather than neuron activations. To enable this, we need to convert the neurons of an existing LLM into features, i.e. build a replacement model that represents computations in terms of features.
Before diving into how this replacement model is built, let’s briefly review the architecture of Transformer-based large language models.
The following diagram illustrates how transformer-based language models operate. The input is split into tokens, and each token is mapped to an embedding vector. These token representations are passed to the attention block, which computes the relationships between tokens. Then, each token representation is passed to the multi-layer perceptron (MLP) block, which further refines it using linear transformations and a non-linear activation. This process is repeated across many layers before the model generates the final output.
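The flow described above can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not a faithful implementation: it assumes a single attention head with identity query/key/value projections, adds the standard residual connections, and omits layer normalization. All function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(xs):
    """Causal self-attention; identity projections are an assumption for brevity."""
    out = []
    for t, q in enumerate(xs):
        past = xs[: t + 1]  # each token only sees itself and earlier tokens
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in past]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, past)) for i in range(len(q))])
    return out

def mlp(x, W_in, W_out):
    hidden = [max(0.0, h) for h in matvec(W_in, x)]  # ReLU non-linearity
    return matvec(W_out, hidden)

def transformer_layer(xs, W_in, W_out):
    """One layer: attention mixes tokens, then the MLP refines each token."""
    attn = attention(xs)
    xs = [[a + b for a, b in zip(x, d)] for x, d in zip(xs, attn)]  # residual add
    return [[a + b for a, b in zip(x, mlp(x, W_in, W_out))] for x in xs]
```

A real model stacks many such layers and then maps the final token representation to a probability distribution over the next token.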
Now that we have laid out the structure of a transformer-based LLM, let’s look at what transcoders are. The authors used a “transcoder” to build the replacement model.
Transcoders
A transcoder is itself a neural network (usually with a much higher dimension than the LLM’s hidden dimension) designed to replace the MLP block in a transformer model with a more interpretable, functionally equivalent component built from features.

It processes tokens from the attention block in three stages: encoding, sparse activation, and decoding. Effectively, it projects the input into a higher-dimensional space, applies an activation that forces only a sparse set of features to be active, and then compresses the result back to the original dimension in the decoding stage.
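The three stages can be sketched as follows. This is a minimal illustration under stated assumptions: ReLU with negative encoder biases stands in for the sparse activation (the paper's exact activation function may differ), and the weights are hand-picked toy values rather than trained parameters.

```python
def transcoder(x, W_enc, b_enc, W_dec, b_dec):
    """Encode -> sparse activation -> decode. Returns (features, output)."""
    # 1. Encoding: project d_model inputs up to n_features pre-activations.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    # 2. Sparse activation: ReLU zeroes every feature that is not clearly active.
    features = [max(0.0, p) for p in pre]
    # 3. Decoding: project the features back down to d_model.
    output = [b + sum(w * f for w, f in zip(row, features)) for row, b in zip(W_dec, b_dec)]
    return features, output

# Toy parameters (assumed): d_model = 2, n_features = 4.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [-0.1, -0.1, -0.1, -0.1]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

features, output = transcoder([0.5, -0.3], W_enc, b_enc, W_dec, b_dec)
```

In this toy run only two of the four features are non-zero, which is the property that makes the intermediate representation easier to inspect than raw MLP neurons.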

With a basic understanding of transformer-based LLMs and transcoders, let’s look at how a transcoder is used to build a replacement model.
Construct a replacement model
As mentioned earlier, a transformer block typically consists of two main components: an attention block and an MLP block (feedforward network). To build a replacement model, the MLP block in the original transformer model is replaced with a transcoder. This integration is seamless because the transcoder is trained to mimic the output of the original MLP, while also exposing its internal computations through sparse and modular features.
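The two training goals just described (mimic the MLP output, keep the features sparse) are commonly combined into a single objective. The form below is a generic sketch, not the paper's exact loss: the L1 penalty is one common choice of sparsity term, and the symbols TC, a(x), and λ are notation introduced here for illustration.

```latex
\mathcal{L}(x) =
  \underbrace{\lVert \mathrm{MLP}(x) - \mathrm{TC}(x) \rVert_2^2}_{\text{mimic the MLP output}}
  + \lambda \,
  \underbrace{\lVert a(x) \rVert_1}_{\text{keep features sparse}},
\qquad
\mathrm{TC}(x) = W_{\mathrm{dec}}\, a(x) + b_{\mathrm{dec}}
```

Here a(x) denotes the feature activations produced by the encoder, and λ trades reconstruction accuracy against sparsity.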
While standard transcoders are trained to imitate the MLP behavior within a single transformer layer, the authors of the paper used a cross-layer transcoder (CLT), which captures the combined effects of multiple transcoder blocks across multiple layers. This is important because it allows us to track whether a feature is spread across multiple layers, which is needed for circuit tracing.
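The cross-layer idea can be sketched as follows: features read at one layer contribute to the reconstructed MLP output of that layer and of every later layer. This is a simplified illustration. It takes the feature activations as given and uses a hypothetical `decoders` dictionary keyed by (source layer, destination layer), which is a naming choice made here and not taken from the paper.

```python
def clt_mlp_outputs(features_by_layer, decoders):
    """Reconstruct each layer's MLP output from features of that layer and all earlier ones.

    features_by_layer[src] : list of feature activations read at layer src
    decoders[(src, dst)]   : d_model x n_features matrix, defined for src <= dst
    """
    n_layers = len(features_by_layer)
    d_model = len(next(iter(decoders.values())))
    outputs = []
    for dst in range(n_layers):
        y = [0.0] * d_model
        for src in range(dst + 1):  # every layer up to and including dst writes here
            W = decoders[(src, dst)]
            for i in range(d_model):
                y[i] += sum(w * f for w, f in zip(W[i], features_by_layer[src]))
        outputs.append(y)
    return outputs

# Toy setup (assumed): 2 layers, 1 feature per layer, d_model = 1.
toy_features = [[2.0], [3.0]]
toy_decoders = {(0, 0): [[1.0]], (0, 1): [[0.5]], (1, 1): [[1.0]]}
toy_outputs = clt_mlp_outputs(toy_features, toy_decoders)
```

In the toy run, the layer-0 feature contributes to both layers' outputs, while the layer-1 feature only contributes to layer 1.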
The image below illustrates how the cross-layer transcoder (CLT) setup is used in building a replacement model. The transcoder output at layer 1 contributes to constructing the MLP-equivalent output in all the upper layers until the end.

Side note: the following image is from the paper and shows how a replacement model is constructed. It replaces the neurons of the original model with features.

Now that we understand the architecture of the replacement model, let’s look at how an interpretable representation is built from the replacement model’s computational path.
Interpretable representation of the model’s computation: Attribution graph
To build the interpretable representation of the model’s computational path, we start from the model’s output feature and trace backward through the feature network to uncover which earlier features contributed to it. This is done using the backward Jacobian, which tells us how much a feature in an earlier layer contributed to the current feature activation, and is applied recursively until we reach the input. Each feature is treated as a node and each influence as an edge. This process can lead to a complex graph with millions of edges and nodes, so pruning is applied to keep the graph compact and manually interpretable.
The authors refer to this computational graph as an attribution graph and have also developed a tool to inspect it. This forms the core contribution of the paper.
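The backward tracing and pruning described above can be sketched as a small graph walk. This is a simplified illustration under stated assumptions: the Jacobian entries are supplied as a precomputed dictionary, each edge weight is taken as source activation times Jacobian entry, and pruning is a plain magnitude threshold (the paper's pruning procedure is more involved). Node names and numbers are invented.

```python
def build_attribution_graph(activations, jacobian, output, threshold):
    """Trace backward from `output`, keeping only edges with |weight| >= threshold.

    activations : node -> activation value
    jacobian    : (src, dst) -> linearised effect of src on dst
    """
    edges = {}
    visited = {output}
    frontier = [output]
    while frontier:
        dst = frontier.pop()
        for (src, target), grad in jacobian.items():
            if target != dst:
                continue
            weight = activations[src] * grad  # how much src contributed to dst
            if abs(weight) < threshold:
                continue  # prune weak influences to keep the graph readable
            edges[(src, dst)] = weight
            if src not in visited:
                visited.add(src)
                frontier.append(src)  # keep tracing back toward the input
    return edges

# Toy example (made-up values) loosely based on the Dallas question.
toy_acts = {"tok:Dallas": 1.0, "feat:Texas": 2.0, "feat:capital": 1.0, "out:Austin": 1.0}
toy_jac = {
    ("tok:Dallas", "feat:Texas"): 1.5,
    ("feat:Texas", "out:Austin"): 0.8,
    ("feat:capital", "out:Austin"): 0.01,
}
toy_graph = build_attribution_graph(toy_acts, toy_jac, "out:Austin", threshold=0.1)
```

In the toy run, the weak edge from `feat:capital` is pruned, leaving a two-edge path from the input token to the output.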
The image below illustrates a sample attribution graph.

Now, with all this understanding, we can move on to feature interpretability.
Feature interpretability using an attribution graph
The researchers used attribution graphs on Anthropic’s Claude 3.5 Haiku model to study how it behaves across different tasks. In the case of poem generation, they discovered that the model doesn’t just generate the next word. It engages in a form of planning, both forward and backward. Before generating a line, the model identifies several possible rhyming or semantically appropriate words to end with, then works backward to craft a line that naturally leads to that target. Surprisingly, the model appears to hold multiple candidate end words in mind simultaneously, and it can restructure the entire sentence based on which one it ultimately chooses.
This technique offers a mechanistic view of how language models generate structured, creative text, which is a significant milestone for the AI community. As we develop increasingly powerful models, the ability to trace and understand their internal planning and execution will be essential for ensuring alignment, safety, and trust in AI systems.
Limitations of the current approach
Attribution graphs offer a way to trace model behavior for a single input, but they don’t yet provide a reliable method for understanding global circuits, the consistent mechanisms a model uses across many examples. The analysis relies on replacing MLP computations with transcoders, but it is still unclear whether these transcoders truly reflect the original mechanisms or simply approximate the outputs. Additionally, the current approach highlights only active features, but inactive or inhibitory ones can be just as important for understanding the model’s behavior.
Conclusion
Circuit tracing via attribution graphs is an early but important step toward understanding how language models work internally. While this approach still has a long way to go, the introduction of circuit tracing marks a major milestone on the path to true interpretability.