While playing around with induction heads in GPT2, I asked myself: if the input pattern an induction head responds to isn't present, what does it pay attention to? This seemed like a good question to investigate, and after a quick literature search I stumbled on the attention-sink paper and a number of other works that make fantastic attempts at answering it.
Guo et al., "Active-Dormant Attention Heads"
While the question has in some sense already been answered, I thought it would still be a good exercise to present the thought process I went through while attempting to answer it.
In the remainder of this post, I briefly motivate what an attention head does, explain induction heads and how to look for them (with visualizations), and show what happens when the induction heads' input isn't present.
Feel free to skip parts you're familiar with.
In summary, attention heads move information between tokens!
The residual stream is the main object in the transformer. One way I think of it: it represents what the model currently believes about all the tokens in its context, up to a particular layer. To enrich and further refine the representation of the tokens in the context, attention heads move information from earlier tokens in the context to later tokens in the context.
The input to the attention layer is the residual stream, of shape [batch_size, seq_len, d_model]. This input is linearly projected using three weight matrices: W_Q, W_K, and W_V, each of shape [d_model, d_model], to produce the query (Q), key (K), and value (V) matrices.
In multi-head attention, Q, K, and V are split into num_heads parts. Each head processes a subspace of the input, with Q and K shaped as [batch_size, num_heads, seq_len, d_k] and V as [batch_size, num_heads, seq_len, d_v], where d_k = d_v = d_model / num_heads.
For each head, attention scores are computed as the dot product of query and key vectors, scaled by 1/√d_k. The scores are passed through a softmax to obtain the attention pattern, which represents the importance of each token relative to others. This pattern is then multiplied by the value vectors to produce the head's output. The outputs of all heads are concatenated and projected using a weight matrix W_O of shape [d_model, d_model] to yield the final attention output, shaped [batch_size, seq_len, d_model].
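As a sanity check on the shapes above, here is a minimal sketch of multi-head attention in PyTorch (no causal mask, no biases; the function and variable names are my own, not GPT2's):

```python
import math
import torch

def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    """x: [batch_size, seq_len, d_model]; all weights: [d_model, d_model]."""
    batch_size, seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def split_heads(h):
        # [batch, seq, d_model] -> [batch, num_heads, seq, d_k]
        return h.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

    # scores, then softmax to get the attention pattern: [batch, heads, seq, seq]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    pattern = torch.softmax(scores, dim=-1)

    z = pattern @ V  # [batch, heads, seq, d_k]
    # concatenate heads back to [batch, seq, d_model] and project out
    z = z.transpose(1, 2).reshape(batch_size, seq_len, d_model)
    return z @ W_O
```

The output has the same shape as the input, which is what lets the attention layer's result be added back into the residual stream.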
From Vaswani et al.: \(W_i^Q\), \(W_i^K\), and \(W_i^V\) are head-specific projection matrices.
The next logical question is: how does each attention head, across all the layers, know what sort of information to pay attention to? During pre-training, the goal is to optimize the next-token objective with respect to the parameters of the model, over the language domain. It stands to reason that over the course of many steps of gradient descent, each attention head learns to pay attention to some pattern (semantic or syntactic) in the language data, and this pattern, once learned, contributes to lower loss.
And indeed, numerous papers have explored this assumption.
In both decoder-only and encoder-decoder transformers, attention heads have been discovered that specialize in attending to different parts of speech, as well as other linguistic properties such as direct objects of verbs, noun determiners, etc.
Interesting mechanisms that further enable LLMs to act autoregressively have been discovered, such as Copy Suppression heads.
A logical conclusion of the above is that what attention heads pay attention to is input-specific. This raises the question: what does an attention head pay attention to when its input isn't present?
I'll present a super-simplified explanation of induction heads here, but to better understand induction heads mechanistically, Callum McDougall wrote a quite interesting explainer blog post which I invite readers to check out.
Assume arbitrary tokens A and B, and a sequence in which A is followed by B and then some other arbitrary tokens. The next time the model sees A, i.e. [A B ... A], B turns out to be one of the most likely next tokens.
Anthropic researchers found this phenomenon in transformers as small as two layers. One of the conclusions is that the model has learnt to increase the logits on B if the last token in the sequence is A, and indeed, induction heads are theorized to be one of the mechanisms behind in-context learning.
For this to be true, there has to be a previous-token head. This ensures that the first occurrence of [B] pays attention to the first occurrence of [A], and the \(W_V\) matrix copies A into the subspace of B. Then, when A occurs in the context again, some head \(\hat{h}\) at the second occurrence of A pays attention to the first occurrence of B, sees that A is in the residual stream of B, copies B into the residual stream of the second occurrence of A, and increases its logits. This head \(\hat{h}\) is an induction head.
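The behaviour this circuit implements can be caricatured in a few lines of plain Python (a hypothetical helper for intuition only; the model does this with continuous vectors, not an explicit lookup):

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict whatever followed it."""
    last = tokens[-1]
    # scan backwards for an earlier occurrence of the last token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # copy the token that followed it
    return None  # no earlier occurrence: the rule has nothing to copy
```

For example, `induction_predict(["A", "B", "C", "A"])` returns `"B"`, mirroring the [A B ... A] → B behaviour described above.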
We sample N = 25 random tokens from the vocabulary of a transformer language model and duplicate them along the sequence axis. This becomes the input to the transformer.
The input to the transformer is a tensor of token ids of shape [1, 2 * N + 1], i.e. batch size is 1 and sequence length is 2 * N + 1 (a prepended beginning-of-sequence token followed by the two repeats); after embedding it has shape [1, 2 * N + 1, d_model].
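Concretely, the input can be constructed as follows (a sketch; `50256` is GPT2's end-of-text token, which also serves as its BOS token, and the function name is my own):

```python
import torch

def make_repeated_tokens(n_rand=25, vocab_size=50257, bos_token_id=50256, seed=0):
    """Return token ids of shape [1, 2 * n_rand + 1]: a BOS token followed
    by a random sequence repeated twice along the sequence axis."""
    g = torch.Generator().manual_seed(seed)
    rand_tokens = torch.randint(0, vocab_size, (1, n_rand), generator=g)
    bos = torch.full((1, 1), bos_token_id, dtype=torch.long)
    return torch.cat([bos, rand_tokens, rand_tokens], dim=1)
```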
We pass this sequence of repeated random tokens into GPT2 and cache the activations. This can be done easily by loading the model with TransformerLens and running:
```python
_, cache = model.run_with_cache(input_tokens)
```
Assume we have some head h at some layer l, the attention pattern is defined as, $$ \text{A}^{l, h} = \text{softmax}\left(\frac{Q_{l, h}K^T_{l, h}}{\sqrt{d_k}}\right) $$
We define the induction score for head h in layer l as a measure of how much attention a token in the second repeat (at position i + N) pays to the token immediately following its first occurrence (at position i + 1). It's represented as: $$ \text{IS}^{l, h} = \underset{i}{\text{mean}} \; A^{l, h}_{i + N,\; i + 1} $$
```python
def induction_head_detector(cache, cfg) -> list:
    induction_heads = []
    for layer_idx in range(cfg.n_layers):
        for head_idx in range(cfg.n_heads):
            # fetch the attention pattern for this layer and head;
            # cache["pattern", layer_idx] has shape [batch, n_heads, seq, seq]
            attn_pattern = cache["pattern", layer_idx][0, head_idx]
            rand_tok_seq_len = (attn_pattern.shape[1] - 1) // 2
            # compute the induction score: the mean of the diagonal at
            # offset -(N - 1), where destination i + N attends to source i + 1
            score = attn_pattern.diagonal(-rand_tok_seq_len + 1).mean()
            # filter with a threshold of 0.4
            if score.item() >= 0.4:
                induction_heads.append((layer_idx, head_idx))
    return induction_heads
```

Below is a visual map of the induction heads present in GPT2.
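To sanity-check the diagonal trick behind the score, we can build a toy attention pattern that attends perfectly along the induction stripe (every second-repeat position i + N attends entirely to position i + 1) and read off the same diagonal. The score comes out just under 1 because the diagonal also crosses a couple of positions outside the second repeat:

```python
import torch

N = 25
seq_len = 2 * N + 1  # BOS plus two repeats of N tokens
pattern = torch.zeros(seq_len, seq_len)

# perfect induction: destination i + N puts all attention on source i + 1,
# the token that followed its first occurrence
for i in range(1, N + 1):
    pattern[i + N, i + 1] = 1.0

# the same diagonal the detector reads: offset -(N - 1)
score = pattern.diagonal(-N + 1).mean()  # 25 of the 27 diagonal entries are 1.0
```

This comfortably clears the 0.4 threshold, while a head attending uniformly over context would score far below it.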
Below is an interactive visualization of the attention patterns for the induction heads identified above.
Load a tiny subset of the 10K Pile dataset. For the purpose of this experiment, I used batch = 1 and sequence_length = 128.
A forward pass is also run on this input, and the activations are cached as in the case above.
For the induction heads identified in the section above, we simply visualize their attention patterns.
Guo et al.
They showed further evidence of this phenomenon by confirming that the value vectors of these tokens are much smaller than those of other tokens.
This, however, isn't the only explanation for the first-token/special-token phenomenon observed in attention heads. Federico