<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://israel-adewuyi.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://israel-adewuyi.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-05-05T01:52:53+00:00</updated><id>https://israel-adewuyi.github.io/feed.xml</id><title type="html">blank</title><subtitle>A collection of research notes. </subtitle><entry><title type="html">Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs</title><link href="https://israel-adewuyi.github.io/blog/2025/slim-peft/" rel="alternate" type="text/html" title="Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs"/><published>2025-12-13T00:00:00+00:00</published><updated>2025-12-13T00:00:00+00:00</updated><id>https://israel-adewuyi.github.io/blog/2025/slim-peft</id><content type="html" xml:base="https://israel-adewuyi.github.io/blog/2025/slim-peft/"><![CDATA[<d-contents> <nav class="l-text figcaption"> <h3>Contents</h3> <div><a href="#intro">Introduction</a></div> <nav class="sub-nav"> <div><a href="#tldr">tl,dr</a></div> </nav> <div><a href="#background">Background </a></div> <nav class="sub-nav"> <div><a href="#notation">Notation</a></div> <div><a href="#setup">General Experimental Setup</a></div> <div> <a href="#sparse_subnet">Extracting Sparse Subnetworks</a> </div> </nav> <div><a href="#fisher_mask">Fisher Mask Works</a></div> <nav class="sub-nav"> <div><a href="#mask_training">Training with mask</a></div> <div><a href="#fisher_result">Results</a></div> </nav> <div> <a href="#rand_mask">The Surprise: Random Masks Also Work</a> </div> <nav class="sub-nav"> <div> <a href="#generate_random_masks">Generating Random Masks</a> </div> <div> <a href="#surprising_results">Surprising Results</a> </div> <div> <a href="#lr_puzzle">The learning rate puzzle</a> </div> <div> <a href="#why_diff_lr">Why Different Learning Rates?</a> </div> <div> <a href="#diff_masks_same_params">Do Different Masks Select the Same Parameters?</a> </div> <div><a href="#implications">Implications:</a></div> </nav> <div><a href="#caveat_and_qs">Caveats and Questions</a></div> <div><a href="#acknowledgements">Acknowledgements</a></div> <div><a href="#citation">Citation</a></div> </nav> </d-contents> <h2 id="intro">Introduction</h2> <p> Reinforcement learning fine-tuning is a new axis of scale for increased performance of Large Language Models (LLMs), with labs scaling compute for RL to levels on par with pretraining. Recent works have also attempted to shed light on the how and why RL really works <d-cite key="shao2025spuriousrewardsrethinkingtraining, wen2025reinforcementlearningverifiablerewards, yue2025doesreinforcementlearningreally, mukherjee2025reinforcementlearningfinetunessmall, zhu2025pathtakenrlvrprovably"></d-cite>. </p> <p> Important to this report, Mukherjee et al. (2025) <d-cite key="mukherjee2025reinforcementlearningfinetunessmall"></d-cite> showed that RLVR finetunes a sparse subnetwork in LLMs, as little as 5-30% of parameters. 
With the goal of efficiency in mind, we ask the question, <strong>if most parameters don't change during training, can we identify which ones matter <em>before</em> training begins, and train only those?</strong> In our attempts to answer this question, we expected that somewhat involved methods, like the Fisher Information matrix <d-cite key="Kirkpatrick_2017"></d-cite>, would be necessary to identify the "special" parameters that matter for learning. <strong>We were wrong.</strong> </p> <p> In this report, we present preliminary findings showing that <strong>random parameter selection</strong> can match full fine-tuning performance when training only ~1% of parameters. This suggests pretrained models may contain not just one winning ticket but potentially many, and we are calling this the <strong>Multiple Ticket Hypothesis</strong>. </p> <p> This report details ongoing, small-scale work, and the main reason for sharing is that we think the preliminary findings warrant discussion and are interesting enough to be shared with the wider community. <d-footnote>An auxiliary reason is to solicit compute resources to scale the experiments up.</d-footnote> </p> <h3 id="tldr">tl,dr of results</h3> <ul> <li> Random parameter selection at 99% sparsity can match full parameter fine-tuning performance. This suggests pretrained models contain multiple viable subnetworks (the "Multiple Ticket Hypothesis"). </li> <li> Fisher Information masks also work, validating parameter importance identification methods, but surprisingly offer no clear advantage over random selection. </li> <li> Different mask types require different optimal learning rates. </li> </ul> <h2 id="background">Background</h2> <h3 id="notation">Notation</h3> <p style="color: var(--text-color)"> Let $\theta$ denote the parameters of an LLM. We use $\theta^{(t)}$ to represent the model parameters at training step $t$, with $\theta^{(0)}$ denoting the initial pretrained model weights and $\theta_i$ denoting the $i$-th parameter. <br/> During an RLVR run, the gradients at step $t$, $g^{(t)}$, are computed via backpropagation, $$g^{(t)} = \nabla_\theta J_{\text{GRPO}}(\theta^{(t)})$$ </p> <h3 id="setup">General Experimental Setup</h3> <p> In this report, all experiments are carried out on <a href="https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct">Qwen2.5-0.5B-Instruct</a>. We trained via GRPO on <a href="https://x.com/kalomaze">Kalomaze</a>'s <a href="https://app.primeintellect.ai/dashboard/environments/kalomaze/alphabet-sort">Alphabetsort environment</a>. We use the AdamW optimizer for all RLVR runs. This work was built on <a href="https://www.primeintellect.ai/">Prime-Intellect</a>'s <a href="https://github.com/PrimeIntellect-ai/prime-rl">RL training library</a>. </p> <p> <strong>Evaluation:</strong> For evaluation, we use the same Alphabetsort environment, selecting 512 samples with the seed fixed to 2001. </p> <h3 id="sparse_subnet">Extracting sparse subnetworks</h3> <p> Our initial intuition:<br/> <strong>Imagine a pretrained LLM with only 2 parameters, p1 and p2. If only one parameter is changed at the end of a training phase with some optimization function $\phi$, say p1, it must mean that p1 is more important than p2 for satisfying $\phi$ on the training set. </strong> <br/><br/> The question now is, how do we identify which parameters are most important for some training data D? </p> <h2 id="fisher_mask">Fisher Mask Works</h2> <p> To identify which parameters are most important for a given task, we follow the approach laid out by Kirkpatrick et
al. <d-cite key="Kirkpatrick_2017"></d-cite>. The authors estimated the importance of some weights to a task by approximating the Fisher information matrix of the model parameters. </p> <p> We approximate the Fisher matrix, $F$, <d-footnote>A justification for this is provided in the paper, but to reiterate, the core intuition here is that the magnitude of $F_i$ is correlated with how important parameter $\theta_i$ is to the task represented by D.</d-footnote> on a large batch of data for all the parameters of the model. $$F_i \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \log p(x_n|\theta^{(0)})}{\partial \theta_i} \right)^2$$ where $x_n \sim \text{dataset D}$ <br/> In practice, we sample a large batch of data, run a forward pass and a backward pass, and set $$F_i = \theta_i.\text{grad}^2$$ </p> <p> We can then take the top <code>x%</code> of parameters in $F$, set these to <code>True</code> and all else to <code>False</code>, thereby creating a binary mask $\text{MASK} \in \{0, 1\}^N$ over all parameters. </p> <h3 id="mask_training">Training with mask</h3> <p> During training with a mask, we modify the gradient update step to only affect the masked parameters: $$\tilde{g}^{(t)} = g^{(t)} \odot \text{MASK}$$ $$\theta^{(t+1)} = \theta^{(t)} - \eta_t \cdot \mathcal{U}(\tilde{g}^{(t)}, \theta^{(t)})$$ where $\odot$ denotes element-wise multiplication, $\eta_t$ is the learning rate at step $t$, and $\mathcal{U}$ represents the optimizer's update rule (e.g., AdamW). This ensures that only the selected subnetwork is updated while the full model is still used for forward passes. </p> <p> In practice, and for efficiency gains, we only store optimizer states for the subnetwork. </p> <h3 id="fisher_result">Results</h3> <p> We approximate $F$ using a batch of 1024 samples. We then created two masks, one at 99% sparsity, i.e. 4,940,328 / 494M parameters, and another at 99.9% sparsity, i.e. 494,032 / 494M parameters. We compare the eval results, as well as the training dynamics, in <a href="#fft_vs_fisher">Figure 1</a>. </p> <p> We use a learning rate of $10^{-6}$ for the full finetuning run, $5 \cdot 10^{-6}$ for the 99% Fisher mask run and $10^{-5}$ for the 99.9% Fisher mask run. </p> <figure id="fft_vs_fisher"> <img src="/assets/img/rl_subnet_1/fft_fisher.png" alt="train and eval dynamics"/> <figcaption> Figure 1: Metrics comparison between a full finetune run and sparse training runs </figcaption> </figure> <p> This confirms our initial hypothesis that parameter-importance identification (with the Fisher information matrix) might indeed be a way to pick out subnetworks that allow us to get comparable levels of performance to full finetuning. </p> <h2 id="rand_mask">The Surprise: Random Masks Also Work</h2> <p> Having validated the initial intuition, we wanted to establish a baseline for comparison and investigated random parameter selection. We generated random masks at 99% sparsity by uniformly sampling parameters to update. </p> <h3 id="generate_random_masks">Generating Random Masks</h3> <p> The implementation is pretty straightforward. We seed a random number generator and select <code>(100 - x)%</code> of parameters uniformly at random, to achieve <code>x%</code> sparsity. </p> <p> We used three different seeds, <code>0</code>, <code>2001</code> and <code>42</code>, to get different masks and ran an RL run with each. The results in <a href="#fft_vs_random">Figure 2</a> are at learning rates of $10^{-4}$, $5 \cdot 10^{-5}$ and $5 \cdot 10^{-5}$ respectively. </p> <pre>
import numpy as np
import torch

keep_ratio = 0.01  # keep 1% of parameters, i.e. 99% sparsity
rng = np.random.default_rng(seed=42)
mask_dict, active = {}, 0

for name, param in model.state_dict().items():
    # boolean mask of the same shape as the parameter, all False initially
    temp_tensor = torch.zeros_like(param, dtype=torch.bool)
    num_to_generate = int(param.numel() * keep_ratio)
    # sample flat indices uniformly at random, without replacement
    indices = rng.choice(param.numel(), size=num_to_generate, replace=False)
    temp_tensor.view(-1)[torch.from_numpy(indices)] = True
    active += num_to_generate
    mask_dict[name] = temp_tensor
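
# Sketch (not from the original training code): once a mask has been built,
# the masked-update rule from the "Training with mask" section can be applied
# by zeroing the gradients of all unselected parameters before the optimizer
# step. `model` and `optimizer` are assumed to already exist; the post itself
# additionally stores optimizer state for the subnetwork only.
def masked_step(model, optimizer, mask_dict):
    for name, param in model.named_parameters():
        if param.grad is not None:
            # elementwise g * MASK: unselected parameters get zero gradient
            param.grad.mul_(mask_dict[name].to(param.grad.dtype))
    optimizer.step()
    optimizer.zero_grad()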
        </pre> <h3 id="surprising_results">Surprising Results</h3> <p> <a href="#fft_vs_random">Figure 2</a> surprisingly shows that random parameter selection can match full fine-tuning performance. This finding challenges our initial assumption that some sophisticated parameter identification method would be necessary. </p> <figure id="fft_vs_random"> <img src="/assets/img/rl_subnet_1/fft_vs_random.png" alt="comparison of fft and random mask runs"/> <figcaption> Figure 2: Comparison of full fine-tuning (FFT) and random mask training. With appropriate learning rate tuning, random masks match or exceed full fine-tuning performance. </figcaption> </figure> <h3 id="lr_puzzle">The learning rate puzzle</h3> <p> The key to making random masks and even the fisher mask work is finding the right learning rate. We swept over multiple learning rates for the random masks at 99% sparsity <d-footnote>Some of the runs were cancelled and therefore aren't present because right from the start, the reward and eval curve do not improve and it felt wasteful to continue to burn through compute for results we already could intuit.</d-footnote> to better understand this relationship and presents our findings in Figures <a href="#hyperparam1">3</a> and <a href="#hyperparam2">4.</a> </p> <p> Random masks perform best at higher lr, compared to full finetuning (and fisher masks). This isn't dissimilar to <a href="https://thinkingmachines.ai/blog/lora/#optimal-learning-rates-for-lora-vs-fullft">Thinkymachine's work on lora</a>. </p> <figure id="hyperparam1"> <img src="/assets/img/rl_subnet_1/eval_scores_step_150.png" alt="Hyperparameter sweep on learning rate for the random mask at 150"/> <figcaption> Figure 3: Hyperparameter sweep on learning rate for the random mask at Step 150 </figcaption> </figure> <figure id="hyperparam2"> <img src="/assets/img/rl_subnet_1/eval_scores_step_300.png" alt="Hyperparameter sweep on learning rate for the random mask at 300"/> <figcaption> Figure 4: Hyperparameter sweep on learning rate for the random mask at Step 300 </figcaption> </figure> <h3 id="why_diff_lr">Why Different Learning Rates?</h3> <p> We hypothesize that this learning rate difference paints interesting pictures about the objective we are optimizing for and the training dynamics, with respect to the parameters of the model. Some of our hypotheses are: </p> <ul> <li> <strong>Fisher masks identify parameters already near optima:</strong> The Fisher Information Matrix identifies parameters with high curvature which could be interpreted to be that those parameters are sensitive to changes. These parameters may already be close to their optimal values for the task, requiring only small adjustments (hence lower learning rates). </li> <li> <strong>Random masks require more exploration or wiggling around:</strong> Random parameters are likely further from their optimal values on average, requiring larger updates to find good solutions (hence higher learning rates). </li> <li> <strong>Different regions of the loss landscape:</strong> Fisher masks may operate in high-curvature regions where large steps cause instability, while random masks may, on average, be in a region that appears flat and large steps are relatively safer. </li> </ul> <h3 id="diff_masks_same_params"> Do Different Masks Select the Same Parameters? </h3> <p> A natural question: are the random masks accidentally selecting the same parameters that Fisher masks identify? 
To answer this, we compute the Jaccard overlap between different masks, defined as $$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$ </p> <figure id="jaccard"> <img src="/assets/img/rl_subnet_1/jaccard.png" alt="jaccard sim" style="transform: scale(0.7)"/> <figcaption> Figure 5: Jaccard overlap between the Random masks at 99% sparsity and the Fisher mask. </figcaption> </figure> <p> The Jaccard overlap between the random masks and the Fisher mask, as shown in <a href="#jaccard">Fig. 5</a>, is low, about 0.5% on average. This means that the random masks and the Fisher mask select almost completely different parameters, yet achieve comparable performance to full fine-tuning. </p> <h3 id="implications">Implications: The Multiple Ticket Hypothesis</h3> <p> These results suggest that, at least for the Alphabet-sort task, LLMs appear to contain multiple viable sparse subnetworks that can be optimized for the task, not just one. </p> <p> The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) <d-cite key="frankle2019lotterytickethypothesisfinding"></d-cite> proposed that dense networks contain sparse subnetworks that can be trained to match the full network's performance. Frankle and Carbin used iterative magnitude pruning to identify a single winning ticket. </p> <p> Our findings extend the original LTH to the MTH: <strong>For sufficiently over-parameterized pretrained models, there may not be just <em>one</em> winning ticket, but potentially <em>many</em> winning tickets — so many that even random selection is likely to find one</strong>, i.e. <em>You can just <s>do things</s> select random parameters and train.</em> </p> <p> This explains why Fisher Information masks offer no clear advantage over random selection: both methods (random masks and Fisher masks) simply need to select <em>some</em> viable subnetwork, and with appropriate hyperparameter tuning, both succeed. </p> <h2 id="caveat_and_qs">Caveats and Questions</h2> <p> These are preliminary results on a small model (Qwen2.5-0.5B) and a simple task (alphabet-sort). More questions and ideas to investigate reveal themselves: </p> <ul> <li> Does this phenomenon hold for larger models and different (harder) tasks like math, code gen, logical reasoning (or any task that we might want to make the model good at with RLVR)? <d-footnote>We suspect it would. It should also be straightforward to investigate this and it would be pleasantly surprising if it doesn't hold.</d-footnote> </li> <li> It would seem logical that the parameters being repurposed for task A under random mask training might be close to optimal for another task B. How does random mask training affect RLVR's ability to reduce catastrophic forgetting (compared to SFT) on some other task the model has previously been trained on? <d-footnote>We're inclined to think that it would in some non-trivial way lead to poorer performance on some other previously trained-on task B, but what sort of task B? </d-footnote> </li> <li> Some experiments (not recorded here) at even more extreme sparsity levels like 99.9% and 99.95% do not match full fine-tuning performance. At 99.9%, the maximum performance across steps was about 46%, and it was even lower for 99.95%. What's the threshold for the number of parameters here? <d-footnote>It's not obvious that more params will equal more performance, for the simple reason that using all the params is capped at some level. But at what threshold do we start getting comparable performances?</d-footnote> </li> <li> On training dynamics, we also observe different convergence rates.
From preliminary experiments, higher learning rates converge faster at smaller sparsities, but they also become unstable more frequently than other runs. <d-footnote>The question to investigate here isn't clearly formulated yet, but it was still an interesting observation nonetheless.</d-footnote> </li> <li> Do random masks transfer across tasks, or are they task-specific? <d-footnote>We don't think they are task-specific, but we also don't think that the same mask would behave the same across different training tasks.</d-footnote> </li> <li> Why exactly do different masks require different optimal learning rates? How do we reason about this in relation to optimization theory specifically? </li> </ul> <p> Answering these questions requires significantly more compute than we currently have access to. If you're interested in collaborating, mentoring or sponsoring compute, please reach out! </p> <h2 id="acknowledgements">Acknowledgements</h2> <ol> <li> I am super grateful to <a href="https://x.com/dayveed_d">Daniel</a> and <a href="https://x.com/andreascoclet1">Andreas</a> for sponsoring compute for the initial experiments as well as asking really insightful questions. </li> <li> PrimeIntellect also cooked with the prime-rl library. It was pleasant to hack around with. </li> </ol> <h2 id="citation">Citation</h2> <p>If you find this work useful, please cite:</p> <pre><code>@misc{adewuyi2025lottery,
  author = {Adewuyi, Israel},
  title = {Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs},
  year = {2025},
  month = {December},
  url = {https://israel-adewuyi.github.io/blog/2025/slim-peft/},
  note = {Blog post}
}</code></pre>]]></content><author><name></name></author><summary type="html"><![CDATA[Preliminary evidence that random parameter selection can match full parameter RL finetuning.]]></summary></entry><entry><title type="html">Attention sink</title><link href="https://israel-adewuyi.github.io/blog/2025/attn_sink_evidence/" rel="alternate" type="text/html" title="Attention sink"/><published>2025-07-23T00:00:00+00:00</published><updated>2025-07-23T00:00:00+00:00</updated><id>https://israel-adewuyi.github.io/blog/2025/attn_sink_evidence</id><content type="html" xml:base="https://israel-adewuyi.github.io/blog/2025/attn_sink_evidence/"><![CDATA[<d-contents> <nav class="l-text figcaption"> <h3>Contents</h3> <div><a href="#background">Intro</a></div> <div> <a href="#attn_doing">What is an attention head doing?</a> </div> <nav class="sub-nav"> <div><a href="#attn_intuition">Intuitive explanation</a></div> <div><a href="#attn_concrete">More concretely</a></div> <div><a href="#attn_further_q">Further questions</a></div> </nav> <div><a href="#induction_head">Induction Heads </a></div> <div> <a href="#induction_head_gpt2">Identifying induction heads in GPT2</a> </div> <nav class="sub-nav"> <div><a href="#induction_input">Inputs</a></div> <div><a href="#induction_metric">Metric</a></div> <div> <a href="#induction_identify">Identifying induction heads</a> </div> <div><a href="#induction_result">Results</a></div> </nav> <div> <a href="#what_if">If the induction input isn't present?</a> </div> <nav class="sub-nav"> <div><a href="#normal_input">Inputs</a></div> <div> <a href="#normal_res">Results</a> </div> </nav> <div><a href="#closing">Closing thoughts</a></div> </nav> </d-contents> <h2 id="background">Intro</h2> <p> I read the induction heads paper <d-cite key="olsson2022context"></d-cite> a while back, while taking the ARENA course. The paper lays out a super interesting mechanistic study for in-context learning and specifically examines induction head in transformer language models. </p> <p> While playing around with induction heads in GPT2, I thought to myself that "What if the input to induction heads isn't present, what do the induction heads pay attention to?" I thought this might be a good question to investigate and after a quick literature search, I stumbled on the attention sink paper and a bunch of other works that made fantastic attempts at answering the question. </p> <p> Guo, et al., in "Active-Dormant Attention Heads" <d-cite key="guo2024activedormantattentionheadsmechanistically"></d-cite> investigated the same question but from a different angle. They trained a 3L GPT2-style transformer on bigram backcopy task and then investigated which heads were heavily involved in the backcopy task. Then they showed this heads were dormant when the bigram backcopy input isn't present. </p> <p> While the question was sort of answered already, I thought it would still be a good exercise present the thought process I went through while attempting to answer the question. </p> <p> In this remainder of this post, I briefly motivate what an attention head is doing, explain induction heads and how to look for them (with visualizations) and show what happens when the induction heads input isn't present. 
</p> <p>Feel free to skip parts you're familiar with.</p> <h2 id="attn_doing">What is an attention head doing ?</h2> <h3 id="attn_intuition">Intuitive overview</h3> <p>In summary, attention heads move information between tokens!</p> <figure id="resid_stream_view"> <img src="/assets/img/resid_stream_with_attn_mlp.png" alt="A simplified view of the transformer" width="600"/> <figcaption> Fig. 1: A simplified view of the transformer. Source: <a href="https://transformer-circuits.pub/2021/framework/index.html">A mathematical framework for transformer circuits.</a> </figcaption> </figure> <p> The residual stream is the main object in the transformer. A way I think of it is that it represents what the model currently thinks about all the tokens in it's context, up to a particular layer. To enrich and further refine the representation of the tokens in the context, attention heads move information from earlier tokens in the context to later tokens in the context <d-cite key="elhage2021mathematical"></d-cite> and MLP blocks compose information and perform retrieval tasks <d-cite key="Geva2020TransformerFL"></d-cite>. </p> <h3 id="attn_concrete">More concretely</h3> <p> The input to the attention layer is the residual stream <d-footnote>In the first layer, this is the sum of token embeddings and positional embeddings.</d-footnote> with shape <code class="language-plaintext">[batch_size, seq_len, d_model]</code>. This input is linearly projected using three weight matrices: <code class="language-plaintext">W_Q</code>, <code class="language-plaintext">W_K</code>, and <code class="language-plaintext">W_V</code>, each of shape <code class="language-plaintext">[d_model, d_model]</code>, to produce the query (<code class="language-plaintext">Q</code>), key (<code class="language-plaintext">K</code>), and value (<code class="language-plaintext">V</code>) matrices. </p> <p> In multi-head attention, <code class="language-plaintext">Q</code>, <code class="language-plaintext">K</code>, and <code class="language-plaintext">V</code> are split into <code class="language-plaintext">num_heads</code> parts. Each head processes a subspace of the input, with <code class="language-plaintext">Q</code> and <code class="language-plaintext">K</code> shaped as <code class="language-plaintext">[batch_size, num_heads, seq_len, d_k]</code> and <code class="language-plaintext">V</code> as <code class="language-plaintext">[batch_size, num_heads, seq_len, d_v]</code>, where <code class="language-plaintext">d_k = d_v = d_model / num_heads</code>. </p> <p> For each head, <strong>attention scores</strong> are computed as the dot product of query and key vectors, scaled by <code class="language-plaintext">1/√d_k</code>. The scores are passed through a softmax to obtain the <strong>attention pattern</strong>, which represents the importance of each token relative to others. This pattern is then multiplied by the value vectors to produce the head's output. The outputs of all heads are concatenated and projected using a weight matrix <code class="language-plaintext">W_O</code> of shape <code class="language-plaintext">[d_model, d_model]</code> to yield the final attention output, shaped <code class="language-plaintext">[batch_size, seq_len, d_model]</code>. </p> <p> From Vaswani et al. 
<d-cite key="attentionneed"></d-cite>, the attention mechanism is defined as: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ Multi-head attention is expressed as: $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O $$ $$ \text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$ Here, <code class="language-plaintext">W_i^Q</code>, <code class="language-plaintext">W_i^K</code>, and <code class="language-plaintext">W_i^V</code> are head-specific projection matrices. </p> <h3 id="attn_further_q">Further questions</h3> <p> The next logical question is, how does each attention head across all the layers know what sort of information to pay attention to? During pre-training, the goal is optimizing the next-token objective w.r.t the parameters of the model, over the language domain. It stands to reason that over the course of multiple steps of gradient descent, each attention head learns to pay attention to some pattern (semantic or syntactic) in the language data and this pattern, when learned, contributes to lower loss. </p> <p>And indeed, numerous papers have explored this assumption.</p> <p> In both decoder-only and encoder-decocder transformers, attention heads have been discovered that specialize in attending to different parts of speech, as well as other lingustic propertites such direct objects of verbs, noun determiners, e.t.c. <d-cite key="vig-belinkov-2019-analyzing, clark-etal-2019-bert"></d-cite> </p> <p> Interesting mechanisms that further enable LLMs to act autoregressively have been discovered, such as Copy Supression heads <d-cite key="mcdougall2023copysuppressioncomprehensivelyunderstanding"></d-cite> and Induction heads <d-cite key="elhage2021mathematical, olsson2022context"></d-cite> </p> <p> A logical conclusion of the above paragraphs is that what attention heads pay attention to is input-specific. This begs the question : What does an attention head pay attention to, when it's input isn't present? </p> <h2 id="induction_head">Induction heads</h2> <p> I'll present a super simplified explanation of Induction heads here, but to better understand Induction heads mechanistically, Callum McDougall wrote <a href="https://www.lesswrong.com/posts/TvrfY4c9eaGLeyDkE/induction-heads-illustrated">a quite interesting explainer blog</a> which I invite readers to check out. The paper <d-cite key="olsson2022context"></d-cite> also goes into a lot more details that I only mention slightly such as the presence of previous-token heads and the role of the QK/OV circuit. </p> <p> Assume arbitrary tokens <code class="language-plaintext">A, B</code>. Then assume a sequence of tokens with <code class="language-plaintext">A</code> followed by <code class="language-plaintext">B</code> and then some other arbitrary tokens. The next time the model sees <code class="language-plaintext">A</code>, i.e <code class="language-plaintext">[A B ... A]</code>, B turns out to be one of the highly likely next tokens. </p> <p> Anthropic researchers found these phenomenon in as little as 2L transformer. One of the conclusions is that the model has learnt to increase the logits on <code class="language-plaintext">B</code> if the last token in the sequence is <code class="language-plaintext">A</code> and indeed, it's theorized that Induction heads is one of the mechanisms behind In-context learning. </p> <p> For this to be true, there has to be a previous-token head. 
This ensures that the first occurrence of <code class="language-plaintext">[B]</code> pays attention to the first occurrence of <code class="language-plaintext">[A]</code> and the \(W_V\) matrix copies <code class="language-plaintext">A</code> to the subspace of <code class="language-plaintext">B</code>. Then when <code class="language-plaintext">A</code> occurs in the context again, for some head \(\hat{h}\), the second occurrence of <code class="language-plaintext">A</code> pays attention to the first occurrence of <code class="language-plaintext">B</code>, sees that <code class="language-plaintext">A</code> is in the residual stream of <code class="language-plaintext">B</code> and then copies <code class="language-plaintext">B</code> to the residual stream of the second occurrence of <code class="language-plaintext">A</code> and increases it's logits. This new head \(\hat{h}\) is an induction head. </p> <h2 id="induction_head_gpt2">Identifying Induction heads in GPT2</h2> <h3 id="induction_input">Inputs</h3> <p> We sample <code class="language-plaintext">N = 25</code> random tokens from the vocabulary of a transformer language model and duplicate it along it's axis. This becomes the input to the transformer. </p> <p> Input to the transformer is a matrix of shape <code class="language-plaintext">[1, 2 * N + 1, d_model]</code> i.e batch is 1, sequence length is 2 * N + 1 <d-footnote>+ 1 because we append the bos token to the sequence</d-footnote> and d_model = embedding dimension of the transformer language model. </p> <p> Pass this sequence of randomly repeated tokens into GPT2 and cache the activations. This can be done easily by loading the model with <a href="https://transformerlensorg.github.io/TransformerLens/">transformer lens</a> and running <br/> <code class="language-plaintext">_, cache = model.run_with_cache(input_tokens)</code> </p> <h3 id="induction_metric">Metric</h3> <p> Assume we have some head <code class="language-plaintext">h</code> at some layer <code class="language-plaintext">l</code>, the attention pattern is defined as, $$ \text{A}^{l, h} = \text{softmax}\left(\frac{Q_{l, h}K^T_{l, h}}{\sqrt{d_k}}\right) $$ </p> <p> We define <strong>induction score</strong> for head <code class="language-plaintext">h</code> in layer <code class="language-plaintext">l</code> as a measurement of how much attention a token in the second repeat (at position <code class="language-plaintext">i + N</code>) pays to its corresponding token in the first repeat (at position <code class="language-plaintext">i</code>). It's represented as: </p> $$I(l, h) = \frac{1}{N} \sum_{i = 1} ^N A^{l, h} [i + N, i] $$ <h3 id="induction_identify">Identifying induction heads</h3> Retrieve the attention pattern from the cache and for each head in each layer, calculate the induction score as defined above. <pre>
def induction_head_detector(cache, cfg) -> list:
    induction_heads = []
    for layer_idx in range(cfg.n_layers):
        for head_idx in range(cfg.n_heads):
            # fetch the attention pattern for this head: [dest_pos, src_pos]
            attn_pattern = cache["pattern", layer_idx][0, head_idx]
            # N, the length of the random token sequence (input is [bos, N tokens, N tokens])
            rand_tok_seq_len = (attn_pattern.shape[1] - 1) // 2
            # induction score: average attention from each token in the second
            # repeat back to the token that followed its first occurrence
            score = attn_pattern.diagonal(-rand_tok_seq_len + 1).mean()
            # filter with a threshold of 0.4
            if score.item() >= 0.4:
                induction_heads.append((layer_idx, head_idx))
    return induction_heads</pre>
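
        <p>
          As a minimal usage sketch (assuming TransformerLens'
          <code class="language-plaintext">HookedTransformer</code>; this
          snippet is illustrative rather than the exact notebook code), the
          detector above can be run on the repeated random-token input
          described earlier:
        </p>
        <pre>
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# [bos, t_1 ... t_N, t_1 ... t_N] with N = 25 random tokens
N = 25
rand_tokens = torch.randint(0, model.cfg.d_vocab, (1, N))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
input_tokens = torch.cat([bos, rand_tokens, rand_tokens], dim=-1)

_, cache = model.run_with_cache(input_tokens)
print(induction_head_detector(cache, model.cfg))</pre>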

        <h3 id="induction_result">Results</h3>
        <p>Below is a visual map of the induction heads present in GPT2</p>
        <iframe
          src="/assets/plotly/induction_head_only.html"
          height="650px"
          frameborder="0"
        ></iframe>

        <p>
          Below is an interactive visualization of the attention patterns for
          the induction heads identified above.
        </p>
        <iframe
          src="/assets/plotly/attention_viz_induction.html"
          width="100%"
          height="650px"
          frameborder="0"
        ></iframe>

        <h2 id="what_if">What happens if the induction input isn't present?</h2>
        <h3 id="normal_input">Inputs</h3>
        <p>
          Load a tiny subset of the
          <a href="https://huggingface.co/datasets/NeelNanda/pile-10k"
            >10K pile dataset.</a
            > For the purpose of this experiment, I used
          <code class="language-plaintext">batch = 1</code> and
          <code class="language-plaintext">sequence_length = 128.</code>
        </p>
        <p>
          A forward pass is also run on this input and the activations are cached
          as in the case above as well.
        </p>
        <p>
          For the induction heads that were identified in
          <a href="#induction_result">the section above</a>, we simply visualize
          the attention pattern for these heads.
        </p>
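        <p>
          Concretely, a small sketch of this setup (assuming
          <code class="language-plaintext">model</code> is the TransformerLens
          model from before, and that the heads found above are stored in a
          list <code class="language-plaintext">induction_heads</code>) could
          look like:
        </p>
        <pre>
from datasets import load_dataset

data = load_dataset("NeelNanda/pile-10k", split="train")
tokens = model.to_tokens(data[0]["text"])[:, :128]  # batch = 1, seq_len = 128

_, cache = model.run_with_cache(tokens)
for layer_idx, head_idx in induction_heads:
    pattern = cache["pattern", layer_idx][0, head_idx]  # [dest, src]
    # average attention mass each destination token places on the first token
    print(layer_idx, head_idx, pattern[:, 0].mean().item())</pre>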

        <h3 id="normal_res">Results</h3>
        <iframe
          src="/assets/plotly/attention_viz_normal.html"
          width="100%"
          height="650px"
          frameborder="0"
        ></iframe>
        As can be observed, these heads all pay an overwhelming amount of
        attention to the first token.

        <h2 id="closing">Closing thoughts</h2>
        <p>
          Guo et. al.,
          <d-cite
            key="guo2024activedormantattentionheadsmechanistically"
          ></d-cite>
          observed that not only the first token, but other special tokens, get
          an overwhelming amount of attention in dormant cases.
        </p>
        <p>
          They showed further evidence of this phenomenon by confirming that the
          value vectors of these tokens were much smaller than those of other
          tokens
          <d-footnote
            >This lends evidence to the fact that the information being written
            back to the residual stream is not of huge consequence.</d-footnote
          >
          and the residual stream norms for these tokens were relatively small as
          well.
        </p>
        <p>
          This however isn't the only explanation for the first-token/special
          token phenomenon observed in attention heads. Federico
          <d-footnote
            >Personally, I enjoy Federico's papers and especially his
            <a href="https://www.youtube.com/watch?v=FAspMnu4Rt0">interview</a>
            on MLST podcast.</d-footnote
          >
          et al. <d-cite key="barbero2025llmsattendtoken"></d-cite> has a
          paper where he also investigates why attention sinks exist and
          presents an alternative explanation. <strong>TLDR</strong>: They serve
          the purpose of preventing mode collapse.
        </p>]]></content><author><name></name></author><summary type="html"><![CDATA[More evidence]]></summary></entry><entry><title type="html">Reinforcement Learning Meets NER</title><link href="https://israel-adewuyi.github.io/blog/2025/ner_with_rl/" rel="alternate" type="text/html" title="Reinforcement Learning Meets NER" /><published>2025-05-01T00:00:00+00:00</published><updated>2025-05-01T00:00:00+00:00</updated><id>https://israel-adewuyi.github.io/blog/2025/ner_with_rl</id><content type="html" xml:base="https://israel-adewuyi.github.io/blog/2025/ner_with_rl/"><![CDATA[<d-contents>
          <nav class="l-text figcaption">
            <h3>Contents</h3>
            <div><a href="#karoche">TLDR</a></div>
            <div><a href="#intro">Introduction</a></div>
            <div><a href="#background">Background</a></div>
            <nav class="sub-nav">
              <div><a href="#why-llms-suck">Why LLMs suck at NER</a></div>
              <div>
                <a href="#summary">Summary of related approaches</a>
              </div>
            </nav>
            <div><a href="#method">Method</a></div>
            <nav class="sub-nav">
              <div><a href="#dataset">Dataset</a></div>
              <div><a href="#prompt">Prompt</a></div>
              <nav class="sub-sub-nav">
                <div><a href="#task-desc">Task description</a></div>
                <div><a href="#few-shot">Few shot demonstrations</a></div>
                <div><a href="#output-format">LLM Output Format</a></div>
              </nav>
              <div><a href="#reward">RL and Reward design</a></div>
            </nav>
            <div><a href="#experiment">Experiment</a></div>
            <nav class="sub-nav">
              <div><a href="#eval">Evaluation</a></div>
            </nav>
            <div><a href="#results">Results / Charts</a></div>
            <nav class="sub-nav">
              <div>
                <a href="#chart1">F1 score comparison across approaches</a>
              </div>
              <div><a href="#chart2">Model sizes comparison</a></div>
              <div>
                <a href="#chart3"
                  >F1 score comparison across epochs and base model</a
                >
              </div>
            </nav>
            <div><a href="#closing">Closing Thoughts</a></div>
            <div><a href="#acknowledgement">Acknowledgements</a></div>
            <div><a href="#citation">Citation Information</a></div>
            <!-- <div><a href="#footnotes">Code Blocks</a></div>
            <div><a href="#interactive-plots">Interactive Plots</a></div>
            <div><a href="#layouts">Layouts</a></div>
            <div><a href="#other-typography">Other Typography?</a></div> -->
          </nav>
        </d-contents>
        <h2 id="karoche">TLDR</h2>
        <p>
          <i
            >This work represents preliminary experimental reports, and should
            be treated as such. See the closing thoughts for more details.</i
          >
        </p>
        <ul>
          <li>
            Via RL training, we achieved up to a 7.3% increase in F1 score on a
            1.5B model, on an NER task, compared to 175B GPT3-based baselines.
          </li>
          <li>
            The RL trained model underperforms some other approaches, mostly
            involving some form of SFT.
          </li>
          <li>
            We offer some closing thoughts on this as well as future possible
            directions of research.
          </li>
        </ul>

        <p>
          Feel free to skip to Methods section or start from the introduction
          below.
        </p>
        <p>
          Code is up at
          <a href="https://github.com/israel-adewuyi/ner_with_grpo"
            >this repo</a
          >
        </p>
        <h2 id="intro">Introduction</h2>
        <p>
          Large Language Models (LLMs) built on the Transformer architecture
          <d-cite key="attentionneed"></d-cite>
          have transformed Natural Language Processing (NLP), achieving SOTA
          results in tasks such as text generation, translation, and sentiment
          analysis
          <d-cite key="radford2019language"></d-cite>. At the same time, Named
          Entity Recognition (NER)—the process of identifying and classifying
          proper names and other key terms in text—remains a core NLP task in
          applications like information extraction, question answering, machine
          translation
          <d-cite key="keraghel2024recentadvancesnamedentity"></d-cite>.
        </p>
        <p>
          Recently, researchers have revisited Reinforcement Learning (RL) as a
          means of adapting LLMs to specific objectives without full retraining.
          By defining reward functions or training reward models that is
          specific to the domain/task in question, RL fine‑tuning can elicit
          desired behaviors from a pre‑trained model. This approach has already
          shown promise in areas such as competitive mathematics
          <d-cite
            key="chen-etal-2025-learning, shao2024deepseekmathpushinglimitsmathematical, openai2025competitiveprogramminglargereasoning"
          ></d-cite>
          and code generation
          <d-cite
            key="gehring2025rlef, wang2025enhancingcodellmsreinforcement"
          ></d-cite
          >.
        </p>
        <p>
          In this work, we bring these threads together. We fine‑tune a 1.5B
          Qwen2.5
          <d-cite key="qwen2025qwen25technicalreport"></d-cite> model on the
          CoNLL2003 NER dataset
          <d-cite key="sang2003introductionconll2003sharedtask"></d-cite>, using
          carefully designed reward signals to guide entity recognition
          performance. Our results demonstrate that, even with modest model
          size, RL‑based adaptation can rival much larger architectures such as
          GPT‑3, highlighting the potential of reinforcement learning for
          structured NLP tasks.
        </p>

        <h2 id="background">Background</h2>
        <h3 id="why-llms-suck">Why LLMs suck at NER</h3>
        <p>
          LLMs like Qwen2.5-1.5B
          <d-cite key="qwen2025qwen25technicalreport"></d-cite>, Llama
          <d-cite key="grattafiori2024llama3herdmodels"></d-cite>, Gemini
          <d-cite key="geminiteam2024geminifamilyhighlycapable"></d-cite>, e.t.c
          are pretrained on massive datasets with the objective being to predict
          the next token in a sequence. This makes them great for tasks like
          text generation but less effective for NER.
        </p>
        <p>Why?</p>
        <p>
          NER requires a different approach, as it involves identifying and
          classifying entities within text. This is a token-level task, where
          the goal is to label specific tokens accurately.
        </p>
        <p>
          Keraghel et al.
          <d-cite key="keraghel2024recentadvancesnamedentity"></d-cite>
          provided a comprehensive overview of the advances in the field of NER,
          but most relevant approaches to this work, which are summarized below,
          are approaches that leverage LLMs to solve NER task.
        </p>
        <h3 id="summary">Summary of related approaches</h3>
        <p>
          The core idea in most of the works listed below involves primarily
          prompt engineering
          <d-footnote>
            Which is just a fancy way of saying almost all of them reformulated
            the task in some way that the LLM might be able to solve
            more easily.</d-footnote
          >
          and some combination of In Context Learning (ICL) and Supervised
          Finetuning (SFT). In the ICL paradigm, LLMs learn new tasks by being
          shown a few examples (few-shot) in the prompt which makes them
          flexible for new tasks without extra training while in the SFT
          paradigm, an LLM is further trained on specific, labeled data to make
          it better at the said task.
          <d-footnote
            >It's akin to fine-tuning a general tool for a specific
            job.</d-footnote
          >
        </p>
        <p>
          GPT-NER introduced by Wang et al
          <d-cite key="wang2023gptnernamedentityrecognition"></d-cite> and LTNER
          introduced by Yan et al
          <d-cite key="Yan2024LTNERLL"></d-cite>
          reformulated the sequence labelling task as a text generation task by
          prompting the LLM to generate the input text with the identified
          entities marked by special tokens. Both methods rely heavily on ICL.
          For retrieving these few-shot examples, GPT-NER investigated various
          strategies including embeddings derived from a fine-tuned NER model
          while LTNER utilizes vector-based retrieval from a knowledge base to
          find the most relevant examples for contextual learning. Additionally,
          GPT-NER introduced a self-verification strategy to combat
          hallucination.
        </p>
        <p>
          PromptNER introduced by Ashok and Lipton<d-cite
            key="Ashok2023PromptNERPF"
          ></d-cite>
          kept the task as a sequence labelling task
          <d-footnote
            >By prompting the LLM to list the entities in the text, given a
            predefined list of entities</d-footnote
          >, but they introduced Chain-of-Thought (CoT) prompting, as well as
          giving an explanation of all the predefined entity types.
        </p>
        <p>
          GoLLIE introduced in <d-cite key="Sainz2023GoLLIEAG"></d-cite> and
          InstructUIE introduced in
          <d-cite key="Wang2023InstructUIEMI"></d-cite> both proposed
          instruction tuning frameworks for information retrieval using LLMs.
          GoLLIE fine-tunes an LLM to follow annotation guidelines, with tasks
          and guidelines represented in a code-based format. In contrast,
          InstructUIE employs natural language instructions within a unified
          text-to-text framework to model various IE tasks.
        </p>
        <p>
          CodeIE <d-cite key="Li2023CodeIELC"></d-cite> and Code4UIE
          <d-cite key="Guo2023RetrievalAugmentedCG"></d-cite> also transform the
          sequence labelling task into a code generation task to leverage the
          code generation capabilities of LLMs.
        </p>
        <table style="border: 1px solid black">
          <thead>
            <tr>
              <th>Method</th>
              <th>Approach</th>
              <th>Base model</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>GPT-NER</td>
              <td>ICL</td>
              <td>Text-davinci-003</td>
            </tr>
            <tr>
              <td>LTNER</td>
              <td>ICL</td>
              <td>GPT-3.5-turbo</td>
            </tr>
            <tr>
              <td>PROMPT-NER</td>
              <td>ICL</td>
              <td>GPT4</td>
            </tr>
            <tr>
              <td>CodeIE</td>
              <td>ICL</td>
              <td>Code-davinci-002</td>
            </tr>
            <tr>
              <td>Code4UIE</td>
              <td>ICL</td>
              <td>Text-davinci-003</td>
            </tr>
            <tr>
              <td>GPT-NER</td>
              <td>ICL + SFT</td>
              <td>Text-davinci-003</td>
            </tr>

            <tr>
              <td>GoLLIE</td>
              <td>SFT</td>
              <td>Code-LLaMA 34B</td>
            </tr>

            <tr>
              <td>InstructUIE</td>
              <td>SFT</td>
              <td>Flan-T5-11B</td>
            </tr>
          </tbody>
        </table>

        <!-- { name: "OURS\n(Qwen2.5-1.5B)", size: "1.5B", highlight: true }, { name:
        "InstructUIE\n(Flan-T5-11B)", size: "11B", highlight: false }, { name:
        "LTNER\n(GPT3.5-Turbo)", size: "175B", highlight: false }, { name:
        "PromptNER\n(GPT4)", size: "175B", highlight: false }, -->

        <h2 id="method">Method</h2>
        <h3 id="dataset">Dataset</h3>
        <p>
          The CoNLL2003 dataset introduced by Tjong Kim Sang and De Meulder
          <d-cite key="sang2003introductionconll2003sharedtask"></d-cite> has
          four types of named entities: Location (LOC), Organization (ORG),
          Person (PER), and Miscellaneous (MISC). We leveraged the preprocessed
          NER dataset by Li et al. (2019a)
          <d-cite key="li2022unifiedmrcframeworknamed"></d-cite>. A sample from
          the dev set which is downloadable from their
          <a href="https://github.com/ShannonAI/mrc-for-flat-nested-ner"
            >github repo</a
          >
          looks like
        </p>

        <pre>
  {
    "context": "4 - Goran Ivanisevic ( Croatia ) beat Scott Draper ( Australia ) 6-7 ( 1-7 ) 6-3 6-4 6-4",
    "end_position": [
      3,
      9
    ],
    "entity_label": "PER",
    "impossible": false,
    "qas_id": "174.2",
    "query": "person entities are named persons or family.",
    "span_position": [
      "2;3",
      "8;9"
    ],
    "start_position": [
      2,
      8
    ]
  }
</pre
        >
        <p>
          <code>context</code> - the input text from which entities are to be
          extracted. <br />
          <code>entity_label</code> - the entity to be extracted <br />
          <code>query</code> - an explanation of the entity to be extracted.
        </p>
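        <p>
          To make the field semantics concrete, here is a small sketch
          (assuming, as the sample suggests, that
          <code>start_position</code> and <code>end_position</code> are
          inclusive token indices into the whitespace-split
          <code>context</code>):
        </p>
        <pre>
sample = {
    "context": "4 - Goran Ivanisevic ( Croatia ) beat Scott Draper ( Australia ) 6-7 ( 1-7 ) 6-3 6-4 6-4",
    "start_position": [2, 8],
    "end_position": [3, 9],
}

tokens = sample["context"].split()
entities = [
    " ".join(tokens[start : end + 1])
    for start, end in zip(sample["start_position"], sample["end_position"])
]
print(entities)  # ['Goran Ivanisevic', 'Scott Draper']</pre>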
        <br />
        <h3 id="prompt">Prompt</h3>
        <pre>
  """
  A conversation between User and Assistant. The User provides a string of words. 
  The task of the Assistant is to identify all the {entity_label} entities 
  in the given string and return the entities surrounded by an entity tag.
  DESCRIPTION: {query}
  
  The reasoning process should be enclosed within &lt;think&gt; &lt;/think&gt; tags, 
  and the relevant words should be enclosed within &lt;entity&gt; &lt;/entity&gt; tags.
  i.e &lt;think&gt; reasoning process here &lt;/think&gt; &lt;entity&gt; comma separated 
  list of words that are locations&lt;/entity&gt;
  
  {example}
  
  User: {context}
  Assistant: 
  """
              </pre
        >
        <p>
          <code>entity_label</code> - the entity to be extracted <br />
          <code>query</code> - an explanation of the entity to be extracted -
          same as in the dataset. <br />
          <code>example</code> - few shot examples for the current
          <code>entity_label</code> <br />
          <code>context</code> - the input text from which entities are to be
          extracted. <br />
        </p>
        <p>
          The prompt construction method used in this work relies heavily on, and
          closely mirrors, the fantastic work done in the research literature.
          An overview of the relevant parts is provided below:
        </p>

        <p style="margin-bottom: 24px"></p>
        <h4 id="task-desc">Task description</h4>
        <p>
          Following the preprocessing step done by Tjong Kim Sang and De Meulder
          <d-cite key="sang2003introductionconll2003sharedtask"></d-cite> and
          Wang et al.
          <d-cite key="wang2023gptnernamedentityrecognition"></d-cite>, for each
          input sentence, N prompts are constructed where N is the number of
          entity types in the dataset <d-footnote>N = 4 for CoNLL2003</d-footnote>.
          Following the PromptNER
          <d-cite key="Ashok2023PromptNERPF"></d-cite> paper, we asked the model
          to generate a CoT inside the think tag and we also added an
          explanation for each entity type to the prompt.
        </p>
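        <p>
          As a rough sketch of this construction (the names
          <code>PROMPT_TEMPLATE</code>, <code>entity_queries</code> and
          <code>few_shot_examples</code> are placeholders, not the exact
          variables used in the training code):
        </p>
        <pre>
PROMPT_TEMPLATE = "..."  # the template string shown in the Prompt section above

def build_prompts(context: str, entity_queries: dict, few_shot_examples: dict) -> list:
    # one prompt per entity type (N = 4 for CoNLL2003)
    return [
        PROMPT_TEMPLATE.format(
            entity_label=label,
            query=query,
            example=few_shot_examples[label],
            context=context,
        )
        for label, query in entity_queries.items()
    ]</pre>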
        <br />
        <br />
        <h4 id="few-shot">Few shot demonstrations</h4>
        <p>
          To generate in-context examples, we randomly sample 2 sentences from
          the dev set of the CoNLL dataset for each entity type. Deepseek-V3-base
          is prompted to solve the task and the result is manually inspected and
          included in the prompt, without any modifications.
        </p>

        <br />
        <h4 id="output-format">LLM Output Format.</h4>
        <p>
          As visible from the prompt template, the LLM is prompted to output the
          CoT inside the think tag and then the list of entities in the entity
          tag.
        </p>
        <p>
          As an example, consider the following. The LLM is prompted to identify
          the location entity, <code>LOC</code> in the following text:
          <code
            >Japan began the defence of their Asian Cup title with a lucky 2-1
            win against Syria in a Group C championship match on Friday .</code
          >
          Results from Deepseek-V3-base :
        </p>
        <pre>
  <code
          >&lt;think&gt; </code> 1. **Japan**: This is a country in East Asia, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity. 
  2. **Asian Cup**: This refers to a football tournament, not a specific geographic or political location, so it does not qualify as a LOCATION entity. 
  3. **Syria**: This is a country in the Middle East, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity. 
  4. **Group C**: This refers to a group in the tournament, not a geographic or political location, so it does not qualify as a LOCATION entity. 
  5. The rest of the words ("began", "the", "defence", "of", "their", "title", "with", "a", "lucky", "2-1", "win", "against", "in", "a", "championship", "match", "on", "Friday") are not location names and do not fit the criteria. 
  <code
          >&lt;/think&gt; </code> 
  <code
          >&lt;entity&gt; </code>Japan, Syria<code
          >&lt;/entity&gt; </code>
          </pre> <p style="margin-bottom: 24px"></p> <h3 id="reward">RL and Reward Design</h3> <p> To guide the model to output things in the desired format, we utilize GRPO <d-cite key="shao2024deepseekmathpushinglimitsmathematical"></d-cite> which discards the critic model which is meant to provide some baseline for the advantage estimates. Instead, for each prompt, GRPO samples <code>m</code> outputs, which are referred to as <strong>group</strong> and the average outcome reward from the group serves as an estimate for the baseline. </p> <p> For each output in the group, four reward functions were designed to provide signals to the policy model. The design of the reward function was inspired by the amazing work of <a href="https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb">Willccb's grpo demo</a>. </p> <ul> <li> <strong>Soft Format Reward:</strong> Awards 0.5 if the output follows the required format (<code>&lt;think&gt;...&lt;/think&gt; &lt;entity&gt;...&lt;/entity&gt;</code>), ensuring structural consistency. </li> <li> <strong>Correctness Reward:</strong> Gives 2.0 if the extracted entities exactly match the ground truth, emphasizing accuracy. </li> <li> <strong>Positive Entity Correctness Reward:</strong> Adds 0.5 for each correctly identified entity, rewarding partial correctness. </li> <li> <strong>Negative Entity Correctness Reward:</strong> Subtracts 0.5 for each incorrectly included entity, penalizing overprediction. </li> </ul> <p> These rewards work together to encourage the model to identify entities accurately while adhering to the expected format. For example, if the model correctly identifies “Japan, Syria” as Locations but includes an extra incorrect entity, it receives a positive reward for the correct entities but a penalty for the mistake. </p> <div class="highlight"> <p><strong>Example Reward Calculation:</strong></p> <p> Input: “Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria.” </p> <p> Ground Truth: <code>&lt;entity&gt;Japan, Syria&lt;/entity&gt;</code> </p> <p> Model Output: <code>&lt;entity&gt;Japan, Syria, Asian Cup&lt;/entity&gt;</code> </p> <ul> <li>Soft Format: 0.5 (correct format)</li> <li>Correctness: 0.0 (not an exact match)</li> <li>Positive Entity: 1.0 (0.5 for Japan + 0.5 for Syria)</li> <li>Negative Entity: -0.5 (penalty for Asian Cup)</li> </ul> </div> <h2 id="experiment">Experiment</h2> <p> As stated earlier, <a href="https://huggingface.co/Qwen/Qwen2-1.5B-Instruct">Qwen2-1.5B-Instruct</a> was utilized in this experiment. <a href="https://huggingface.co/docs/trl/main/en/index">TRL library</a> provided by HuggingFace, alongsides modifications to <a href="https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb">Willccb's grpo demo</a> repository, were used to construct a training and eval pipeline. The model was also trained on 4 epochs on <strong>2 Nvidia A100</strong> GPUs. Relevant hyperparameters are: </p> <pre>
  from trl import GRPOConfig

  training_args = GRPOConfig(
    ...
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    bf16=True,
    per_device_train_batch_size=4,
    num_generations=8,          # GRPO group size: m sampled completions per prompt
    max_prompt_length=2048,
    max_completion_length=2048,
    num_train_epochs=4,
    save_strategy="epoch",
    max_grad_norm=0.1,
    report_to="wandb",
    log_on_each_node=False,
  )
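
  # Illustrative sketch (not the exact training script): the four reward
  # functions described above are passed to TRL's GRPOTrainer together with
  # the arguments above. The reward-function and dataset names below are
  # placeholders; see the repository for the real definitions.
  from trl import GRPOTrainer

  trainer = GRPOTrainer(
      model="Qwen/Qwen2-1.5B-Instruct",
      reward_funcs=[soft_format_reward, correctness_reward,
                    positive_entity_reward, negative_entity_reward],
      args=training_args,
      train_dataset=train_dataset,
  )
  trainer.train()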
        </pre> <p>More details can be found in the repository.</p> <h3 id="eval">Evaluation</h3> <p>F1 score is reported for the RL-trained model.</p> <ul> <li> To compare with other methods, the best F1 score across the 4 epochs is reported. </li> <li>vLLM was used for sampling from the models.</li> <li> To ensure robustness, the F1 score is averaged across 16 generations and reported, both in the comparison to other methods and in the comparison across epochs. </li> <li> To evaluate the RL-trained model, <a href="#few-shot">few-shot examples</a> were included in the prompt. </li> </ul> <h2 id="results">Results / Charts</h2> <h3 id="chart1">F1 score comparison across approaches</h3> <p> The chart below shows the F1 score of the fine-tuned model in comparison with other methods described in the <a href="#summary">relevant works section</a>. </p> <div style="max-width: 800px; margin: auto"> <canvas id="resultsChart" width="800" height="400"></canvas> </div> <h3 id="chart2">Model sizes comparison</h3> <p> The chart below compares the model sizes across all the methods listed above. </p> <div style="max-width: 900px; margin: auto"> <canvas id="modelSizeChart" width="800" height="400"></canvas> </div> <p>Notes:</p> <ul> <li> The parameter count of code-davinci-003 isn't publicly known, but inferring from the naming convention, it's based on the GPT3 model, which is reported to be 175B. <d-cite key="Brown2020LanguageMA"></d-cite> </li> <li> GPT3.5-Turbo also doesn't have its parameter count publicly known, but based on the GPT3 parameter count, 175B is a conservative estimate. </li> <li> The same logic applies to GPT4, though it's rumoured to be around 1.7T parameters, so 175B is also conservative. </li> </ul> <h3 id="chart3">F1 score comparison across epochs</h3> <p> The chart below shows the result of evaluating a saved checkpoint from each epoch, as well as the base Qwen2-1.5B model with and without few-shot examples. </p> <div style="max-width: 900px; margin: auto"> <canvas id="baseModelChart" width="800" height="400"></canvas> </div> <h2 id="closing">Closing thoughts</h2> <p> This work explored, albeit in a limited scope, how much RL training can improve the performance of LLMs on the NER task, and the results show impressive performance for small-sized LLMs. </p> <p> It's important to emphasize (and these are my current thoughts) that RL training benefits from a good base pretrained model, as the current RL training paradigm encourages exploitation more than exploration. It's useful to think of the current RL training paradigm as stabilizing the distribution over the domain of interest: if the model cannot sample the answer under any inference sampling, even as the number of generations tends to infinity, the behaviour most likely cannot be learned during RL training. This intuition is informed by the performance of LLMs on math and code tasks under base + SFT + RL vs only base and RL. </p> <p> Future work would investigate other NER datasets (in other domains), how much performance is lost or retained on other tasks/benchmarks of interest, different reward structures, as well as how small the models can be while still getting competitive performance, especially in real-world applications, where the trade-off between efficiency and correctness is often the focal point. </p> <p> It would also be interesting to do interpretability work on these models.
What changes about the model when it is RL-trained? </p> <h2 id="acknowledgement">Acknowledgements</h2> <p> Gratitude goes to the Institute of Software Engineering at Innopolis University, led by <a href="https://scholar.google.com/citations?hl=en&user=16AyxX0AAAAJ&view_op=list_works&sortby=pubdate">Professor Vladmir Ivanov</a>, for providing compute. I am also grateful to the following persons: </p> <ul> <li> Ilnur Khadiev for helping out with fixing GPU-related issues every so often. </li> <li> Nursultan Abdullaev for providing feedback on an initial draft. </li> </ul> <h2 id="citation">Citation information</h2> <p>Cite as:</p> <div class="highlight"> Israel, Adewuyi. (May, 2025). Reinforcement Learning meets NER. https://israel-adewuyi.github.io/blog/2025/ner_with_rl/. </div> or <pre>

  @article{israel2025ner_rl,
    title   = "Reinforcement Learning Meets NER",
    author  = "Israel, Adewuyi",
    journal = "israel-adewuyi.github.io",
    year    = "2025",
    month   = "May",
    url     = "https://israel-adewuyi.github.io/blog/2025/ner_with_rl/"
  }
        </pre> ]]></content><author><name></name></author><summary type="html"><![CDATA[an attempt at solving Named Entity Recognition with RL training.]]></summary></entry><entry><title type="html">Replicating GraphRAG paper</title><link href="https://israel-adewuyi.github.io/blog/2024/replicating-graphrag/" rel="alternate" type="text/html" title="Replicating GraphRAG paper"/><published>2024-11-08T00:00:00+00:00</published><updated>2024-11-08T00:00:00+00:00</updated><id>https://israel-adewuyi.github.io/blog/2024/replicating-graphrag</id><content type="html" xml:base="https://israel-adewuyi.github.io/blog/2024/replicating-graphrag/"><![CDATA[<d-contents> <nav class="l-text figcaption"> <h3>Contents</h3> <div><a href="#background">Background</a></div> <div><a href="#notes">Organizational notes</a></div> <div><a href="#chunks">Text Chunking</a></div> <div><a href="#entities">Entity extraction</a></div> <div><a href="#clustering">Graph clustering</a></div> <div><a href="#community">Community summary</a></div> <div><a href="#query">How are queries answered?</a></div> <div><a href="#discussion">Discussion</a></div> <div><a href="#Acknowledgements">Acknowledgements</a></div> </nav> </d-contents> <h2 id="background">Introduction</h2> <p> Microsoft research recently put out the GraphRAG paper <d-cite key="edge2024localglobalgraphrag"></d-cite>. In this post, I share my attempt at replicating the paper and some thoughts about tradeoffs to be made when working with retrieval systems in general. </p> <p> The summary of the paper is that we can structure the information in a body of documents as a graph by thinking of every object in the document as an entity and drawing its relationships with other entities. Once we have this graph, we can then reason over it to draw insights and conclusions that we might otherwise not be able to draw. <figure> <img src="/assets/img/graphrag/pipeline.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="GraphRAG pipeline"/> <figcaption>GraphRAG pipeline.</figcaption> </figure> </p> <p> In the paper, Microsoft research used podcast transcripts and news articles as the knowledge sources over which retrieval is done. I decided to use a podcast episode - specifically, <a href="https://www.youtube.com/watch?v=UTuuTTnjxMQ">Dwarkesh Patel's interview with Mechanistic Interpretability researchers Trenton Bricken and Sholto Douglas</a>. </p> <p> I would expect readers to be fairly familiar with the GraphRAG paper <d-cite key="edge2024localglobalgraphrag"></d-cite>. </p> <h2 id="notes">Organizational notes</h2> I encourage readers to read through the whole post and refer back to this section from time to time. <ul> <li> For evaluation, I used one of the transcripts of <a href="https://www.youtube.com/watch?v=UTuuTTnjxMQ">Dwarkesh Patel's podcast</a> as the knowledge source. </li> <li> During development, I used a bunch of models to test out the different components, but for the final graph index generation and inference, I used <a href="https://console.groq.com/settings/limits">LLAMA 3.2 90B text-preview</a><d-cite key="llama3_2"></d-cite>, provided by <a href="https://console.groq.com/playground">Groq</a>.
That being said, Gemma 9B seems to perform the best on entity-relationship extraction <d-footnote>I judged this because I have listened to the podcast episode and I was able to roughly assess the quality of the generations for the different models.</d-footnote>. </li> <li> I implemented a single hierarchy of clustering graph nodes and edges. </li> <li> For the sake of optimizing the number of API calls/requests to <a href="https://console.groq.com/playground">Groq</a>, I implemented global search with a vectorDB, as opposed to the LLM summarization used in the GraphRAG paper. <d-cite key="edge2024localglobalgraphrag"></d-cite> </li> <li> I did not implement <a href="https://microsoft.github.io/graphrag/query/local_search/">local search</a>, covariates, and a couple of other details that were expensive in tokens/API calls. </li> <li> The LLM-derived knowledge graph can be viewed <a href="/assets/html/network.html">here</a>. </li> <li> <a href="https://github.com/israel-adewuyi/graphrag">Link to repository</a>, <a href="https://graphrag-impl.streamlit.app/">Link to streamlit chat interface</a><d-footnote>I did not exactly optimise the chat interface to be a conversational agent like SOTA chat agents. Ask questions, get a response. </d-footnote> </li> </ul> <h2 id="chunks">Text Chunking</h2> <p> Dwarkesh provides <a href="https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken?open=false#%C2%A7transcript">links</a> to the transcripts of his podcast episodes, so it was easy to get the transcript of this episode. To preserve the notion of turn-based conversation, I chunked the transcript text based on each speaker's speech. <details> <summary>More info</summary> <p> Usually, the podcast transcripts are roughly structured as <figure class="highlight"> <pre><code class="language-python" data-lang="python"><span class="kn">Speaker A (timestamp)</span> 
<span class="kn"># Speaker A's speech</span>
<span class="kn">Speaker B (timestamp)</span> 
<span class="kn"># Speaker B's speech</span>
              </code></pre> </figure> </p> </details> </p> <p> After chunking the <a href="https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken?open=false#%C2%A7transcript">transcripts</a> along each speaker's speech, there were approximately 483 chunks. I analyzed the token length of each chunk. <img src="/assets/img/graphrag/Sholto&Trenton.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Token distribution for chunks in the original chunk set"/> </p> <p> Along the lines of the original paper, and for reasons I explain in the next section, I decided to merge chunks such that each chunk had an average size of 1000 tokens <d-footnote>In the original paper, the authors used an average size of 600 tokens. </d-footnote>. This reduced the number of chunks from <code class="language-plaintext">483</code> to <code class="language-plaintext">47</code>. <img src="/assets/img/graphrag/Sholto&Trenton_merged.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Token distribution for chunks after merging"/> </p> <h2 id="entities">Entity extraction</h2> <p> I hardcoded a list of entity types: I manually asked ChatGPT to extract entity types from a random sample of chunks, which I then filtered for repeated entity types - this was because extracting the entity types for every chunk was expensive with respect to API calls. I ended up with a list of <a href="https://github.com/israel-adewuyi/graphrag/blob/master/config/entities.py">33 entity types</a>. </p> <p>The entity-relationship extraction phase is in two stages.</p> <p> In the first stage, I prompted the LLM to extract all the entities and relationships from each chunk and output the results in JSON format. For each entity, I extract the entity_name, entity_type and description. For each relationship, I extract the source entity, target entity and relationship description. </p> <figure> <img src="/assets/img/graphrag/image.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Snippet from entity extraction prompt"/> <figcaption>Snippet from entity extraction prompt.</figcaption> </figure> <p> In the second stage, for each entity-relationship JSON retrieved for each chunk, I prompt the LLM a second time to check if it has extracted all the possible entities and relationships. The authors referred to this as gleaning <d-footnote>The prompts I used here and other parts of this implementation were largely taken from <a href="https://github.com/microsoft/graphrag/tree/main/graphrag/prompt_tune/template">the official github implementation.</a> </d-footnote>. For most chunks, this seemed to extract almost as many relationships as it did the first time. </p> <p> A <a href="/assets/html/network.html">knowledge graph</a> is then built from the entities and their relationships that have been extracted.
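 A minimal sketch of this construction step is shown below (the record structure and field names are illustrative placeholders, not the exact ones used in the repository): <figure class="highlight"> <pre><code class="language-python" data-lang="python">import networkx as nx

# One parsed JSON object per chunk, as produced by the extraction prompts above.
# The example record and field names here are illustrative placeholders.
extracted_records = [
    {
        "entities": [
            {"entity_name": "Trenton Bricken",
             "entity_type": "PERSON",
             "description": "Interpretability researcher"},
        ],
        "relationships": [
            {"source_entity": "Trenton Bricken",
             "target_entity": "Anthropic",
             "relationship_description": "works at"},
        ],
    },
]

G = nx.Graph()
for record in extracted_records:
    for ent in record["entities"]:
        # Nodes carry the entity type and description as attributes.
        G.add_node(ent["entity_name"],
                   entity_type=ent["entity_type"],
                   description=ent["description"])
    for rel in record["relationships"]:
        # add_edge creates missing endpoint nodes automatically.
        G.add_edge(rel["source_entity"], rel["target_entity"],
                   description=rel["relationship_description"])</code></pre> </figure>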
I used the <a href="https://networkx.org/documentation/stable/index.html">NetworkX</a> library to construct the network graph.<d-cite key="trajanoska2023enhancingknowledgegraphconstruction"></d-cite> <d-cite key="SciPyProceedings_11"></d-cite> </p> <p> A problem here, which can be observed from the <a href="/assets/html/network.html">knowledge graph</a>, is that there are a bunch of entities that are not captured in any relationship with another entity. </p> <p> In practice, this means that there were two API calls per chunk. In the previous setup, where I had > 450 chunks, this would have been close to 1000 API calls, but now I had fewer than 100 chunks for the whole document. </p> <h2 id="clustering">Graph clustering</h2> <p> Now, we have a graph that's fairly representative of the whole document. A key insight at this stage is that we can group nodes and edges into communities, which in principle should be representative of some semantic relationship between all the nodes in the community. </p> <p> Following the original paper, I used the Leiden algorithm <d-cite key="traag2019louvain"></d-cite> provided in the <a href="https://cdlib.readthedocs.io/en/latest/">cdlib library</a> to cluster the graph into communities, and I did the graph clustering for a single hierarchy level. <img src="/assets/img/graphrag/graph_community.png" class="img-fluid rounded z-depth-1" width="70%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Graph communities after clustering"/> </p> <h2 id="community">Community summary</h2> <p> My implementation here differs significantly from the original implementation. To understand the decision I made, let's delve into what happens when there is a global search on a query. </p> <p> In the original implementation of global search, when there is a query, an LLM is used to generate an intermediate answer and a relevancy score for this query, for <strong>each</strong> community summary. Now, this is a lot of API calls! <d-footnote>See <a href="https://console.groq.com/settings/limits">here</a> for the rate limits on requests and tokens per day</d-footnote> </p> <p> A neat thing I thought to do was to generate a list of conclusions or mini-summaries for each community cluster. So instead of just a single community summary, there is a list of insights and conclusions generated for each community after the clustering process. </p> <p> These conclusions are indexed in a <a href="https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html">vectorDB</a>. </p> <h2 id="query">How are queries answered?</h2> <p> For each query, I run a similarity search against all vectors in the vectorDB from the previous section and return 15 vectors,<d-footnote>The choice here is arbitrary. A more principled way would have been to use some threshold similarity score or keep adding retrieved docs to the prompt till it exceeds the context length or number of tokens per query.</d-footnote> along with their similarity scores. These conclusions, their scores and the original query are then fed into the LLM for a synthesized response. </p> <p> To be efficient with tokens, I also implemented response caching. This means that if a query is sent and, at some time later, a similar query is sent <d-footnote>or one the system deems similar</d-footnote>, it returns the previous results.
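 Putting the retrieval and caching steps together, a rough sketch is shown below (the embedding model, the <code>conclusions</code> list and the <code>call_llm</code> helper are placeholders for code that lives in the repository): <figure class="highlight"> <pre><code class="language-python" data-lang="python">from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Index the per-community conclusions once.
embeddings = HuggingFaceEmbeddings()        # any sentence-embedding model
db = FAISS.from_texts(conclusions, embeddings)
cache = {}

def answer(query: str) -> str:
    # The real check is similarity-based rather than an exact-match lookup.
    if query in cache:
        return cache[query]
    # Retrieve the 15 most similar conclusions together with their scores.
    hits = db.similarity_search_with_score(query, k=15)
    context = "\n".join(f"{doc.page_content} (score={score:.3f})"
                        for doc, score in hits)
    # Synthesize a response from the query and the retrieved conclusions.
    response = call_llm(query=query, context=context)
    cache[query] = response
    return response</code></pre> </figure>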
You could imagine that in a large-scale system, where there is the potential for a lot of duplicate queries, this saves resources. </p> <p>I did not implement conversation history.</p> <h2 id="discussion">Discussion / Final thoughts</h2> <ol> <li> As mentioned in <a href="#entities">this discussion</a>, there were some entities without relationships with other entities in the <a href="/assets/html/network.html">knowledge graph</a>. My thought, supported by the paper, is that it would be possible to extract more relationships and also entities by running more rounds of gleaning. </li> <li> The quality of the generations, from the knowledge graph to the query responses, would probably be better with more advanced models like GPT 4o. </li> <li> Gleaning was able to capture significantly more relationships, but more entities were also introduced to the knowledge graph. I expect that with more rounds of gleaning, this should be resolved neatly. However, it's unclear if more rounds of gleaning will be supported by the context length of the model. </li> <li> The approach of drawing insights from communities and using these insights to generate responses would add a lot of latency to the retrieval process when this is done over huge datasets, such as the whole set of all Dwarkesh's podcasts <d-footnote>I plan to work on this sometime in the near future</d-footnote>. The rationale is that similarity search is over pieces of text which are much smaller than chunks, but for each chunk there are 5 - 20 such texts. </li> <li> Work is not done. I plan to extend this to, at the time of this writing, ~80 Dwarkesh episodes. </li> </ol> <h2 id="Acknowledgements">Acknowledgements</h2> <ol> <li> <a href="https://rustam-lukmanov.com/">Professor Rustam Lukmanov</a> for suggesting the project in the first place, providing feedback and the motivation to do this write-up. </li> <li> <a href="http://kimfom.space">Kim Fom</a> for useful feedback on writing good code. </li> <li> <strong>Khush Patel</strong> for some advice related to deployment of the admittedly unpolished interface and tradeoffs when choosing a vectorDB. </li> </ol>]]></content><author><name></name></author><summary type="html"><![CDATA[a replication of 'From Local to Global']]></summary></entry><entry><title type="html">Replicating ‘Refusal Mechanism’</title><link href="https://israel-adewuyi.github.io/blog/2024/replicating-refusal/" rel="alternate" type="text/html" title="Replicating ‘Refusal Mechanism’"/><published>2024-10-05T00:00:00+00:00</published><updated>2024-10-05T00:00:00+00:00</updated><id>https://israel-adewuyi.github.io/blog/2024/replicating-refusal</id><content type="html" xml:base="https://israel-adewuyi.github.io/blog/2024/replicating-refusal/"><![CDATA[<d-contents> <nav class="l-text figcaption"> <h3>Contents</h3> <div><a href="#background">Background</a></div> <div><a href="#summary">Summary</a></div> <div><a href="#setup">Setup</a></div> <div><a href="#results1">Results with Gemma 2-2B</a></div> <div><a href="#results2">Results with Gemma 2-9B</a></div> <div><a href="#discussion">Discussion</a></div> </nav> </d-contents> <h2 id="background">Background</h2> <p> This post represents a step towards my understanding of model behaviour and how to align LLMs with our interests. When I first read the blog, it seemed approachable on the surface level: I felt I could track what the author was doing as well as their motivations, and it felt like a good experiment to try to replicate.
</p> <p> This also represents an attempt to upskill on Mechanistic Interpretability tooling. </p> <p> This post is based on <d-cite key="andy_refusal"></d-cite>. If you need a more in-depth explanation, or a refresher, I suggest the reader go through the blog and return, because this writeup just summarises my findings and assumes the reader is familiar with mech interp-related terms. </p> <hr/> <h2 id="summary">Summary</h2> <ul> <li> I investigated the refusal behaviour as described in <d-cite key="andy_refusal"></d-cite> on the Gemma 2 suite of models, specifically Gemma 2-2B and Gemma 2-9B. </li> <li> I couldn't steer with the refusal heads' contributions on Gemma 2-2B. </li> <li> I could steer with the refusal heads' contributions on Gemma 2-9B, albeit with a significant increase in the scaling factor (> 26x). </li> <li>For both models, I could steer using the difference vector.</li> <li> For inhibiting the refusal behaviour on harmless prompts, I could not steer with either the refusal head contributions or the difference vector. </li> </ul> <hr/> <h2 id="setup">Setup</h2> <p> To measure the refusal behaviour, <d-cite key="andy_refusal"></d-cite> used <code class="language-plaintext">logit[sorry] - logit[sure]</code> as the metric <d-footnote> My intuition is that this metric is quite lossy. See <strong>Takeaways</strong> for discussion on this. </d-footnote>. The justification is that if the model is going to refuse a behaviour, the generation typically starts with “Sorry”, and if the model is going to act out the behaviour, “Sure” would be one of the top predicted next tokens. </p> <p> Initially, I tried using the dataset of harmful and harmless objects that <d-cite key="andy_refusal"></d-cite> used, but I ran into trouble making sense of the results. Upon investigation, I realized some objects were multi-token, which was just a curse to analyze. So I decided to cherry-pick objects that were single tokens instead <d-footnote>Link to dataset I used</d-footnote>. </p> <p> I followed the <a href="https://ai.google.dev/gemma/docs/formatting">Gemma instruction prompt template</a>. </p> <hr/> <h2 id="results1">Results with Gemma 2-2B</h2> <h3>Residual stream patching</h3> <img src="/assets/img/refusal_replication/2-2b_resid_attrib_plot.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Logit attribution for the residual stream at each layer"/> <p> This doesn't compare cleanly with the results from <d-cite key="andy_refusal"></d-cite>. The absolute value of the refusal score for harmful logits appears to be higher here than in <d-cite key="andy_refusal"></d-cite>. For harmless logits, the opposite appears to be true. </p> <h3>Residual stream activation patching</h3> <img src="/assets/img/refusal_replication/2B resid_act_patch.png" class="img-fluid rounded z-depth-1" width="50%" height="40%" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Patching residual stream at each layer"/> <p> The results at the <code class="language-plaintext">obj</code> token position, as well as at the last token position, are as expected. At the <code class="language-plaintext">'.'</code> token position, which will henceforth be regarded as the <code class="language-plaintext">post obj</code> token position, layers 8 - 15 seem to be carrying signals related to the refusal behaviour. Going forward, these are the layers of interest.
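 To make this concrete, a rough sketch of this layer-by-position patching sweep with TransformerLens is shown below (the prompt variables are placeholders, and the exact prompts and metric in my code differ slightly): <figure class="highlight"> <pre><code class="language-python" data-lang="python">from transformer_lens import HookedTransformer
import transformer_lens.patching as patching

model = HookedTransformer.from_pretrained("gemma-2-2b-it")

# Placeholder prompts: same template, with a harmful vs. a harmless object.
harmful_tokens = model.to_tokens(harmful_prompt)
harmless_tokens = model.to_tokens(harmless_prompt)
_, harmful_cache = model.run_with_cache(harmful_tokens)

def refusal_score(logits):
    # logit[sorry] - logit[sure] at the last position (assumes single tokens).
    return (logits[0, -1, model.to_single_token(" Sorry")]
            - logits[0, -1, model.to_single_token(" Sure")])

# Run on the harmless prompt and patch in residual-stream activations from
# the harmful run, one (layer, position) at a time.
results = patching.get_act_patch_resid_pre(
    model, harmless_tokens, harmful_cache, refusal_score
)</code></pre> </figure>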
</p> <h3>Attention layer activation patching</h3> <p> The <code class="language-plaintext">resid_post</code> at any layer can be decomposed into <code class="language-plaintext">resid_post = resid_pre + attn_out + mlp_out </code>. So let's see what's up with <code class="language-plaintext"> attn_out </code>. </p> <img src="/assets/img/refusal_replication/2B attn_out_act_patch.png" class="img-fluid rounded z-depth-1" width="50%" height="40%" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Patching attention output at each layer"/> <ul> <li> A surprising result is that attn_out activation patching cannot fully recover the refusal behaviour. This is evident because at the post-object token position, the refusal score at the final layer is 0.7707379. </li> <li> By layer 15, the score is close to the final layer's refusal score - 0.7689 - which seems to correlate with the results from residual stream activation patching and suggests that, indeed, layers after layer 15 aren't contributing as much to the refusal behaviour. </li> </ul> <h3>MLP Layer activation patching</h3> <p> I decided to run activation patching on the MLP output of each layer as well, just to see what gives. </p> <img src="/assets/img/refusal_replication/2B mlp_out_act_patch.png" class="img-fluid rounded z-depth-1" width="50%" height="40%" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Patching MLP output at each layer"/> <p> In <i>retrospect</i>, this result makes sense. One can think of it as follows: patching in at the <code class="language-plaintext">obj</code> token position is analogous to replacing the harmless object with the harmful object in the prompt. The refusal score at the <code class="language-plaintext">obj</code> position is <code class="language-plaintext">0.888</code>. </p> <h3>Attention heads activation patching</h3> <img src="/assets/img/refusal_replication/2B attn_head_act_patch.png" class="img-fluid rounded z-depth-1" width="50%" height="40%" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();" alt="Patching individual attention heads"/> <p> Setting an arbitrary threshold of <code class="language-plaintext">0.005</code>, 11 heads were contributing to the refusal behaviour, and this set of heads was selected as sufficient for the refusal behaviour. </p> <h3>Steering</h3> <h4>With difference vector</h4> <h4>With activation vector</h4> <hr/> <h2 id="results2">Results with Gemma 2-9B</h2> <h3>Residual stream attribution</h3> <h3>Residual stream activation patching</h3> <h3>Attention Layer activation patching</h3> <h3>MLP Layer activation patching</h3> <h3>Attention heads activation patching</h3> <h3>Steering</h3>]]></content><author><name></name></author><summary type="html"><![CDATA[a replication of the initial experiments on the 'Refusal Mechanism']]></summary></entry></feed>