Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs

Preliminary evidence that random parameter selection can match full parameter RL finetuning.

Introduction

Reinforcement learning fine-tuning is a new axis of scale for improving the performance of Large Language Models (LLMs), with labs scaling RL compute to levels on par with pretraining. Recent works have also attempted to shed light on how and why RL really works.

Important to this report, Mukherjee et al. (2025) showed that RLVR finetunes only a sparse subnetwork in LLMs, as little as 5-30% of parameters. With efficiency in mind, we ask: if most parameters don't change during training, can we identify which ones matter before training begins, and train only those? In our attempts to answer this question, we expected that somewhat involved methods, like the Fisher information matrix, would be necessary to identify the "special" parameters that matter for learning. We were wrong.

In this report, we present preliminary findings showing that random parameter selection can match full fine-tuning performance when training only ~1% of parameters. This suggests pretrained models may contain not just one winning ticket but potentially many; we call this the Multiple Ticket Hypothesis.

This report details ongoing, small-scale work. The main reason for sharing is that we think the preliminary findings warrant discussion and are interesting enough to share with the wider community. An auxiliary reason is to solicit compute resources to scale the experiments up.

tl;dr of results

Background

Notation

Let $\theta$ denote the parameters of an LLM. We use $\theta^{(t)}$ to represent the model parameters at training step $t$, with $\theta^{(0)}$ denoting the initial pretrained model weights and $\theta_i$ to denote the i-th parameter.
During an RLVR run, gradients at step $t$, $g^{(t)}$, are computed via backpropagation: $$g^{(t)} = \nabla_\theta J_{\text{GRPO}}(\theta^{(t)})$$

General Experimental Setup

All experiments in this report are carried out on Qwen2.5-0.5B-Instruct. We train via GRPO on Kalomaze's Alphabetsort environment and use the AdamW optimizer for all RLVR runs. This work is built on Prime-Intellect's prime-rl training library.

Evaluation: We evaluate on the same Alphabetsort environment, using 512 samples with seed 2001.

Extracting sparse subnetworks

Our initial intuition:
Imagine a pretrained LLM with only two parameters, $p_1$ and $p_2$. If only one parameter, say $p_1$, has changed at the end of a training phase with some objective $\phi$, it must mean that $p_1$ is more important than $p_2$ for satisfying $\phi$ on the training set.

The question now is: how do we identify which parameters are most important for some training data $D$?

Fisher Masks Work

To identify which parameters are most important for a given task, we follow the approach laid out by Kirkpatrick et al. (2017). The authors estimate the importance of weights for a task by approximating the Fisher information matrix of the model parameters.

We approximate (the diagonal of) the Fisher matrix, $F$, on a large batch of data, for all parameters of the model: $$F_i \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \log p(x_n|\theta^{(0)})}{\partial \theta_i} \right)^2$$ where $x_n$ is drawn from the dataset $D$. A justification for this is provided in the paper, but to reiterate, the core intuition is that the magnitude of $F_i$ correlates with how important parameter $\theta_i$ is to the task represented by $D$.
In practice, we sample a large batch of data, run a forward pass and a backward pass, and take $$F_i = (\theta_i.\text{grad})^2$$
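A minimal sketch of this in PyTorch, assuming a Hugging Face-style causal LM whose forward pass returns a .loss attribute when labels are provided; the function name and batch format are illustrative, not from our actual codebase:

import torch

def fisher_diagonal(model, batch):
    # One forward/backward pass on a large batch of token IDs.
    # Assumes a Hugging Face-style causal LM: passing labels=input_ids
    # makes the forward pass return the LM loss (mean NLL).
    model.zero_grad()
    out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    out.loss.backward()
    # Squared (batch) gradients as a coarse diagonal Fisher estimate
    return {name: p.grad.detach() ** 2
            for name, p in model.named_parameters()
            if p.grad is not None}

Note that this squares the gradient of the batch-mean loss rather than averaging per-sample squared gradients, so it is a coarser estimate than the formula above.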

We can then take the top x% of parameters in $F$, set these to True and all others to False, creating a binary mask $\text{MASK} \in \{0, 1\}^N$ over all parameters.
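A minimal sketch of building the mask from the Fisher estimate above, assuming the top x% is taken globally over all parameters rather than per tensor (the fisher dictionary is the output of the fisher_diagonal sketch):

import torch

def build_fisher_mask(fisher, keep_ratio=0.01):
    # keep_ratio = 0.01 keeps the top 1% of parameters (99% sparsity)
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(scores.numel() * keep_ratio))
    # Smallest score among the top-k acts as a global threshold
    threshold = torch.topk(scores, k).values.min()
    # Ties at the threshold may keep marginally more than k entries
    return {name: f >= threshold for name, f in fisher.items()}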

Training with a mask

During training with a mask, we modify the gradient update step to only affect the masked parameters: $$\tilde{g}^{(t)} = g^{(t)} \odot \text{MASK}$$ $$\theta^{(t+1)} = \theta^{(t)} - \eta_t \cdot \mathcal{U}(\tilde{g}^{(t)}, \theta^{(t)})$$ where $\odot$ denotes element-wise multiplication, $\eta_t$ is the learning rate at step $t$, and $\mathcal{U}$ represents the optimizer's update rule (e.g., AdamW). This ensures that only the selected subnetwork is updated while the full model is still used for forward passes.

In practice, and for efficiency gains, we simply store the optimizer states for the subnetwork only.
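A minimal sketch of the simpler gradient-masking variant (not the storage-efficient version we actually use), where mask_dict maps parameter names to boolean masks; note that AdamW's decoupled weight decay would still shrink non-selected weights unless weight decay is also masked or set to zero:

def masked_step(model, optimizer, mask_dict):
    # Zero the gradients of non-selected parameters so the optimizer
    # only updates the chosen subnetwork
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(mask_dict[name])
    optimizer.step()
    optimizer.zero_grad()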

Results

We approximate $F$ using a batch of 1024 samples. We then create two masks: one at 99% sparsity, i.e. 4,940,328 of 494M parameters, and another at 99.9% sparsity, i.e. 494,032 of 494M parameters. We compare the eval results, as well as the training dynamics, in Figure 1.

We use a learning rate of $10^{-6}$ for the full finetuning run, $5 \cdot 10^{-6}$ for the 99% Fisher mask run and $10^{-5}$ for the 99.9% Fisher mask run.

train and eval dynamics
Figure 1: Metrics comparison between a full finetune run and sparse training runs

This supports our initial hypothesis: parameter-importance identification (here, via the Fisher information matrix) can pick out subnetworks that reach performance comparable to full finetuning.

The Surprise: Random Masks Also Work

Having validated the initial intuition, we wanted to establish a baseline for comparison, so we investigated random parameter selection. We generated random masks at 99% sparsity by sampling the parameters to update uniformly at random.

Generating Random Masks

The implementation is pretty straightforward: we seed a random number generator and select (100 - x)% of parameters uniformly at random to achieve x% sparsity.

We used three different seeds (0, 2001 and 42) to get different masks and ran an RLVR run with each. The results in Figure 2 use learning rates of $10^{-4}$, $5 \cdot 10^{-5}$ and $5 \cdot 10^{-5}$ respectively.

import numpy as np
import torch

# `model` is the pretrained policy; keep_ratio = 0.01 gives 99% sparsity
keep_ratio = 0.01
mask_dict = {}
active = 0

rng = np.random.default_rng(seed=42)
for name, param in model.state_dict().items():
    # Boolean mask with the same shape as the parameter, all False by default
    temp_tensor = torch.zeros_like(param, dtype=torch.bool)
    num_to_keep = int(param.numel() * keep_ratio)
    # Sample flat indices uniformly at random, without replacement
    indices = rng.choice(param.numel(), size=num_to_keep, replace=False)
    temp_tensor.view(-1)[indices] = True
    active += num_to_keep
    mask_dict[name] = temp_tensor
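One design note on the snippet above: sampling flat indices without replacement, per tensor, pins every weight matrix at exactly (100 - x)% kept parameters, rather than only in expectation as i.i.d. Bernoulli sampling would.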
        

Surprising Results

Figure 2 surprisingly shows that random parameter selection can match full fine-tuning performance. This finding challenges our initial assumption that some sophisticated parameter identification method would be necessary.

comparison of fft and random mask runs
Figure 2: Comparison of full fine-tuning (FFT) and random mask training. With appropriate learning rate tuning, random masks match or exceed full fine-tuning performance.

The learning rate puzzle

The key to making random masks, and even the Fisher masks, work is finding the right learning rate. To better understand this relationship, we swept over multiple learning rates for the random masks at 99% sparsity and present our findings in Figures 3 and 4. Some of the runs were cancelled early and therefore aren't shown: right from the start, their reward and eval curves did not improve, and it felt wasteful to keep burning compute on results we could already intuit.

Random masks perform best at higher learning rates than full finetuning (and Fisher masks). This is not dissimilar to Thinking Machines' work on LoRA.

Hyperparameter sweep on learning rate for the random mask at 150
Figure 3: Hyperparameter sweep on learning rate for the random mask at Step 150
Hyperparameter sweep on learning rate for the random mask at 300
Figure 4: Hyperparameter sweep on learning rate for the random mask at Step 300

Why Different Learning Rates?

We hypothesize that this learning-rate difference paints an interesting picture of the objective we are optimizing for and of the training dynamics with respect to the parameters of the model.

Do Different Masks Select the Same Parameters?

A natural question: are the random masks accidentally selecting the same parameters that Fisher masks identify? To answer this, we compute the Jaccard overlap between different masks, defined as $$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$
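A minimal sketch over the boolean mask dictionaries built earlier, treating each mask as the set of parameter indices it keeps:

def jaccard_overlap(mask_a, mask_b):
    # |A ∩ B| / |A ∪ B|, accumulated over all parameter tensors
    intersection = sum((mask_a[n] & mask_b[n]).sum().item() for n in mask_a)
    union = sum((mask_a[n] | mask_b[n]).sum().item() for n in mask_a)
    return intersection / union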

jaccard sim
Figure 5: Jaccard overlap between the Random masks at 99% sparsity and the Fisher mask.

The Jaccard overlap between the random masks and the Fisher mask, as shown in Figure 5, is low: about 0.5% on average. This means the random masks and the Fisher mask select almost completely different parameters, yet achieve comparable performance to full fine-tuning.

Implications: The Multiple Ticket Hypothesis

These results suggest that LLMs contain not just one but multiple viable sparse subnetworks that can be optimized for a given task, at least for the alphabet-sort task.

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) proposed that dense networks contain sparse subnetworks that can be trained to match the full network's performance. Frankle and Carbin used iterative magnitude pruning to identify a single winning ticket.

Our findings extend the original LTH to the MTH: for sufficiently over-parameterized pretrained models, there may not be just one winning ticket but potentially many, so many that even random selection is likely to find one. In other words, you can just select random parameters and train.

This explains why Fisher information masks offer no clear advantage over random selection: both methods simply need to select some viable subnetwork, and with appropriate hyperparameter tuning, either succeeds.

Caveats and Questions

These are preliminary results on a small model (Qwen2.5-0.5B) and a simple task (alphabet-sort). More questions and ideas to investigate reveal themselves.

Answering these questions requires significantly more compute than we currently have access to. If you're interested in collaborating, mentoring, or sponsoring compute, please reach out!

Acknowledgements

  1. I am super grateful to Daniel and Andreas for sponsoring compute for the initial experiments, as well as for asking really insightful questions.
  2. PrimeIntellect also cooked with the prime-rl library. It was pleasant to hack on.

Citation

If you find this work useful, please cite:

@misc{adewuyi2025lottery,
  author = {Adewuyi, Israel},
  title = {Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs},
  year = {2025},
  month = {December},
  url = {https://israel-adewuyi.github.io/blog/2025/slim-peft/},
  note = {Blog post}
}