Preliminary evidence that random parameter selection can match full-parameter RL fine-tuning.
Reinforcement learning fine-tuning is a new axis of scale for increasing the performance of Large Language Models (LLMs), with labs scaling RL compute to levels on par with pretraining. Recent works have also attempted to shed light on how and why RL really works. Important to this report, Mukherjee et al. (2025) showed that RL fine-tuning updates only a sparse subnetwork of the model's parameters.
In this report, we present preliminary findings showing that random parameter selection can match full fine-tuning performance when training only ~1% of parameters. This suggests that pretrained models may contain not just one winning ticket but potentially many; we call this the Multiple Ticket Hypothesis.
This report details ongoing work at a small scale, and the main reason for sharing it is that we think these preliminary findings warrant discussion and are interesting enough to be shared with the wider community.
Let $\theta$ denote the parameters of an LLM. We use $\theta^{(t)}$ to represent the model parameters at training step $t$, with $\theta^{(0)}$ denoting the initial pretrained model weights and $\theta_i$ denoting the $i$-th parameter.
During an RLVR run, gradients at step $t$, $g^{(t)}$, are computed via backpropagation: $$g^{(t)} = \nabla_\theta J_{\text{GRPO}}(\theta^{(t)})$$
All experiments in this report are carried out on Qwen2.5-0.5B-Instruct, trained via GRPO on Kalomaze's Alphabetsort environment. We use the AdamW optimizer for all RLVR runs, and the training code is built on Prime-Intellect's RL training library.
Evaluation: We evaluate on the same Alphabetsort environment, using 512 samples with seed 2001.
Our initial intuition:
Imagine a pretrained LLM with only 2 parameters, $p_1$ and $p_2$. If only one parameter, say $p_1$, is changed at the end of a training phase with some optimization function $\phi$, it must mean that $p_1$ is more important than $p_2$ at satisfying $\phi$ on the training set.
The question now is: how do we identify which parameters are most important for some training data $D$?
To identify which parameters are most important for a given task, we follow the approach laid out by Kirkpatrick et al. (2017) and approximate the diagonal of the Fisher information matrix, $F$. In practice, we sample a large batch of data, run a forward pass and a backward pass, and compute $$F_i = (\theta_i.\text{grad})^2,$$ i.e., the squared gradient accumulated on each parameter.
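As a concrete illustration, here is a minimal sketch of this diagonal-Fisher approximation in PyTorch; model, loss_fn, and big_batch are hypothetical stand-ins for the actual training setup, not our exact code:

import torch

model.zero_grad()
loss = loss_fn(model, big_batch)   # forward pass over a large batch of task data
loss.backward()                    # backward pass populates param.grad

fisher = {}
for name, param in model.named_parameters():
    # Diagonal Fisher approximation: the squared gradient of each parameter
    fisher[name] = param.grad.detach() ** 2 if param.grad is not None else torch.zeros_like(param)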
We can then take the top x% of parameters by their value in $F$, set these entries to True and all others to False, thus creating a binary mask $\text{MASK} \in \{0, 1\}^N$ over all parameters.
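A minimal sketch of turning these Fisher scores into such a mask, assuming the fisher dictionary from the sketch above and a keep fraction keep_ratio; the global-threshold approach here is our illustration, not necessarily the exact implementation:

import torch

keep_ratio = 0.01  # keep the top 1% of parameters, i.e. 99% sparsity

# Find the global score threshold across all parameters
all_scores = torch.cat([f.flatten() for f in fisher.values()])
k = int(all_scores.numel() * keep_ratio)
threshold = torch.topk(all_scores, k).values.min()

# Entries with Fisher score >= threshold become True (trainable), the rest False (frozen)
fisher_mask = {name: f >= threshold for name, f in fisher.items()}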
During training with a mask, we modify the gradient update step to only affect the masked parameters: $$\tilde{g}^{(t)} = g^{(t)} \odot \text{MASK}$$ $$\theta^{(t+1)} = \theta^{(t)} - \eta_t \cdot \mathcal{U}(\tilde{g}^{(t)}, \theta^{(t)})$$ where $\odot$ denotes element-wise multiplication, $\eta_t$ is the learning rate at step $t$, and $\mathcal{U}$ represents the optimizer's update rule (e.g., AdamW). This ensures that only the selected subnetwork is updated while the full model is still used for forward passes.
In practice, and for efficiency, we store optimizer states only for the parameters in the subnetwork.
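A minimal sketch of the masked update, inserted between the backward pass and the optimizer step; the exact hook point and variable names are assumptions:

# After loss.backward() and before optimizer.step()
for name, param in model.named_parameters():
    if param.grad is not None and mask_dict.get(name) is not None:
        param.grad.mul_(mask_dict[name])   # g_tilde = g * MASK: zero out frozen parameters' gradients

optimizer.step()       # AdamW then only moves the selected subnetwork
optimizer.zero_grad()

Note that simply zeroing gradients does not stop AdamW's decoupled weight decay from touching frozen parameters, which is one more reason to keep the optimizer and its states restricted to the masked parameters, as described above.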
We approximate $F$ using a batch of 1024 samples. We then create two masks, one at 99% sparsity (4,940,328 of 494M parameters trainable) and another at 99.9% sparsity (494,032 of 494M parameters trainable). We compare the eval results, as well as the training dynamics, in Figure 1.
We use a learning rate of $10^{-6}$ for the full fine-tuning run, $5 \cdot 10^{-6}$ for the 99% Fisher-mask run, and $10^{-5}$ for the 99.9% Fisher-mask run.
This supports our initial hypothesis that parameter-importance identification (via the Fisher information matrix) can pick out subnetworks that reach levels of performance comparable to full fine-tuning.
Having validated the initial intuition, we wanted to establish a baseline for comparison and investigated random parameter selection. We generated random masks at 99% sparsity by uniformly sampling which parameters to update.
The implementation is pretty straightforward. We seed a random number generator and select (100 - x)% of parameters uniformly at random to achieve x% sparsity.
We used three different seeds (0, 2001, and 42) to generate different masks and ran an RL run with each. The results in Figure 2 use learning rates of $10^{-4}$, $5 \cdot 10^{-5}$, and $5 \cdot 10^{-5}$, respectively.
import numpy as np
import torch

keep_ratio = 0.01  # keep 1% of parameters, i.e. 99% sparsity
rng = np.random.default_rng(seed=42)

mask_dict = {}
active = 0
for name, param in model.state_dict().items():
    if param is None:
        mask_dict[name] = None
        continue
    # Start with an all-False mask and switch on a random subset of entries
    temp_tensor = torch.zeros_like(param, dtype=torch.bool)
    num_to_generate = int(param.numel() * keep_ratio)
    # Sample flat indices uniformly at random, without replacement
    indices = rng.choice(param.numel(), size=num_to_generate, replace=False)
    temp_tensor.view(-1)[indices] = True
    active += num_to_generate
    mask_dict[name] = temp_tensor
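Here, mask_dict maps each parameter name to a boolean tensor with the same shape as that parameter, and active tracks the total number of trainable entries; the masks are then applied element-wise to the gradients exactly as in the masked update step described earlier.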
Figure 2 surprisingly shows that random parameter selection can match full fine-tuning performance. This finding challenges our initial assumption that some sophisticated parameter identification method would be necessary.
The key to making the random masks, and even the Fisher mask, work is finding the right learning rate. We swept over multiple learning rates for the random masks at 99% sparsity.
Random masks perform best at higher learning rates than full fine-tuning (and the Fisher masks). This is not dissimilar to Thinking Machines' work on LoRA.
We hypothesize that this learning-rate difference paints an interesting picture of the objective we are optimizing and of the training dynamics with respect to the model's parameters. Some of our hypotheses are:
A natural question: are the random masks accidentally selecting the same parameters that Fisher masks identify? To answer this, we compute the Jaccard overlap between different masks, defined as $$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$
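A minimal sketch of this computation over two mask dictionaries of the kind built above, assuming matching keys and shapes:

def jaccard_overlap(mask_a: dict, mask_b: dict) -> float:
    # Treat each boolean mask as the set of selected parameter indices
    intersection, union = 0, 0
    for name in mask_a:
        a, b = mask_a[name], mask_b.get(name)
        if a is None or b is None:
            continue
        intersection += (a & b).sum().item()   # selected by both masks
        union += (a | b).sum().item()          # selected by either mask
    return intersection / union

# e.g. jaccard_overlap(random_mask, fisher_mask)
# reported overlap between random and Fisher masks is ~0.5% on average (Fig. 5)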
The Jaccard overlap between the random masks and the Fisher mask, as shown in Fig. 5, is low: about 0.5% on average. This means the random masks and the Fisher mask select almost completely different parameters, yet achieve comparable performance to full fine-tuning.
These results suggest that, at least for the Alphabetsort task, LLMs contain not just one but multiple viable sparse subnetworks that can be optimized for the task.
The Lottery Ticket Hypothesis (LTH) of Frankle & Carbin (2019) states that dense, randomly initialized networks contain sparse subnetworks ("winning tickets") that can be trained in isolation to match the performance of the full network.
Our findings extend the original LTH to the Multiple Ticket Hypothesis (MTH):
For sufficiently over-parameterized pretrained models, there may not be just one winning ticket, but potentially many winning tickets, so many that even random selection is likely to find one.
i.e., you can just select random parameters and train.
This would explain why the Fisher information masks offer no clear advantage over random selection: both methods simply need to select some viable subnetwork, and with appropriate hyperparameter tuning, either succeeds.
These are preliminary results on a small model (Qwen2.5-0.5B) and a simple task (Alphabetsort). More questions and ideas to investigate reveal themselves:
Answering these questions requires significantly more compute than we currently have access to. If you're interested in collaborating, mentoring or sponsoring compute, please reach out!!
If you find this work useful, please cite:
@misc{adewuyi2025lottery,
  author = {Adewuyi, Israel},
  title = {Beyond the Lottery Ticket: Multiple Winning Subnetworks in Pretrained LLMs},
  year = {2025},
  month = {December},
  url = {https://israel-adewuyi.github.io/blog/2025/slim-peft/},
  note = {Blog post}
}