An attempt at solving Named Entity Recognition with RL training.
This work is a preliminary experimental report and should be treated as such. See the closing thoughts for more details.
Feel free to skip to the Methods section or start from the introduction below.
Code is up at this repo.
Large Language Models (LLMs) built on the Transformer architecture have demonstrated strong performance across a wide range of natural language tasks.
Recently, researchers have revisited Reinforcement Learning (RL) as a means of adapting LLMs to specific objectives without full retraining. By defining reward functions or training reward models that are specific to the domain/task in question, RL fine-tuning can elicit desired behaviors from a pre-trained model. This approach has already shown promise in areas such as competitive mathematics and code generation.
In this work, we bring these threads together. We fine-tune a 1.5B Qwen2.5 model with RL on the NER task.
LLMs like Qwen2.5-1.5B are not typically applied to NER out of the box.
Why?
NER requires a different approach, as it involves identifying and classifying entities within text. This is a token-level task, where the goal is to label specific tokens accurately.
Imed et al.
The core idea in most of the works listed below primarily involves prompt engineering.
- GPT-NER, introduced by Wang et al.
- PromptNER, introduced by Ashok and Lipton
- GoLLIE
- CodeIE
| Method | Approach | Model size |
|---|---|---|
| GPT-NER | ICL | Text-davinci-003 |
| LTNER | ICL | GPT-3.5-turbo |
| PROMPT-NER | ICL | GPT-4 |
| CodeIE | ICL | Code-davinci-002 |
| Code4UIE | ICL | Text-davinci-003 |
| GPT-NER | ICL + SFT | Text-davinci-003 |
| GoLLIE | SFT | Code-LLaMA 34B |
| InstructUIE | SFT | Flan-T5-11B |
The CoNLL2003 dataset, introduced by Eric et al., was used in this work. A sample from the dataset is shown below:
{ "context": "4 - Goran Ivanisevic ( Croatia ) beat Scott Draper ( Australia ) 6-7 ( 1-7 ) 6-3 6-4 6-4", "end_position": [ 3, 9 ], "entity_label": "PER", "impossible": false, "qas_id": "174.2", "query": "person entities are named persons or family.", "span_position": [ "2;3", "8;9" ], "start_position": [ 2, 8 ] }
- `context` - the input text from which entities are to be extracted.
- `entity_label` - the entity to be extracted.
- `query` - an explanation of the entity to be extracted.
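To make the span fields concrete, here is a minimal sketch (my own illustration, not code from the repository) of recovering the gold entity strings from a sample, assuming `span_position` entries are inclusive, whitespace-token offsets of the form `"start;end"`:

```python
# Minimal sketch: recover gold entity strings from an MRC-style sample.
# Assumes "span_position" entries are inclusive whitespace-token offsets ("start;end").
sample = {
    "context": "4 - Goran Ivanisevic ( Croatia ) beat Scott Draper ( Australia ) 6-7 ( 1-7 ) 6-3 6-4 6-4",
    "entity_label": "PER",
    "span_position": ["2;3", "8;9"],
}

def gold_entities(sample: dict) -> list[str]:
    tokens = sample["context"].split()
    entities = []
    for span in sample["span_position"]:
        start, end = map(int, span.split(";"))
        entities.append(" ".join(tokens[start : end + 1]))
    return entities

print(gold_entities(sample))  # ['Goran Ivanisevic', 'Scott Draper']
```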
""" A conversation between User and Assistant. The User provides a string of words. The task of the Assistant is to identify all the {entity_label} entities in the given string and return the entities surrounded by an entity tag. DESCRIPTION: {query} The reasoning process should be enclosed within <think> </think> tags, and the relevant words should be enclosed within <entity> </entity> tags. i.e <think> reasoning process here </think> <entity> comma separated list of words that are locations</entity> {example} User: {context} Assistant: """
- `entity_label` - the entity to be extracted.
- `query` - an explanation of the entity to be extracted (same as in the dataset).
- `example` - few-shot examples for the current `entity_label`.
- `context` - the input text from which entities are to be extracted.
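For illustration, a minimal sketch of how these fields might be substituted into the template with Python's `str.format`; `PROMPT_TEMPLATE` is an abbreviated stand-in for the full template above, and `build_prompt` and the example `query` string are hypothetical, not the repository's code:

```python
# Abbreviated stand-in for the full template shown above (illustrative only).
PROMPT_TEMPLATE = (
    "A conversation between User and Assistant. The User provides a string of words. "
    "The task of the Assistant is to identify all the {entity_label} entities in the "
    "given string and return the entities surrounded by an entity tag.\n"
    "DESCRIPTION: {query}\n"
    "{example}\n"
    "User: {context}\n"
    "Assistant:"
)

def build_prompt(entity_label: str, query: str, example: str, context: str) -> str:
    # Fill the placeholders with the per-sample fields described above.
    return PROMPT_TEMPLATE.format(
        entity_label=entity_label, query=query, example=example, context=context
    )

prompt = build_prompt(
    entity_label="LOC",
    query="location entities are politically or geographically defined regions.",
    example="",  # few-shot examples for LOC would be inserted here
    context="Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .",
)
```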
The prompt construction method used in this work relies heavily on, and closely mirrors, the fantastic prior work in the research literature. An overview of the relevant parts is provided below:
The preprocessing step follows the approach of Eric et al.
To generate in-context examples, we randomly sample 2 sentences from the dev set of the CoNLL dataset for each entity type. Deepseek-V3-base is prompted to solve the task, and the result is manually inspected and included in the prompt without any modifications.
As visible from the prompt template, the LLM is prompted to output the CoT inside the think tag and then the list of entities in the entity tag.
As an example, consider the following. The LLM is prompted to identify
the location entity, LOC
in the following text:
Japan began the defence of their Asian Cup title with a lucky 2-1
win against Syria in a Group C championship match on Friday .
Result from Deepseek-V3-base:
<think>
1. **Japan**: This is a country in East Asia, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity.
2. **Asian Cup**: This refers to a football tournament, not a specific geographic or political location, so it does not qualify as a LOCATION entity.
3. **Syria**: This is a country in the Middle East, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity.
4. **Group C**: This refers to a group in the tournament, not a geographic or political location, so it does not qualify as a LOCATION entity.
5. The rest of the words ("began", "the", "defence", "of", "their", "title", "with", "a", "lucky", "2-1", "win", "against", "in", "a", "championship", "match", "on", "Friday") are not location names and do not fit the criteria.
</think>
<entity>Japan, Syria</entity>
To guide the model to output responses in the desired format, we utilize GRPO. For each prompt, the policy samples m outputs, which are referred to as a group, and the average outcome reward from the group serves as an estimate for the baseline.
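As a rough sketch of the group-relative baseline (my own illustration of the idea, not the TRL internals), the advantage of each sampled output is its reward minus the mean reward of its group, commonly normalized by the group's standard deviation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one group of m sampled outputs.

    rewards: shape (m,), scalar reward of each output in the group.
    The group mean acts as the baseline; dividing by the group std is a
    common normalization choice, though implementations vary here.
    """
    baseline = rewards.mean()
    return (rewards - baseline) / (rewards.std() + eps)

# Example: 8 sampled completions for one prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))
```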
For each output in the group, four reward functions were designed to provide signals to the policy model. The design of the reward functions was inspired by Willccb's amazing GRPO demo.
One of these is a format reward, which checks that the completion follows the expected structure (`<think>...</think> <entity>...</entity>`), ensuring structural consistency.
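A minimal sketch of what such a format reward could look like; the regex and the reward value are my assumptions, not necessarily the ones used in the repository:

```python
import re

# Illustrative format reward: a positive score if the completion matches the
# expected <think>...</think> <entity>...</entity> structure, zero otherwise.
FORMAT_PATTERN = re.compile(r"^\s*<think>.*?</think>\s*<entity>.*?</entity>\s*$", re.DOTALL)

def format_reward(completions: list[str]) -> list[float]:
    return [0.5 if FORMAT_PATTERN.match(c) else 0.0 for c in completions]
```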
These rewards work together to encourage the model to identify entities accurately while adhering to the expected format. For example, if the model correctly identifies “Japan, Syria” as Locations but includes an extra incorrect entity, it receives a positive reward for the correct entities but a penalty for the mistake.
Example reward calculation:
- Input: “Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria.”
- Ground truth: `<entity>Japan, Syria</entity>`
- Model output: `<entity>Japan, Syria, Asian Cup</entity>`
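A hedged sketch of how a correctness reward could score this case; the exact weighting (+1 per correct entity, -0.5 per spurious one here) is an assumption for illustration, not necessarily the value used in training:

```python
import re

def extract_entities(completion: str) -> set[str]:
    """Pull the comma-separated entity list out of the <entity> tag."""
    match = re.search(r"<entity>(.*?)</entity>", completion, re.DOTALL)
    if not match:
        return set()
    return {e.strip() for e in match.group(1).split(",") if e.strip()}

def correctness_reward(completion: str, gold: set[str]) -> float:
    # Illustrative weights: reward correct entities, penalize spurious ones.
    predicted = extract_entities(completion)
    return 1.0 * len(predicted & gold) - 0.5 * len(predicted - gold)

gold = {"Japan", "Syria"}
output = "<think>...</think> <entity> Japan, Syria, Asian Cup </entity>"
print(correctness_reward(output, gold))  # 2.0 for the hits, -0.5 for "Asian Cup" -> 1.5
```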
As stated earlier, Qwen2.5-1.5B-Instruct was utilized in this experiment. The TRL library provided by HuggingFace, alongside modifications to Willccb's GRPO demo repository, was used to construct the training and evaluation pipeline. The model was trained for 4 epochs on 2 Nvidia A100 GPUs. Relevant hyperparameters are:
```python
training_args = GRPOConfig(
    ...
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    per_device_train_batch_size=4,
    num_generations=8,
    max_prompt_length=2048,
    max_completion_length=2048,
    num_train_epochs=4,
    save_strategy="epoch",
    max_grad_norm=0.1,
    report_to="wandb",
    log_on_each_node=False,
)
```
More details can be found in the repository.
F1 score is reported for the RL-trained model.
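For reference, a minimal sketch of entity-level F1 over predicted vs. gold entity sets for a single example; the evaluation code in the repository may aggregate differently (e.g. micro-averaging over the whole test set):

```python
def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Entity-level F1 for a single example; 0.0 when undefined."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score({"Japan", "Syria", "Asian Cup"}, {"Japan", "Syria"}))  # 0.8
```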
The chart below shows the F1 score of the fine-tuned model in comparison with other methods described in the relevant works section.
The chart below compares the model sizes across all the methods listed above
Notes:
The chart below shows the result of evaluating a saved checkpoint from each epoch, as well as the base Qwen2.5-1.5B model with and without few-shot examples.
This work explored, albeit in a limited scope, how much RL training can improve the performance of LLMs on the NER task, and the results show impressive performance for small-sized LLMs.
It's important to emphasize (and these are my current thoughts) that RL training benefits from a good base pretrained model, as the current RL training paradigms encourage exploitation more than they do exploration. It's useful to think of the current RL training paradigm as stabilizing the distribution over the domain of interest: if the model cannot sample the answer under any inference sampling, even as the number of generations tends to infinity, then most likely the behavior cannot be learned during RL training. This intuition is informed by the performance of LLMs on math and code tasks under base + SFT + RL versus only base + RL.
Future work would investigate other NER datasets (in other domains), how much performance is lost or retained on other tasks/benchmarks of interest, different reward structures, and how small the models can be while still achieving competitive performance, especially in real-world applications, where the tradeoff between efficiency and correctness is often the focal point.
It would also be interesting to do interpretability work on these models: what about the model changes when it is RL-trained?
Gratitude goes to the Institute of Software Engineering at Innopolis University, led by Professor Vladimir Ivanov, for providing compute. I am also grateful to the following persons:
Cite as:
```bibtex
@article{israel2025ner_rl,
  title   = "Reinforcement Learning Meets NER",
  author  = "Israel, Adewuyi",
  journal = "israel-adewuyi.github.io",
  year    = "2025",
  month   = "May",
  url     = "https://israel-adewuyi.github.io/blog/2025/ner_with_rl/"
}
```