Reinforcement Learning Meets NER

an attempt at solving Named Entity Recognition with RL training.

TLDR

This work represents preliminary experimental reports, and should be treated as such. See the closing thoughts for more details.

Via RL training, we achieved up to 7.3% increase in F1 score on a 1.5B model, on an NER task, compared to 175B GPT3-based baselines.
The RL trained model underperforms some other approaches, mostly involving some form of SFT.
We offer some closing thoughts on this as well as future possible directions of research.

Feel free to skip to Methods section or start from the introduction below.

Code is up at this repo

Introduction

Large Language Models (LLMs) built on the Transformer architecture have transformed Natural Language Processing (NLP), achieving SOTA results in tasks such as text generation, translation, and sentiment analysis . At the same time, Named Entity Recognition (NER)—the process of identifying and classifying proper names and other key terms in text—remains a core NLP task in applications like information extraction, question answering, machine translation .

Recently, researchers have revisited Reinforcement Learning (RL) as a means of adapting LLMs to specific objectives without full retraining. By defining reward functions or training reward models that is specific to the domain/task in question, RL fine‑tuning can elicit desired behaviors from a pre‑trained model. This approach has already shown promise in areas such as competitive mathematics and code generation .

In this work, we bring these threads together. We fine‑tune a 1.5B Qwen2.5 model on the CoNLL2003 NER , using carefully designed reward signals to guide entity recognition performance. Our results demonstrate that, even with modest model size, RL‑based adaptation can rival much larger architectures such as GPT‑3, highlighting the potential of reinforcement learning for structured NLP tasks.

Background

Why LLMs suck at NER

LLMs like Qwen2.5-1.5 , Llama , Gemini , e.t.c are pretrained on massive datasets with the objective being to predict the next token in a sequence. This makes them great for tasks like text generation but less effective for NER.

Why?

NER requires a different approach, as it involves identifying and classifying entities within text. This is a token-level task, where the goal is to label specific tokens accurately.

Imed et. al. provided a comprehensive overview of the advances in the field of NER, but most relevant approaches to this work, which are summarized below, are approaches that leverage LLMs to solve NER task.

Summary of related approaches

The core idea in most of the works listed below involves primarily prompt engineering Which is just a fancy way of saying almost all of them reformulated the task in some way that the LLM might be able to solve easier. and some combination of In Context Learning (ICL) and Supervised Finetuning (SFT). In the ICL paradigm, LLMs learn new tasks by being shown a few examples (few-shot) in the prompt which makes them flexible for new tasks without extra training while in the SFT paradigm, an LLM is further trained on specific, labeled data to make it better at the said task. It's akin to fine-tuning a general tool for a specific job.

GPT-NER introduced by Wang et al and LTNER introduced by Yan et al reformulated the sequence labelling task as a text generation task by prompting the LLM to generate the input text with the identified entities marked by special tokens. Both methods rely heavily on ICL. For retrieving these few-shot examples, GPT-NER investigated various strategies including embeddings derived from a fine-tuned NER model while LTNER utilizes vector-based retrieval from a knowledge base to find the most relevant examples for contextual learning. Additionally, GPT-NER introduced a self-verification strategy to combat hallucination.

PromptNER introduced by Ashok and Lipton kept the task as a sequence labelling task By prompting the LLM to list the entities in the text, given a predefined list of entities, but they introduced Chain-of-Thought (CoT) prompting, as well as giving an explanation of all the predefined entity types.

GoLLIE introduced in and InstructUIE introduced in both proposed instruction tuning frameworks for information retrieval using LLMs. GoLLIE fine-tunes an LLM to follow annotation guidelines, with tasks and guidelines represented in a code-based format. In contrast, InstructUIE employs natural language instructions within a unified text-to-text framework to model various IE tasks.

CodeIE and Code4UIE also transform the sequence labelling task into a code generation task to leverage the code generation capabilities of LLMs.

Method	Approach	Model size
GPT-NER	ICL	Text-davinci-003
LTNER	ICL	GPT-3.5-turbo
PROMPT-NER	ICL	GPT4
CodeIE	ICL	Code-davinci-002
Code4UIE	ICL	Text-davinci-003
GPT-NER	ICL + SFT	Text-davinci-003
GoLLIE	SFT	Code-LLaMA 34B
InstructUIE	SFT	Flan-T5-11B

Method

Dataset

The CoNLL2003 introduced by Eric et. al. has four types of named entities: Location (LOC), Organization (ORG), Person (PER), and Miscellaneous (MISC). We leveraged the preprocessed NER dataset by Li et al. (2019a) . A sample from the dev set which is downloadable from their github repo looks like

  {
    "context": "4 - Goran Ivanisevic ( Croatia ) beat Scott Draper ( Australia ) 6-7 ( 1-7 ) 6-3 6-4 6-4",
    "end_position": [
      3,
      9
    ],
    "entity_label": "PER",
    "impossible": false,
    "qas_id": "174.2",
    "query": "person entities are named persons or family.",
    "span_position": [
      "2;3",
      "8;9"
    ],
    "start_position": [
      2,
      8
    ]
  }

context - the input text from which entities are to be extracted.
entity_label - the entity to be extracted
query - an explanation of the entity to be extracted.

Prompt

  """
  A conversation between User and Assistant. The User provides a string of words. 
  The task of the Assistant is to identify all the {entity_label} entities 
  in the given string and return the entities surrounded by an entity tag.
  DESCRIPTION: {query}
  
  The reasoning process should be enclosed within <think> </think> tags, 
  and the relevant words should be enclosed within <entity> </entity> tags.
  i.e <think> reasoning process here </think> <entity> comma separated 
  list of words that are locations</entity>
  
  {example}
  
  User: {context}
  Assistant: 
  """

entity_label - the entity to be extracted
query - an explanation of the entity to be extracted - same as in the dataset.
example - few shot examples for the current entity_label
context - the input text from which entities are to be extracted.

The prompt construction method used in this work relies heavily and closely mirrors the fantastic work done by in the research literature. An overview of the relevant parts are provided below:

Task description

Following the preprocessing step done by Eric et. al. and Wang et al. , for each input sentence, N prompts are constructed where N is the number of entities in the dataset N = 4 for CoNLL2003. Following the PromptNER paper, we asked the model to generate a CoT inside the think tag and we also added an explanation for each entity type to the prompt.

Few shot demonstrations

To generate in-context examples, we randomly sample 2 sentences from the dev set of the CoNLL dataset for each entity type Deepseek-V3-base is prompted to solve the task and the result is manually inspected and included in the prompt, without any modifications.

LLM Output Format.

As visible from the prompt template, the LLM is prompted to output the CoT inside the think tag and then the list of entities in the entity tag.

As an example, consider the following. The LLM is prompted to identify the location entity, LOC in the following text: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday . Results from Deepseek-V3-base :

  <think>  1. **Japan**: This is a country in East Asia, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity. 
  2. **Asian Cup**: This refers to a football tournament, not a specific geographic or political location, so it does not qualify as a LOCATION entity. 
  3. **Syria**: This is a country in the Middle East, which is a politically and geographically defined location. It fits the criteria for a LOCATION entity. 
  4. **Group C**: This refers to a group in the tournament, not a geographic or political location, so it does not qualify as a LOCATION entity. 
  5. The rest of the words ("began", "the", "defence", "of", "their", "title", "with", "a", "lucky", "2-1", "win", "against", "in", "a", "championship", "match", "on", "Friday") are not location names and do not fit the criteria. 
  </think>  
  <entity> Japan, Syria</entity>

RL and Reward Design

To guide the model to output things in the desired format, we utilize GRPO which discards the critic model which is meant to provide some baseline for the advantage estimates. Instead, for each prompt, GRPO samples m outputs, which are referred to as group and the average outcome reward from the group serves as an estimate for the baseline.

For each output in the group, four reward functions were designed to provide signals to the policy model. The design of the reward function was inspired by the amazing work of Willccb's grpo demo.

Soft Format Reward: Awards 0.5 if the output follows the required format (<think>...</think> <entity>...</entity>), ensuring structural consistency.
Correctness Reward: Gives 2.0 if the extracted entities exactly match the ground truth, emphasizing accuracy.
Positive Entity Correctness Reward: Adds 0.5 for each correctly identified entity, rewarding partial correctness.
Negative Entity Correctness Reward: Subtracts 0.5 for each incorrectly included entity, penalizing overprediction.

These rewards work together to encourage the model to identify entities accurately while adhering to the expected format. For example, if the model correctly identifies “Japan, Syria” as Locations but includes an extra incorrect entity, it receives a positive reward for the correct entities but a penalty for the mistake.

Example Reward Calculation:

Input: “Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria.”

Ground Truth: <entity>Japan, Syria</entity>

Model Output: <entity>Japan, Syria, Asian Cup</entity>

Soft Format: 0.5 (correct format)
Correctness: 0.0 (not an exact match)
Positive Entity: 1.0 (0.5 for Japan + 0.5 for Syria)
Negative Entity: -0.5 (penalty for Asian Cup)

Experiment

As stated earlier, Qwen2-1.5B-Instruct was utilized in this experiment. TRL library provided by HuggingFace, alongsides modifications to Willccb's grpo demo repository, were used to construct a training and eval pipeline. The model was also trained on 4 epochs on 2 Nvidia A100 GPUs. Relevant hyperparameters are:

  training_args = GRPOConfig(
    ...
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    bf16=True,
    per_device_train_batch_size=4,
    num_generations=8,
    max_prompt_length=2048,
    max_completion_length=2048,
    num_train_epochs=4,
    save_strategy="epoch",
    max_grad_norm=0.1,
    report_to="wandb",
    log_on_each_node=False,
  )

More details can be found in the repository.

Evaluation

F1 score is reported for the RL-trained model.

To compare with other methods, the best F1 score evaluation across 4 epochs is reported.
VLLM was used for sampling from the transformers.
To ensure robustness, F1 score across 16 generations are averaged and reported, both in comparison to other methods and in comparison across epochs.
To evaluate the RL-trained model, few shot examples were included in the prompt.

Results / Charts

F1 score comparison across approaches

The chart below shows the F1 score of the fine-tuned model in comparison with other methods described in the relevant works section.

Model sizes comparison

The chart below compares the model sizes across all the methods listed above

Notes:

Parameter count of code-davinci-003 isn't publicly known, but inferring from the naming convention, it's based on the GPT3 model, which is reported to be 175B.
GPT3.5-Turbo also doesn't have it's parameter count publicly known. But based on the the GPT3 model count, 175B is conservative.
The same logic applies to GPT4. Though, it's rumoured to be around 1.7T parameters. So 175 is also conservative.

F1 score comparison across epochs

The chart below shows the result of evaluating a saved checkpoint from each epoch as well as the base Qwen1.5B model with few-shot examples or without.

Closing thoughts

This work explored, albeit, in a limited scope, how much RL training can improve the performance of LLMs on NER task and the results show impressive performance for small-sized LLMs.

It's important to emphasize (and this are my current thoughts) that RL training benefits from a good base pretrained model, as the current RL training paradigm encourage exploitation more than they do exploration. It's useful to think of the current RL training paradigm as stabilizing the distribution over the domain of interest, so if the model cannot, under any inference sampling, over number of generations that tend to infinity, sample the answer, most likely it cannot be learned during an RL training paradigm. This intuition is informed by the performance of LLMs on math and code tasks under base + SFT + RL vs only base and RL.

Future work would investigate other NER datasets (in other domains), how much performance is lost or retained in other tasks/benchmarks of interest, different structures to reward model as well as how small can the models be to get competitive performance, especially in real world applications, where the tradeoff of efficiency and correctness is often the focal point.

It would also be interesting to do Interpretability on these models. What about the model changes when it's RL trained

Acknowledgements

Gratitude goes to the Institute of Software Engineering at Innopolis university, led by Professor Vladmir Ivanov for providing compute. I am also grateful to the following persons:

Ilnur Khadiev for helping out with fixing GPU-related issues every so often.
Nursultan Abdullaev for providing feedback on an initial draft.

Citation information

Cite as:

          Israel, Adewuyi. (May, 2025). Reinforcement Learning meets NER
          https://israel-adewuyi.github.io/blog/2025/ner_with_rl/.
        


  @article{israel2025ner_rl,
    title   = "Reinforcement Learning Meets NER",
    author  = "Israel, Adewuyi",
    journal = "israel-adewuyi.github.io",
    year    = "2025",
    month   = "May",
    url     = "https://israel-adewuyi.github.io/blog/2025/ner_with_rl/"
  }