GRPO (Group Relative Policy Optimization), explained

Jun 11, 2026

GRPO is the reinforcement learning method behind DeepSeek-R1 and a wave of reasoning models. Its trick: throw away PPO’s value network and score each response against a group of samples for the same prompt. Here is how it works, how it compares to PPO, and where it runs out of road.

What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm for training large language models. It was introduced by DeepSeek in the DeepSeekMath paper (2024) and brought to wide attention by DeepSeek-R1 (January 2025), which used it to train a frontier reasoning model largely through RL.

The core idea is simple. GRPO is closely related to PPO (Proximal Policy Optimization), but it removes PPO’s separately trained value model, or critic. Instead of asking a learned network to estimate how good a response is, GRPO samples a group of responses to the same prompt, scores each one, and judges every response relative to the group average. A response that beats the group mean is reinforced; one that lags behind is discouraged.

2024: Introduced in the DeepSeekMath paper
2025: Scaled by DeepSeek-R1, a watershed for RL reasoning
Group-relative: advantage from a group of samples, not a learned value function

How GRPO works

For each training prompt, GRPO runs four steps:

1. Sample a group. The current policy generates several responses to the same prompt (a group, often 8 to 64 samples).
2. Score each response. A reward signal scores every response. For verifiable tasks like math or code this can be a simple correctness check (often called RLVR, reinforcement learning from verifiable rewards); elsewhere it can be a reward model.
3. Compute group-relative advantage. Each response’s reward is normalized against the group (subtract the group mean, divide by the group standard deviation). That normalized score becomes the advantage, with no critic network involved.
4. Update the policy. Using a PPO-style clipped objective and a KL penalty toward a reference model, the policy is nudged toward responses that scored above the group average.
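The scoring and update steps can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's implementation: the function names, the `eps` stabilizer, the `clip_eps` and `beta` defaults, and the choice of KL estimator are assumptions made for the example, and a real trainer applies these terms per token across a batch rather than per response.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one response (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def kl_estimate(logp_ref, logp_new):
    """One common non-negative KL estimator: exp(d) - d - 1, d = logp_ref - logp_new."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0

def grpo_objective(logp_new, logp_old, logp_ref, advantage, beta=0.04):
    """Clipped surrogate minus a KL penalty toward the reference model."""
    return clipped_term(logp_new, logp_old, advantage) - beta * kl_estimate(logp_ref, logp_new)

# Four sampled responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses end up with a positive advantage and incorrect ones with a negative advantage. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no policy gradient signal for the step.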

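Because the baseline comes from the group rather than a learned value function, GRPO needs only two model copies in memory (the policy and a frozen reference) instead of PPO’s three (policy, reference, and value model); a learned reward model, where one is used, adds one more copy to both counts. Dropping the critic is a meaningful cut in compute and memory for large models, because the critic is typically comparable in size to the policy and has to be trained alongside it.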
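A back-of-the-envelope calculation shows the scale of the saving. The sketch below counts weights only and rests on assumptions not stated above: a hypothetical 7B-parameter model, 2 bytes per parameter (bf16), and a critic the same size as the policy. It ignores gradients, optimizer state, and activations, which make a trainable critic more expensive than a frozen copy.

```python
def weights_gib(params_billion, copies, bytes_per_param=2):
    """Memory for model weights only, in GiB (2 bytes/param assumes bf16)."""
    return params_billion * 1e9 * bytes_per_param * copies / 2**30

grpo_gib = weights_gib(7, copies=2)  # policy + frozen reference
ppo_gib = weights_gib(7, copies=3)   # policy + reference + value model

print(f"GRPO weights: {grpo_gib:.1f} GiB, PPO weights: {ppo_gib:.1f} GiB")
```

Under these assumptions PPO holds 1.5 times the weight memory of GRPO before any training state is counted.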

GRPO vs. PPO

How GRPO differs from the PPO setup it is based on.

Dimension	PPO	GRPO
Advantage estimate	Learned value model (critic)	Group-relative: reward normalized within a group of samples
Models in memory	Policy, reference, value model	Policy and reference only
Compute / memory	Higher (extra critic to train)	Lower (no critic)
Best fit	General RL, dense or shaped rewards	Verifiable rewards, reasoning, math, code
Main cost	Training and tuning the critic	Sampling a group per prompt

Why GRPO matters

GRPO made large-scale RL for reasoning practical. By dropping the critic, it removed a major source of instability and cost in RLHF-style pipelines, and it pairs naturally with verifiable rewards, where correctness is cheap to check and hard to game. DeepSeek-R1 showed that this recipe could push reasoning performance to the frontier, and GRPO has since become a default starting point for teams training reasoning and agentic models.

Where GRPO struggles: long-horizon credit assignment

GRPO works beautifully when a task is a single response with one verifiable outcome. It struggles when a task is a long sequence of actions, which is exactly the setting of modern AI agents.

The reason is the credit assignment problem. In the usual outcome-reward setup, GRPO included, a single reward arrives at the end of a long task and is applied to the whole trajectory. If an agent takes 50 steps and fails, every step receives the same negative signal, even the steps that were correct. The model cannot tell which action actually caused the failure, so the gradient is noisy and training on long, multi-step tasks becomes unstable, often converging slowly or not at all without extra structure.
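The effect is easy to see in code. In this minimal sketch (the function name and the example rewards and step counts are made up for illustration), each episode gets one group-relative advantage from its final reward, and that single number is copied onto every step of the episode.

```python
from statistics import mean, pstdev

def broadcast_episode_advantage(episode_rewards, steps_per_episode, eps=1e-6):
    """Give every step of an episode the episode's group-relative advantage."""
    mu, sigma = mean(episode_rewards), pstdev(episode_rewards)
    per_step = []
    for reward, n_steps in zip(episode_rewards, steps_per_episode):
        advantage = (reward - mu) / (sigma + eps)
        per_step.append([advantage] * n_steps)  # identical credit for all steps
    return per_step

# Three episodes on the same task: one success, two failures.
per_step = broadcast_episode_advantage([1.0, 0.0, 0.0], [12, 50, 30])
```

The 50-step failed episode carries the same negative value on all 50 steps, so nothing in the signal separates the good actions from the one that caused the failure.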

The analogy: imagine running a six-month project, and at the end your manager says only “the project failed,” with no check-ins along the way. You would have no idea which decision to fix. That is standard RL on a long task, and it is the gap newer methods aim to close.

Beyond GRPO: milestone and step-level credit assignment

A line of recent research extends GRPO’s group-relative idea with finer-grained credit, so that an agent earns reward at intermediate milestones rather than only at the end. The shared principle is to replace one global guess with many local signals.

One notable example is GiGPO (Group-in-Group Policy Optimization, NeurIPS 2025), which keeps GRPO’s critic-free design but adds a second, inner level of comparison: an outer group compares whole episodes, while an inner group compares actions taken at the same task checkpoint. That yields step-level credit without a value model or extra rollouts, and the paper reports gains of more than 12% over GRPO on ALFWorld and more than 9% on WebShop at the same memory cost. Other approaches partition a trajectory into milestone-bounded segments with a local baseline per segment, or use a hierarchical planner and executor that each train on their own reward.
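The two-level idea can be illustrated with a small sketch. It is loosely inspired by the description above and is not the paper's implementation: the checkpoint labels, the per-step return values, the additive combination, and the `step_weight` parameter are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize(values, eps=1e-6):
    """Group-relative normalization: (v - mean) / (std + eps)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def two_level_advantages(episodes, step_weight=1.0):
    """episodes: dicts with 'reward' and 'steps' = [(checkpoint, return_to_go), ...]."""
    # Outer group: compare whole episodes by final reward.
    episode_adv = normalize([ep["reward"] for ep in episodes])

    # Inner groups: collect steps from any episode that share a checkpoint.
    groups = defaultdict(list)
    for i, ep in enumerate(episodes):
        for t, (checkpoint, _) in enumerate(ep["steps"]):
            groups[checkpoint].append((i, t))

    step_adv = {}
    for members in groups.values():
        returns = [episodes[i]["steps"][t][1] for i, t in members]
        for key, adv in zip(members, normalize(returns)):
            step_adv[key] = adv  # a checkpoint seen only once gets 0.0

    return [
        [episode_adv[i] + step_weight * step_adv[(i, t)] for t in range(len(ep["steps"]))]
        for i, ep in enumerate(episodes)
    ]

episodes = [
    {"reward": 1.0, "steps": [("start", 1.0), ("door_open", 1.0)]},
    {"reward": 0.0, "steps": [("start", 0.0), ("door_open", 0.0)]},
    {"reward": 0.0, "steps": [("start", 0.0), ("wrong_room", 0.0)]},
]
adv = two_level_advantages(episodes)
```

Unlike the single broadcast value of plain GRPO, steps inside the same failed episode now receive different credit, because each step is also compared against other steps taken from the same checkpoint. No value model is involved and no additional rollouts are sampled.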

The takeaway for anyone training agents: the algorithm is only half the story. Milestone-based methods need training data and environments that actually expose intermediate structure: per-phase reward signals, milestone boundaries, and partial-progress credit. The structure of the data increasingly matters as much as its quantity.

How Snorkel AI fits in

Snorkel AI builds the expert data and environments that make advanced RL methods work, including the milestone-structured environments and per-phase evaluation that GRPO’s successors depend on. That is the same focus reflected in Snorkel’s work on agent evaluation, from Agents’ Last Exam and JudgmentBench to Terminal-Bench Science.

For models that need to be right. Not just good enough.
