It is widely believed that RLVR enables LLMs to continually self-improve, thereby acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. However, this assumption is critically re-examined by measuring the pass@k metric at large values of k to probe the reasoning capability boundary of the models across a range of model families, RL algorithms, and math/coding benchmarks.
TL;DR:
- While RL-trained models outperform their base models at small values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score than their RL counterparts at large k.
- Further analysis shows that the reasoning paths generated by RL-trained models are already included in the base models’ sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already possessed by the base models.
- RL training boosts performance by biasing the model’s output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently.
- However, this also limits exploration, resulting in a narrower reasoning capability boundary compared to base models.
- Similar results are observed in visual reasoning tasks trained with RLVR.
- Moreover, distillation is found to genuinely introduce new knowledge into the model.
The project is available on GitHub.
The analysis is organized by task category, covering three representative domains: mathematics, code generation, and visual reasoning. For all sampling procedures involving both base and RL-trained models, a temperature of 0.6 and a top-p value of 0.95 are used, allowing a maximum generation length of 16,384 tokens.
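The exact inference stack is not specified in this summary; below is a minimal sketch, assuming vLLM, of drawing many samples per problem with the stated decoding settings (the model name, prompt, and number of samples are illustrative).

```python
# Minimal sampling sketch (assumes vLLM; model name and n are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B")  # base or RL-trained checkpoint
params = SamplingParams(
    n=256,             # samples per problem, later used for pass@k
    temperature=0.6,   # decoding settings reported in the setup above
    top_p=0.95,
    max_tokens=16384,
)
outputs = llm.generate(["Solve: 2x + 3 = 11. Give the final answer."], params)
completions = [o.text for o in outputs[0].outputs]  # 256 candidate solutions
```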
RLVR for Mathematical Reasoning
- Compared the performance of base LLMs (Qwen-2.5 and LLaMA-3.1-8B) with their RLVR-trained counterparts (trained using GRPO on the GSM8K and MATH datasets).
- Evaluated models using pass@k (the probability of generating a correct answer within k attempts; the standard estimator is sketched after this list) on various math benchmarks (GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23).
- Included an additional comparison with Oat-Zero-7B, an RL model trained using the Oat-Zero framework.
- RLVR increases the likelihood of sampling correct answers when k is small (e.g., k=1, equivalent to average-case accuracy).
- RLVR narrows the model’s overall problem-solving coverage, as evidenced by base models outperforming RL models at larger k values.
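For reference, pass@k is typically computed with the unbiased estimator of Chen et al. (2021) from n samples per problem, c of which are correct. A small sketch, with illustrative counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n sampled solutions per problem of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Average over problems, e.g. comparing a base and an RL model at several k
# (correct counts out of n=64 samples are illustrative, not the paper's data).
correct_counts = {"base": [3, 0, 17, 1], "rl": [12, 0, 30, 0]}
for name, counts in correct_counts.items():
    curve = {k: round(float(np.mean([pass_at_k(64, c, k) for c in counts])), 3) for k in (1, 8, 64)}
    print(name, curve)
```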
RLVR for Code Generation
- Model: Code-R1 (specifically CodeR1-Zero-Qwen2.5-7B), trained with RLVR using a binary correctness reward based on predefined test cases (a minimal reward sketch follows this list). The model was based on Qwen2.5-7B-Instruct-1M and trained on 12K LeetCode and TACO samples.
- Evaluation: Performance is assessed on three code generation benchmarks: LiveCodeBench v5 (880 problems), HumanEval+, and MBPP+.
- RLVR improves single-sample performance (pass@1) on code generation tasks, similar to its effect on mathematical reasoning tasks.
- RLVR negatively impacts the reasoning boundary, i.e., the coverage, of the model. While the original model keeps solving more problems as sampling increases (larger k), the RLVR-trained model plateaus. Specifically, at k=128 on LiveCodeBench, the original model solves ~50% of problems while the RLVR model solves only ~42.8%.
- Although RLVR enhances initial performance, it limits the model’s ability to solve a wider range of problems than the original model when multiple solution attempts are allowed. This indicates a trade-off between single-sample accuracy and exploration capability.
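Code-R1’s actual sandbox is not described here; the following is a minimal sketch, under stated assumptions, of a binary test-case reward: run the candidate program against stdin/stdout test cases and return 1 only if every case passes.

```python
import subprocess
import sys

def binary_code_reward(program: str, test_cases: list[tuple[str, str]], timeout_s: float = 5.0) -> float:
    """Return 1.0 only if the program passes every (stdin, expected_stdout) pair, else 0.0.
    Illustrative only: a real RLVR setup would sandbox execution and limit resources."""
    for stdin, expected in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return 0.0
    return 1.0

# Toy usage: a program that doubles an integer read from stdin.
print(binary_code_reward("print(2 * int(input()))", [("3", "6"), ("10", "20")]))  # 1.0
```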
RLVR for Visual Reasoning
- Model: Qwen-2.5-VL-7B (a vision-language model), trained using the EasyR1 framework on the Geometry3K dataset.
- Evaluation data: Filtered versions of MathVista-TestMini and MathVision-TestMini, excluding multiple-choice questions to avoid guessing bias (a toy filtering sketch follows this list). The filtering resulted in 460 problems from MathVista and 114 problems from MathVision.
- RLVR consistently improves the visual reasoning performance of the LLM, similar to its effects on math and coding benchmarks.
- The improvement is attributed to broader coverage of solvable questions, meaning the model can solve a wider range of problems after RLVR training.
- Manual inspection of CoTs on challenging problems indicates that the increased performance comes from the model learning valid reasoning paths rather than from random guessing. Specifically, for both the original and RL models, 7 out of 8 inspected problems had at least one correct CoT leading to the right answer. This validates the effectiveness of the CoT approach in improving reasoning abilities.
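A toy sketch of the multiple-choice filter, assuming each record carries a question_type field (the field name and values are assumptions for illustration):

```python
# Keep only free-form questions so random guessing cannot inflate the scores.
# The "question_type" field and its values are assumed names for illustration.
def drop_multiple_choice(records):
    return [ex for ex in records if ex.get("question_type") != "multi_choice"]

sample = [
    {"question": "What is the area of the shaded region?", "question_type": "free_form"},
    {"question": "Which angle is largest? (A) ... (B) ...", "question_type": "multi_choice"},
]
print(drop_multiple_choice(sample))  # only the free-form item remains
```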
Reasoning Patterns Already Present in Base Models
Compared the set of solvable problems for base models and their corresponding RL-trained versions on AIME24 (math problems) and coding tasks.
Performed perplexity analysis: measured the perplexity under the base model (PPL_Base) of responses generated by the RL-trained model (Y_RL), by the base model itself (Y_Base), and by a stronger model (OpenAI-o1, Y_GT); a computation sketch follows this list.
- RLVR does not introduce new reasoning abilities: RL-trained models do not exhibit reasoning capabilities beyond those already present in the base models. The reasoning paths exploited by the RL model already exist within the base model’s output distribution. This is supported by the perplexity analysis, which shows that the RL model’s responses are highly likely to be generated by the base model.
- RLVR improves sampling efficiency: While not introducing new capabilities, RLVR increases the probability of sampling correct reasoning paths already present in the base model, leading to better pass@1 performance.
- RLVR narrows the reasoning boundary: The improved sampling efficiency comes at the cost of reduced exploration and diversity in the generated responses, leading to lower pass@k (solving problems within k attempts) at larger values of k. This is attributed to RL’s tendency to reduce output entropy.
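The paper’s exact perplexity setup is not reproduced here; a minimal sketch, assuming Hugging Face transformers, of scoring a response under the base model (a lower PPL_Base means the base model is more likely to generate that response):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model whose distribution we score against (checkpoint name illustrative).
name = "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(name)
base = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

def ppl_under_base(prompt: str, response: str) -> float:
    """PPL_Base(response | prompt): exponentiated mean NLL of the response tokens
    under the base model; prompt tokens are masked out of the loss with -100."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # score only the response continuation
    with torch.no_grad():
        loss = base(input_ids=ids, labels=labels).loss  # mean NLL over unmasked tokens
    return float(torch.exp(loss))

# Compare, e.g., ppl_under_base(q, y_rl) vs. ppl_under_base(q, y_base) vs. ppl_under_base(q, y_o1).
```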
Distillation Expands the Reasoning Boundary
Distillation of a large reasoning model (DeepSeek-R1) into a smaller base model (Qwen-2.5-Math-7B); a minimal training sketch follows this subsection. The performance of the distilled model (DeepSeek-R1-Distill-Qwen-7B) is compared with:
- the base model (Qwen-2.5-Math-7B)
- its RL-trained counterpart (Qwen-2.5-Math-7B-Oat-Zero)
- an instruction-tuned model (Qwen-2.5-Math-7B-Instruct)
- Distillation significantly improves the reasoning capabilities of the base model.
- Unlike RL, which is bounded by the base model’s reasoning capacity, distillation introduces new reasoning patterns learned from the stronger teacher model, allowing the distilled model to surpass the limitations of the base model.
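The distillation recipe itself is not detailed in this summary; conceptually it is supervised fine-tuning of the student on teacher-generated reasoning traces. A minimal sketch, with assumed dataset fields and hyperparameters:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed format: problem text plus a teacher (DeepSeek-R1 style) CoT and answer.
teacher_traces = [
    {"problem": "Q: 2x + 3 = 11. x = ?\n", "teacher_cot": "Subtract 3, divide by 2: x = 4."},
]

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B", torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)  # illustrative learning rate

student.train()
for ex in teacher_traces:
    ids = tok(ex["problem"] + ex["teacher_cot"], return_tensors="pt").input_ids
    loss = student(input_ids=ids, labels=ids).loss  # next-token loss on the teacher trace
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```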
Effects of Different RL Algorithms
- Algorithms: Several popular RL algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) were re-implemented using the VeRL framework.
- Dataset: The Omni-MATH-Rule dataset is split into training and in-domain test sets; MATH500 is used as the out-of-domain benchmark.
- Metric: The Sampling Efficiency Gap (∆SE) is defined as the difference between the RL-trained model’s pass@1 and the base model’s pass@256; a lower ∆SE indicates better sampling efficiency (see the sketch after this list).
- Overall performance: The different RL algorithms showed minor variations in pass@1 and pass@256, but none significantly closed the Sampling Efficiency Gap; ∆SE remained above 40 points across all algorithms.
- DAPO: Achieved slightly higher pass@1 scores but required significantly more samples per batch (3-6x) during training, and its performance dropped considerably at pass@256.
- RLOO and Reinforce++: Performed consistently well across different values of k (1 to 256) with efficient training costs, offering a good balance between effectiveness and efficiency.
- ReMax: Showed lower performance, likely due to instability caused by the binary and highly variable reward used as the advantage baseline.
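For concreteness, a self-contained sketch of computing ∆SE from per-problem correct-sample counts (the sign convention is assumed so that a smaller gap means the RL model’s pass@1 is closer to the base model’s pass@256; all numbers are illustrative, not the paper’s):

```python
import numpy as np

def sampling_efficiency_gap(base_counts, rl_counts, n: int = 256) -> float:
    """ΔSE: base model's pass@256 minus the RL model's pass@1, in points.
    Counts are correct samples per problem out of n = 256 draws, so
    pass@256 reduces to "at least one correct" and pass@1 to c / n."""
    base_pass_256 = np.mean([1.0 if c > 0 else 0.0 for c in base_counts])
    rl_pass_1 = np.mean([c / n for c in rl_counts])
    return 100.0 * (base_pass_256 - rl_pass_1)

# Illustrative counts giving a gap of roughly 40+ points, as reported above.
print(sampling_efficiency_gap(base_counts=[3, 0, 17, 1, 9], rl_counts=[120, 0, 240, 0, 64]))
```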
Asymptotic Effects of RL Training
The model is trained with RL for varying numbers of training steps (e.g., 150, 450). Performance is evaluated using pass@1 (exact-match accuracy) and pass@256 (accuracy within 256 sampled candidates) on the training, in-domain test, and out-of-domain test sets.
- Increasing the number of RL training steps significantly improves pass@1 on the training set (from 26.1 to 42.5).
- However, the improvement in pass@1 on the in-domain and out-of-domain test sets is marginal beyond 150 steps, suggesting potential overfitting to the training set.
- Increasing training steps leads to a decrease in pass@256 across all datasets, with the lowest performance at 450 steps. This indicates a reduced reasoning boundary and exploration capacity as training progresses, likely due to decreasing output entropy.
- Longer RL training (beyond 150 steps) may not provide substantial benefits and might even hurt performance due to overfitting and reduced exploration.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (arXiv 2504.13837)