Llama-Nemotron is an open family of heterogeneous reasoning models available in Nano (8B), Super (49B), and Ultra (253B) sizes, designed for exceptional reasoning capability and efficient inference. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference.
The LN-Super and LN-Ultra models are optimized for efficient inference using the Puzzle framework. Puzzle is a neural architecture search (NAS) framework that transforms large language models into hardware-efficient variants under real-world deployment constraints. Starting from a Llama 3 Instruct model (Llama 3.3-70B-Instruct for LN-Super and Llama 3.1-405B-Instruct for LN-Ultra), Puzzle applies block-wise local distillation to build a library of alternative transformer blocks. Each block is trained independently and in parallel to approximate the function of its parent block while improving computational properties such as latency, memory usage, or throughput, with a certain accuracy-efficiency tradeoff.
The block variants include:
- Attention removal: Some blocks omit the attention mechanism entirely, reducing both compute and KV-cache memory consumption.
- Variable FFN dimensions: The feed-forward network's intermediate size is varied, enabling compression at different granularity levels (e.g., 87%, 75%, 50%, down to 10% of the original hidden size).
While Puzzle supports additional operations, including grouped-query attention (GQA) with different numbers of key-value heads, linear alternatives to attention, and no-op substitutions, empirical evaluation showed that attention removal and FFN compression were the most effective for optimizing the LN-Super and LN-Ultra models in terms of overall throughput and memory savings.
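As a rough illustration of these two variant types, the PyTorch sketch below builds a block that can optionally drop its attention sub-layer and scale its FFN intermediate size. The module name, dimensions, normalization, and activation are assumptions made for illustration, not the released architecture.

```python
# Minimal sketch of a block-variant library entry: optional attention removal
# and a variable FFN intermediate size. Illustrative assumptions throughout.
import torch
import torch.nn as nn

class PuzzleBlockVariant(nn.Module):
    def __init__(self, hidden_size, ffn_ratio=1.0, keep_attention=True, num_heads=32):
        super().__init__()
        self.keep_attention = keep_attention
        if keep_attention:
            self.attn_norm = nn.LayerNorm(hidden_size)
            self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # FFN intermediate size scaled by ffn_ratio (e.g. 0.87, 0.75, 0.5, ... of the original)
        intermediate = int(4 * hidden_size * ffn_ratio)
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate),
            nn.GELU(),
            nn.Linear(intermediate, hidden_size),
        )

    def forward(self, x):
        if self.keep_attention:
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out                      # residual around attention
        return x + self.ffn(self.ffn_norm(x))     # residual around the FFN

# Example: an attention-free block with a 50% FFN, one entry in the block library.
block = PuzzleBlockVariant(hidden_size=1024, ffn_ratio=0.5, keep_attention=False)
out = block(torch.randn(2, 16, 1024))
```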
Puzzle assembles a complete model by selecting one block per layer. This selection is governed by a mixed-integer programming (MIP) solver that identifies the most efficient configuration under a given set of constraints, such as hardware compatibility, maximum allowed latency, total memory budget, or desired inference throughput.
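The selection step can be pictured as a small integer program in which binary variables pick exactly one variant per layer subject to latency and memory budgets. The sketch below uses PuLP with made-up per-block costs and a made-up quality proxy; the actual Puzzle formulation, scoring, and solver are not reproduced here.

```python
# Toy MIP for per-layer block selection (illustrative numbers, not Puzzle's).
import pulp

num_layers = 4
variants = ["full", "ffn_50", "no_attn"]
latency = {"full": 1.0, "ffn_50": 0.7, "no_attn": 0.4}    # per-block latency cost
memory  = {"full": 1.0, "ffn_50": 0.8, "no_attn": 0.5}    # per-block memory cost
quality = {"full": 1.0, "ffn_50": 0.97, "no_attn": 0.90}  # block-wise quality proxy

prob = pulp.LpProblem("puzzle_block_selection", pulp.LpMaximize)
x = {(l, v): pulp.LpVariable(f"x_{l}_{v}", cat="Binary")
     for l in range(num_layers) for v in variants}

# Exactly one variant per layer.
for l in range(num_layers):
    prob += pulp.lpSum(x[l, v] for v in variants) == 1

# Deployment constraints: total latency and memory budgets.
prob += pulp.lpSum(latency[v] * x[l, v] for l in range(num_layers) for v in variants) <= 3.0
prob += pulp.lpSum(memory[v] * x[l, v] for l in range(num_layers) for v in variants) <= 3.2

# Objective: maximize the summed quality proxy under the constraints.
prob += pulp.lpSum(quality[v] * x[l, v] for l in range(num_layers) for v in variants)
prob.solve(pulp.PULP_CBC_CMD(msg=False))

chosen = {l: v for (l, v) in x if pulp.value(x[l, v]) == 1}
print(chosen)
```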
For the LN-Ultra model, an additional compression technique called FFN Fusion is introduced, designed to reduce sequential depth and improve inference latency. This technique leverages a structural property that emerges after Puzzle removes some attention layers: the model often contains consecutive FFN blocks. FFN Fusion identifies such sequences and replaces them with fewer, wider FFN layers that can be executed in parallel. This reduces the number of sequential steps without compromising expressivity, and significantly improves compute utilization, especially on multi-GPU setups where inter-layer communication overhead is non-negligible.
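A minimal sketch of the fusion idea, under the simplifying assumption of two plain GELU FFNs without gating: the up- and down-projections of two consecutive FFN blocks are stacked into one wider FFN, so a single layer computes what the two originals would compute when applied to the same input.

```python
# Sketch of fusing two consecutive FFN blocks into one wider FFN (assumptions,
# not the paper's exact construction).
import torch
import torch.nn as nn

def make_ffn(hidden, intermediate):
    return nn.Sequential(nn.Linear(hidden, intermediate), nn.GELU(),
                         nn.Linear(intermediate, hidden))

hidden, inter = 1024, 4096
ffn_a, ffn_b = make_ffn(hidden, inter), make_ffn(hidden, inter)

class FusedFFN(nn.Module):
    """One wider FFN whose output equals ffn_a(x) + ffn_b(x)."""
    def __init__(self, ffn_a, ffn_b):
        super().__init__()
        self.up = nn.Linear(hidden, 2 * inter)
        self.act = nn.GELU()
        self.down = nn.Linear(2 * inter, hidden)
        # Stack the original weights so both halves run inside one matmul.
        with torch.no_grad():
            self.up.weight.copy_(torch.cat([ffn_a[0].weight, ffn_b[0].weight], dim=0))
            self.up.bias.copy_(torch.cat([ffn_a[0].bias, ffn_b[0].bias], dim=0))
            self.down.weight.copy_(torch.cat([ffn_a[2].weight, ffn_b[2].weight], dim=1))
            self.down.bias.copy_(ffn_a[2].bias + ffn_b[2].bias)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

# The fused layer approximates applying both FFNs to the same residual stream
# (x + ffn_a(x) + ffn_b(x)) instead of strictly sequentially, cutting depth.
x = torch.randn(2, 8, hidden)
fused = FusedFFN(ffn_a, ffn_b)
out = x + fused(x)
```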
Following the NAS phase, both LN-Super and LN-Ultra undergo additional training to improve inter-block compatibility and recover any quality loss introduced during blockwise substitution.
- LN-Super is trained for 40B tokens using a knowledge distillation objective over the Distillation Mix dataset introduced by Puzzle.
- LN-Ultra is first trained with knowledge distillation for 65B tokens using the same distillation dataset, followed by 88B tokens of continued training on the Nemotron-H phase 4 pretraining dataset.
This final pretraining step allows LN-Ultra to not only match but surpass the reference model Llama 3.1-405B-Instruct on key benchmarks.
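A minimal sketch of a token-level knowledge distillation loss of the kind used in this recovery training, where the student matches the teacher's next-token distribution via a KL divergence; the temperature and tensor shapes are illustrative assumptions.

```python
# Sketch of a token-level distillation objective (illustrative, not the exact recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) averaged over token positions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy example: vocabulary of 32 tokens, sequence length 6.
student_logits = torch.randn(2, 6, 32)
teacher_logits = torch.randn(2, 6, 32)
loss = distillation_loss(student_logits.flatten(0, 1), teacher_logits.flatten(0, 1))
```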
Data for supervised fine-tuning is curated in both reasoning and non-reasoning categories. Reasoning samples include the system instruction "detailed thinking on". Non-reasoning samples use the instruction "detailed thinking off".
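A minimal sketch of how such a sample might be assembled with the toggle instruction; only the "detailed thinking on/off" strings come from the description above, while the chat-message layout and example contents are assumptions.

```python
# Sketch of pairing an SFT sample with the reasoning-toggle system instruction.
def build_sample(prompt, response, reasoning: bool):
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]

reasoning_sample = build_sample(
    "What is 17 * 24?",
    "Let me compute step by step: 17 * 24 = 408. The answer is 408.",
    reasoning=True,
)
chat_sample = build_sample("What is 17 * 24?", "408", reasoning=False)
```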
Math
To construct the math reasoning portion of the data, the pipeline described by OpenMath Nemotron is used. DeepSeek-R1 and Qwen2.5-Math-7B-Instruct are prompted to solve each problem multiple times, generating "reasoning" and "non-reasoning" solutions respectively. 16 generations per problem are used for DeepSeek-R1 and 64 generations per problem for Qwen2.5-Math-7B-Instruct. As the final filtering step, any solutions that do not reach the expected answer are removed. Predicted and expected answers are compared by prompting Qwen2.5-32B-Instruct to judge their equivalence in the context of the problem.
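A minimal sketch of this filtering step, assuming a hypothetical judge_equivalent wrapper around the Qwen2.5-32B-Instruct judge and a simple answer-extraction convention:

```python
# Sketch: keep only generations whose answer the judge model deems equivalent
# to the expected answer. judge_equivalent is a hypothetical model wrapper.
def judge_equivalent(problem: str, predicted: str, expected: str) -> bool:
    """Hypothetical helper: prompt the judge model and parse a yes/no verdict."""
    raise NotImplementedError

def filter_solutions(problem: str, generations: list[str], expected: str) -> list[str]:
    kept = []
    for text in generations:
        predicted = text.strip().splitlines()[-1]  # assumed: final line holds the answer
        if judge_equivalent(problem, predicted, expected):
            kept.append(text)
    return kept
```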
Code
The code reasoning dataset is built through a multi-stage process involving question collection, solution generation, and post-processing steps, as described by OpenCodeReasoning. DeepSeek-R1 is used to generate multiple solutions per question, primarily in Python, with C++ solutions also generated for specific benchmark testing.
Science
A diverse set of open-ended and multiple-choice questions (MCQs) is curated from both in-house and external sources. These include question-answer pairs extracted from StackOverflow and synthetically generated MCQs. Synthetic questions are created by defining a broad set of academic topics (e.g., physics, biology, chemistry) and their subtopics using Nemotron-4-340B-Instruct. Multiple difficulty levels are specified to ensure a diverse and scalable dataset. Qwen2.5 models are prompted to generate MCQs conditioned on the topic, subtopic, and difficulty level. Each question is verified for format compliance. The dataset is augmented by prompting Qwen2.5 to generate variations of the original questions, following the OpenMathInstruct-2 pipeline. For all questions in the dataset, DeepSeek-R1 is used to generate multiple reasoning traces. For questions without ground-truth answers, the most likely correct answer is inferred by applying majority voting across generated solutions.
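A minimal sketch of the majority-voting step for questions without ground-truth answers, with a hypothetical extract_answer parser:

```python
# Sketch: pseudo-label a question by majority vote over its reasoning traces.
from collections import Counter

def extract_answer(trace: str) -> str:
    """Hypothetical helper: pull the final answer (e.g. an MCQ letter) from a trace."""
    return trace.strip().splitlines()[-1]

def majority_vote(traces: list[str]) -> str:
    answers = [extract_answer(t) for t in traces]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```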
General
For general-domain data, the generation pipeline established in Nemotron-4 340B is adopted. For responses, DeepSeek-R1 is prompted for multiple generations and rejection sampling is performed using the Llama-3.1-Nemotron-70B reward model.
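A minimal sketch of rejection sampling against a reward model, with a hypothetical score_with_reward_model function standing in for the Llama-3.1-Nemotron-70B reward model call:

```python
# Sketch: score candidate responses with a reward model and keep the best one.
def score_with_reward_model(prompt: str, response: str) -> float:
    raise NotImplementedError  # hypothetical reward-model call

def rejection_sample(prompt: str, candidates: list[str]) -> str:
    scores = [score_with_reward_model(prompt, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```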
Reasoning off
To train the model to follow the reasoning toggle instruction, paired data is constructed where each prompt has both a reasoning response and a non-reasoning response. Specifically, prompts are randomly sampled from the reasoning datasets above, and corresponding non-reasoning responses are generated using Llama-3.1-Nemotron-70B-Instruct for general-domain prompts and Llama-3.3-70B-Instruct for the others.
General-Domain Open-ended Inference-Time Scaling
To generate high-quality general-domain open-ended responses, Llama-3.1-Nemotron-70B-Instruct is employed in conjunction with a novel Feedback-Edit Inference-Time-Scaling system. The process begins with 20k first-turn prompts sourced from ShareGPT and WildChat-1M. Llama-3.1-Nemotron-70B-Instruct generates multiple initial responses for each prompt. These responses are refined through a three-stage process: a dedicated Feedback model identifies areas for improvement, a dedicated Edit model makes targeted edits based on the feedback, and a dedicated Select model chooses the best edited response. The resulting dataset comprises 20k first-turn prompts and their corresponding high-quality responses.
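A minimal sketch of the three-stage loop, with the three model calls left as hypothetical stubs; only the Feedback, Edit, and Select structure comes from the description above:

```python
# Sketch of the Feedback -> Edit -> Select refinement loop (model calls are stubs).
def feedback_model(prompt: str, response: str) -> str: ...
def edit_model(prompt: str, response: str, feedback: str) -> str: ...
def select_model(prompt: str, candidates: list[str]) -> str: ...

def feedback_edit_select(prompt: str, initial_responses: list[str]) -> str:
    edited = []
    for response in initial_responses:
        feedback = feedback_model(prompt, response)             # identify weaknesses
        edited.append(edit_model(prompt, response, feedback))   # apply targeted edits
    return select_model(prompt, edited)                         # pick the best edited response
```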
All models are trained using a token-level cross-entropy loss over the instruction-tuning data.
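A minimal sketch of this loss, assuming the common convention of masking prompt tokens with a -100 label so that only response tokens are supervised:

```python
# Sketch of a token-level cross-entropy SFT loss with prompt-token masking.
import torch
import torch.nn.functional as F

def sft_loss(logits, labels):
    """logits: [batch, seq, vocab]; labels: [batch, seq] with -100 on prompt tokens."""
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

logits = torch.randn(2, 8, 32)
labels = torch.full((2, 8), -100)
labels[:, 4:] = torch.randint(0, 32, (2, 4))   # only the response span is supervised
loss = sft_loss(logits, labels)
```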
LN-Nano undergoes a three-stage SFT pipeline. In the first stage, the model is fine-tuned exclusively on reasoning data from code, math, and science domains with a learning rate of 1e-4 for four epochs. This prevents failure modes such as repetitive completions. In the second stage, non-reasoning data is introduced, mixed with reasoning samples, allowing the model to learn reasoning control. In the final stage, a smaller blend focused on chat, instruction following, and tool calling is used.
LN-Super is trained on the full SFT dataset for a single epoch using a fixed learning rate of 5e-6, a sequence length of 16k, and a global batch size of 256. Smaller-scale runs suggest that performance improves up to 3-4 epochs with larger learning rates (5e-5), but training was constrained by compute budget and deadlines.
LN-Ultra is trained on the full dataset using sequence packing with an effective sequence length of 24k. Preliminary ablation runs indicated that higher learning rates such as 5e-5 generally improve results, but persistently high learning rates caused instability, including gradient explosions. To mitigate this, a linear warmup to 1e-5 followed by cosine decay to 1e-6, with a warmup ratio of 10%, is implemented. Despite these measures, training encountered gradient explosions and numerical instability after the first epoch. This required resuming training with reinitialized optimizer states, after which successful convergence was achieved.
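A minimal sketch of the schedule described for LN-Ultra (linear warmup to 1e-5 over the first 10% of steps, then cosine decay down to 1e-6), with an illustrative total step count:

```python
# Sketch of the warmup + cosine-decay learning-rate schedule.
import math

def lr_at(step, total_steps, peak=1e-5, floor=1e-6, warmup_ratio=0.10):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)  # linear warmup to the peak rate
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))  # cosine decay

schedule = [lr_at(s, total_steps=1000) for s in range(1000)]
```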
Using supervised fine-tuning, LN-Ultra can approach the performance of DeepSeek-R1 but not exceed it. To enable students to surpass their teachers, large-scale reinforcement learning is a viable approach, since it allows the model to continuously explore new possibilities and engage in self-learning.
Preliminary experiments indicate that applying RL to smaller models yields suboptimal results compared to distillation. Due to resource constraints, reasoning RL is only applied to LN-Ultra, which results in a model that outperforms its teacher, using the Group Relative Policy Optimization (GRPO) algorithm.
In this training phase, two types of rewards are used (both are sketched in code after the list):
- Accuracy rewards: For each training example, a ground-truth answer (a number, a sentence, or a paragraph) is provided. The Llama-3.3-70B-Instruct model is used to judge whether the policy's predictions match the ground-truth answer.
- Format rewards: A format reward is employed to ensure the model places its thinking process between thinking tags ("<think>" and "</think>") when using "detailed thinking on" mode, and to check for the absence of thinking tags when using "detailed thinking off" mode.
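A minimal sketch of the two reward terms; the accuracy judge is left as a hypothetical wrapper around Llama-3.3-70B-Instruct, the thinking-tag names are assumed, and the binary 0/1 weighting is an illustrative choice:

```python
# Sketch of accuracy and format reward terms for the reasoning RL phase.
import re

def accuracy_reward(prediction: str, ground_truth: str, judge) -> float:
    """1.0 if the judge model says the prediction matches the ground truth, else 0.0."""
    return 1.0 if judge(prediction, ground_truth) else 0.0

def format_reward(response: str, reasoning_on: bool) -> float:
    has_think = bool(re.search(r"<think>.*?</think>", response, flags=re.DOTALL))
    # Reasoning mode must contain the thinking tags; chat mode must not.
    return 1.0 if has_think == reasoning_on else 0.0
```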
To ensure that the model is sufficiently challenged, the data is preprocessed by independently generating 8 responses per question using LN-Super, calculating the pass rate, and then intentionally discarding prompts with a pass rate of 0.75 or higher.
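A minimal sketch of this difficulty filter, with hypothetical generate and is_correct helpers:

```python
# Sketch: drop prompts that LN-Super already solves 75% of the time or more.
def pass_rate(prompt, answer, generate, is_correct, n=8):
    responses = [generate(prompt) for _ in range(n)]
    return sum(is_correct(r, answer) for r in responses) / n

def filter_prompts(dataset, generate, is_correct):
    return [(p, a) for p, a in dataset if pass_rate(p, a, generate, is_correct) < 0.75]
```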
Curriculum training is also found to be helpful, as it allows the model to gradually learn from a progression of tasks of increasing difficulty.
After training for scientific reasoning, a short RL run optimizes instruction-following capabilities for LN-Super and LN-Ultra. RL is run with the RLOO algorithm, using an instruction-following verifier as the reward. This training boosts performance on typical instruction-following benchmarks as well as reasoning benchmarks.
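A minimal sketch of the RLOO advantage computation, where each sampled response is baselined against the mean reward of the other samples for the same prompt; rewards are assumed to come from the instruction-following verifier:

```python
# Sketch of the REINFORCE Leave-One-Out (RLOO) advantage for one prompt.
def rloo_advantages(rewards: list[float]) -> list[float]:
    k = len(rewards)
    total = sum(rewards)
    # Each sample's baseline is the mean reward of the other k-1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: four sampled responses for one prompt, verifier rewards in [0, 1].
print(rloo_advantages([1.0, 0.0, 1.0, 1.0]))
```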
RLHF is used to improve the model's general helpfulness and chat capabilities while carefully maintaining its proficiency in other areas.
For LN-Super, iterative online RPO is used to maximize the reward predicted by Llama-3.1-Nemotron-70B-Reward over prompts from HelpSteer2.
The same process is followed for LN-Ultra, except that GRPO is employed.
For LN-Nano, two rounds of offline RPO with on-policy data are performed. A mixture of reasoning and non-reasoning data with appropriate system prompts is used in the first round of RPO to improve reasoning control, followed by a second round with on-policy generations targeting instruction-following improvements.