First, this work curates a small dataset, s1K, of 1,000 questions paired with reasoning traces, selected by three criteria validated through ablations: difficulty, diversity, and quality. Second, budget forcing is developed to control test-time compute by forcefully terminating the model’s thinking process, or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to finish. This can lead the model to double-check its answer, often fixing incorrect reasoning steps.
The project is available on GitHub.
Initial collection of 59K samples
An initial 59,029 questions are collected from 16 sources, following three guiding principles. Datasets should be high-quality; samples are always inspected, and datasets with, e.g., poor formatting are discarded. Datasets should be challenging and require significant reasoning effort. Datasets should stem from diverse fields to cover different reasoning tasks.
All samples are decontaminated against evaluation questions (MATH500, GPQA Diamond, AIME24) using 8-grams, and the data is deduplicated.
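The 8-gram decontamination step can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the paper's exact tokenizer and matching details may differ.

```python
def ngrams(text, n=8):
    """All n-grams (as token tuples) of a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(samples, eval_questions, n=8):
    """Drop any training sample that shares an n-gram with an eval question."""
    banned = set()
    for q in eval_questions:
        banned |= ngrams(q, n)
    return [s for s in samples if ngrams(s, n).isdisjoint(banned)]
```

Any sample overlapping an evaluation question in even one 8-gram is removed, which is a deliberately aggressive criterion to keep the benchmarks uncontaminated.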
Final selection of 1K samples
Three stages of filtering are used to arrive at a minimal set of 1,000 samples based on three guiding data principles: Quality, Difficulty, and Diversity.
First, any questions where API errors occurred are removed, reducing the dataset to 54,116 samples. Next, low-quality examples containing string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent question numbering, are filtered out, reducing the dataset to 51,581 examples. From this pool, 384 samples are preselected for the final 1,000 from datasets perceived as high-quality and not in need of further filtering.
For difficulty, two indicators are used: model performance and reasoning trace length. Two models, Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, are evaluated on each question, with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the reference solution. The token length of each reasoning trace, measured with the Qwen2.5 tokenizer, serves as a proxy for problem difficulty, on the assumption that harder problems require more thinking tokens. Questions that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can solve correctly are removed, as they may be too easy, bringing the total down to 24,496 samples.
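The difficulty filter reduces to a simple rule: keep only questions that neither reference model solves. A hedged sketch, where `graded` is a hypothetical mapping from (question, model) to the grader's correctness verdict:

```python
QWEN_MODELS = ("Qwen2.5-7B-Instruct", "Qwen2.5-32B-Instruct")

def difficulty_filter(questions, graded, models=QWEN_MODELS):
    """Keep questions that no reference model answers correctly.
    `graded[(question, model)]` is a bool judged against the reference solution."""
    return [q for q in questions if not any(graded[(q, m)] for m in models)]
```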
To quantify diversity, questions are classified into domains by Claude 3.5 Sonnet based on the Mathematics Subject Classification (MSC) system from the American Mathematical Society. To select the final examples from the pool of 24,496 questions, one domain is chosen uniformly at random. Then, one problem from that domain is sampled according to a distribution that favors longer reasoning traces. This process is repeated until 1,000 total samples spanning 50 domains are obtained.
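The selection loop above can be sketched in a few lines. The quadratic length weight here is an illustrative assumption; the paper only states that longer traces are favored.

```python
import random

def select_diverse(pool, k=1000, weight=lambda n_tokens: n_tokens ** 2, seed=0):
    """pool: dict mapping domain -> list of (question, trace_token_len).
    Repeatedly pick a domain uniformly at random, then sample one problem
    within it with probability proportional to `weight(trace length)`."""
    rng = random.Random(seed)
    pool = {d: list(items) for d, items in pool.items()}  # work on a copy
    chosen = []
    while len(chosen) < k and any(pool.values()):
        domain = rng.choice([d for d, items in pool.items() if items])
        items = pool[domain]
        weights = [weight(n) for _, n in items]
        idx = rng.choices(range(len(items)), weights=weights)[0]
        chosen.append(items.pop(idx))  # sample without replacement
    return chosen
```

Choosing the domain uniformly first (rather than sampling over the whole pool) prevents large domains from dominating the final 1,000 samples.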
Some distilled generations are incorrect, which is accepted since the focus is on capturing the reasoning process rather than fully correct solutions. 53.6% are deemed correct in s1K, and 63.0% in the follow-up s1K-1.1.
Test-time scaling methods are categorized into 1) Sequential, where later computations depend on earlier ones (e.g., a long reasoning trace), and 2) Parallel, where computations run independently (e.g., majority voting). The focus is on sequential scaling. New sequential scaling methods, and ways to benchmark them, are proposed.
A maximum token count is enforced by simply appending the end-of-thinking token delimiter, and optionally “Final Answer:”, to exit the thinking stage early and make the model provide its current best answer. To enforce a minimum, generation of the end-of-thinking token delimiter is suppressed, and optionally the string “Wait” is appended to the model’s current reasoning trace to encourage the model to reflect on its generation so far.
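The control logic of budget forcing can be sketched as below. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical helper returning a decoded continuation, `END_THINK` is a placeholder for the model's actual end-of-thinking delimiter, and token counts are approximated by whitespace words.

```python
END_THINK = "<end_of_thinking>"  # placeholder; real models use their own delimiter

def budget_force(generate, prompt, min_extensions=0, max_tokens=512):
    """Enforce a thinking budget: append "Wait" up to `min_extensions` times
    to extend thinking, and force the delimiter once `max_tokens` is reached."""
    trace, extensions = "", 0
    while True:
        trace += generate(prompt + trace, stop=END_THINK,
                          max_tokens=max_tokens - len(trace.split()))
        if len(trace.split()) >= max_tokens:
            # Maximum budget hit: force an early exit and ask for the answer.
            return trace + END_THINK + " Final Answer:"
        if extensions < min_extensions:
            # Minimum budget not met: suppress the stop and nudge reflection.
            trace += " Wait"
            extensions += 1
            continue
        return trace + END_THINK
```

In a real decoder, suppressing the delimiter would be done at the logits level (e.g., banning the end-of-thinking token), but the append-"Wait" loop captures the essential mechanism.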
Supervised finetuning is performed on Qwen2.5-32B-Instruct using s1K to obtain the model s1-32B.
- The resulting model, s1-32B, achieves strong performance on the reasoning benchmarks, comparable to much larger models trained on significantly more data.
- It demonstrates that carefully curated training data can substantially improve sample efficiency. s1-32B is the most sample-efficient open-data reasoning model.
- Budget forcing enables effective test-time scaling, allowing improved performance with increased compute.
- However, excessive suppression of the end-of-thinking token can lead to repetitive loops and diminishing returns.
- Sequential scaling via budget forcing is more effective than parallel scaling (majority voting).
Seven days after the release of s1, s1.1 was released. Traces for the 1,000 samples in s1K were regenerated using DeepSeek r1 to create s1K-1.1, and the same training procedure was used to train the model s1.1. Other updates since the release include the arrival of o3, LIMO, and AIME 2025. s1.1 performs significantly better than s1. Distilling from Claude 3.7 led to worse performance than from r1.
s1: Simple test-time scaling https://arxiv.org/abs/2501.19393