Archived
SnorkelSequences
A procedurally-generated and expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
Overview
The emergence of Reinforcement Learning with Verifiable Rewards (RLVR) as an LLM post-training paradigm has produced a new wave of models that are highly capable in mathematics. With our new SnorkelSequences benchmark, we aim to evaluate not only mathematical problem solving but also the compositional capabilities of LLMs. Strong models must be able to solve unseen problems that are composed of a series of simpler tasks. This benchmark, part of our wider series of procedurally-generated and expert-verified datasets focused on reasoning, allows both mathematical and compositional capabilities to be evaluated.
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | gpt-5 | 77.6% |
| 2 | gpt-5-mini | 77.6% |
| 3 | gpt-5-nano | 72% |
| 4 | GPT-5.4 | 71.6% |
| 5 | o3-mini | 71.2% |
| 6 | Gemini 2.5 Flash | 70.8% |
| 7 | Claude Sonnet 4 | 70.4% |
| 8 | Grok 4 Fast Reasoning | 70.2% |
| 9 | o4 mini | 68.8% |
| 10 | NVIDIA Nemotron Super 49B v1.5 | 66.8% |
| 11 | Gemini 2.5 Pro | 66% |
| 12 | Claude Opus 4 | 65.6% |
| 13 | o3 | 65.2% |
| 14 | Grok 4 | 63.2% |
| 15 | Llama 4 Maverick | 62% |
| 16 | Nova Premier | 51.8% |
| 17 | Llama 4 Scout | 48.4% |
| 18 | Claude Sonnet 3.7 | 47.6% |
| 19 | Magistral Medium | 47.6% |
| 20 | NVIDIA Nemotron Super 49B | 44.8% |
| 21 | Nova Pro | 41.2% |
| 22 | Nova Lite | 40% |
| 23 | Grok 3 | 39.2% |
| 24 | Llama 3.3 70B | 38.8% |
| 25 | Mistral Large | 38.8% |
| 26 | Codestral | 38.4% |
| 27 | GPT-4.1 | 36.8% |
| 28 | Nvidia 70B Instruct | 36.4% |
| 29 | Kimi-K2-Thinking | 36% |
| 30 | Llama 3.1 405B | 35.2% |
| 31 | Nova Micro | 33.6% |
| 32 | Qwen 3 235B | 28% |
Sample task
The initial version of this benchmark includes 250 complex samples, with questions covering diverse combinations of operators, ranges, conditions, and transforms. The following is an example:
Consider the integers from 7 to 362, inclusive.
First, keep only the numbers that have common logarithm (base 10) less than 3.
Of these numbers, count how many perfect squares there are.
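The sample task above can be checked with a few lines of code. The following is a minimal sketch, not the benchmark's own reference solution; the function names `is_perfect_square` and `solve_sample_task` are hypothetical. It treats "common logarithm less than 3" as the equivalent integer test `n < 1000`, which avoids floating-point rounding.

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True when n is the square of a non-negative integer."""
    if n < 0:
        return False
    root = math.isqrt(n)
    return root * root == n


def solve_sample_task(start: int = 7, stop: int = 362) -> int:
    # Step 1: the integers from start to stop, inclusive.
    numbers = range(start, stop + 1)
    # Step 2: keep numbers whose base-10 logarithm is below 3.
    # For positive integers, log10(n) < 3 is the same as n < 1000.
    kept = [n for n in numbers if n < 10 ** 3]
    # Step 3: count the perfect squares among the numbers that remain.
    return sum(1 for n in kept if is_perfect_square(n))


print(solve_sample_task())  # prints 17
```

Here the condition keeps every number in the range, so the answer is the count of squares from 3² = 9 through 19² = 361, which is 17.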
Methodology
metric
accuracy@1 across all 250 questions.
ground truth
Verifiable answers provided programmatically alongside each question; no LLM judge required.
task set
250 compositional sequence reasoning questions spanning multiple operator complexity levels.
future work
Exploring code generation as an alternative path: models that cannot answer directly may succeed by writing executable programs.
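The metric and ground-truth entries above can be illustrated with a small scoring function. This is a sketch under stated assumptions: the answers are taken to be integers, and the normalization rule (strip whitespace and thousands separators, count unparseable answers as wrong) is hypothetical rather than the benchmark's documented procedure. The name `accuracy_at_1` is also an illustration.

```python
def accuracy_at_1(predictions: list[str], ground_truth: list[int]) -> float:
    """Fraction of questions whose single answer equals the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("at least one question is required")

    correct = 0
    for raw, expected in zip(predictions, ground_truth):
        try:
            # Assumed normalization: trim whitespace, drop thousands separators.
            value = int(raw.strip().replace(",", ""))
        except ValueError:
            continue  # an unparseable answer is scored as incorrect
        if value == expected:
            correct += 1
    return correct / len(ground_truth)


score = accuracy_at_1(["17", " 192 ", "n/a", "1,000"], [17, 192, 5, 1000])
print(score)  # prints 0.75
```

Because each answer is compared directly with a programmatically computed value, no judge model is involved in scoring.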
Behind the benchmark
All questions in SnorkelSequences require the LLM to compute the outcome of a function or operator applied to a sequence of numbers. For example: “How many digits are there in the numbers between and including 1 and 100?” A procedural data generation process lets us create problems with a verifiable ground-truth answer and parameterize question complexity. For SnorkelSequences, we control complexity through the following parameters:
01
Size of the range
Larger ranges require the LLM to reason over and track a greater number of elements in each sequence.
02
Number of intermediate components
The number of simple tasks can be controlled by introducing components such as conditions and transforms on the original sequence.
03
Operator complexity
The core of each question is an “operator” that instructs the LLM to compute a numerical value over the final sequence (e.g., “count the number of digits”). We hypothesize that different families of operators pose different challenges to LLMs.
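The three parameters above can be sketched as inputs to a toy generator that emits a question together with its ground-truth answer. Everything here is an assumption made for illustration: the `CONDITIONS` and `OPERATORS` pools, the `generate_question` function, and the choice of start value are hypothetical and are not the benchmark's actual components.

```python
import random

# Hypothetical pools of intermediate components and operators (illustrative only).
CONDITIONS = {
    "keep only the even numbers": lambda n: n % 2 == 0,
    "keep only the numbers divisible by 3": lambda n: n % 3 == 0,
    "keep only the numbers greater than 10": lambda n: n > 10,
}
OPERATORS = {
    "count how many numbers remain": len,
    "count the total number of digits": lambda seq: sum(len(str(n)) for n in seq),
    "compute their sum": sum,
}


def generate_question(rng: random.Random, range_size: int,
                      num_components: int, operator_name: str) -> tuple[str, int]:
    """Build a question string and its programmatically computed answer."""
    # Parameter 1: size of the range.
    start = rng.randint(1, 50)
    stop = start + range_size - 1
    sequence = list(range(start, stop + 1))
    lines = [f"Consider the integers from {start} to {stop}, inclusive."]

    # Parameter 2: number of intermediate components (conditions here).
    steps = rng.sample(list(CONDITIONS), k=num_components)
    for i, description in enumerate(steps):
        sequence = [n for n in sequence if CONDITIONS[description](n)]
        prefix = "First" if i == 0 else "Then"
        lines.append(f"{prefix}, {description}.")

    # Parameter 3: the operator applied to the final sequence.
    lines.append(f"Of these numbers, {operator_name}.")
    answer = OPERATORS[operator_name](sequence)
    return "\n".join(lines), answer


question, answer = generate_question(random.Random(0), 20, 2,
                                     "count the total number of digits")
print(question)
print(answer)
```

Applied to the integers 1 through 100 with no conditions, the digit-counting operator in this sketch returns 192 (9 one-digit numbers, 90 two-digit numbers, and one three-digit number), which is the answer to the example question quoted earlier in this section.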

