Released April 04, 2026
Archived

SnorkelSequences

A procedurally-generated and expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
Overview
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm in LLM post-training and has produced a new wave of models that are highly capable in mathematics. With our new SnorkelSequences benchmark, we are interested in evaluating not only mathematical problem solving but also the compositional capabilities of LLMs. Strong models must be able to solve unseen problems that are composed of a series of simpler tasks. This benchmark, part of our wider series of procedurally-generated and expert-verified datasets focused on reasoning, allows both mathematical and compositional capabilities to be evaluated.

Leaderboard

Rank  Model                           Score
1     gpt-5                           77.6%
2     gpt-5-mini                      77.6%
3     gpt-5-nano                      72%
4     GPT-5.4                         71.6%
5     o3-mini                         71.2%
6     Gemini 2.5 Flash                70.8%
7     Claude Sonnet 4                 70.4%
8     Grok 4 Fast Reasoning           70.2%
9     o4 mini                         68.8%
10    NVIDIA Nemotron Super 49B v1.5  66.8%
11    Gemini 2.5 Pro                  66%
12    Claude Opus 4                   65.6%
13    o3                              65.2%
14    Grok 4                          63.2%
15    Llama 4 Maverick                62%
16    Nova Premier                    51.8%
17    Llama 4 Scout                   48.4%
18    Claude Sonnet 3.7               47.6%
19    Magistral Medium                47.6%
20    NVIDIA Nemotron Super 49B       44.8%
21    Nova Pro                        41.2%
22    Nova Lite                       40%
23    Grok 3                          39.2%
24    Llama 3.3 70B                   38.8%
25    Mistral Large                   38.8%
26    Codestral                       38.4%
27    GPT-4.1                         36.8%
28    Nvidia 70B Instruct             36.4%
29    Kimi-K2-Thinking                36%
30    Llama 3.1 405B                  35.2%
31    Nova Micro                      33.6%
32    Qwen 3 235B                     28%

Sample task

The initial version of this benchmark includes 250 complex samples, with questions covering diverse combinations of operators, ranges, conditions, and transforms. The following is an example:

Consider the integers from 7 to 362, inclusive.

First, keep only the numbers that have common logarithm (base 10) less than 3.

Of these numbers, count how many perfect squares there are.
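
To make the compositional structure of the sample concrete, here is a minimal Python sketch that works through it step by step: build the range, apply the condition, then apply the operator. The function names are hypothetical and chosen for illustration only, and the printed value is our own computation rather than a ground truth quoted from the benchmark. The logarithm condition is checked as n < 1000, which is equivalent to log10(n) < 3 for positive integers and avoids floating-point edge cases.

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True if n is a non-negative perfect square."""
    if n < 0:
        return False
    root = math.isqrt(n)  # exact integer square root, no float rounding
    return root * root == n


def solve_sample_task(lo: int = 7, hi: int = 362) -> int:
    # Step 1: the inclusive range of integers.
    numbers = range(lo, hi + 1)
    # Step 2 (condition): keep n with log10(n) < 3, i.e. n < 10**3.
    # For this range every number passes, so the filter removes nothing.
    kept = [n for n in numbers if n < 10**3]
    # Step 3 (operator): count the perfect squares among the kept numbers.
    return sum(1 for n in kept if is_perfect_square(n))


print(solve_sample_task())  # 17 (the squares 3**2 = 9 through 19**2 = 361)
```

The condition in this sample turns out to be a no-op, so a model has to notice that the filter changes nothing before counting.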

Methodology

Metric
accuracy@1 across all 250 questions.
Ground truth
Verifiable answers provided programmatically alongside each question; no LLM judge required.
Task set
250 compositional sequence reasoning questions spanning multiple operator complexity levels.
Future work
Exploring code generation as an alternative path: models that cannot answer directly may succeed by writing executable programs.
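
As a rough illustration of how accuracy@1 against programmatic ground truth could be scored without an LLM judge, the sketch below compares one response per question to an integer answer. The answer format, the `parse_answer` normalizer, and the exact-match rule are all assumptions made for this example; the benchmark's actual extraction and matching logic is not described here.

```python
from typing import Optional, Sequence


def parse_answer(raw: str) -> Optional[int]:
    """Hypothetical normalizer: read a model's final answer as an integer.

    Returns None when the text cannot be parsed, which is scored as wrong.
    """
    try:
        return int(raw.strip().replace(",", ""))
    except ValueError:
        return None


def accuracy_at_1(responses: Sequence[str], ground_truth: Sequence[int]) -> float:
    """Fraction of questions whose single response matches the ground truth."""
    if len(responses) != len(ground_truth):
        raise ValueError("expected exactly one response per question")
    if not ground_truth:
        raise ValueError("cannot score an empty question set")
    correct = sum(parse_answer(r) == t for r, t in zip(responses, ground_truth))
    return correct / len(ground_truth)


# Toy data, not taken from the benchmark.
responses = ["17", " 192 ", "about twenty", "1,024"]
truth = [17, 192, 20, 1024]
print(accuracy_at_1(responses, truth))  # 0.75
```

Under this scoring rule, 194 correct answers out of 250 questions would correspond to an accuracy of 77.6%.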

Behind the benchmark

All questions in SnorkelSequences require the LLM to compute the outcome of a function or operator applied to a sequence of numbers. For example: “How many digits are there in the numbers between and including 1 and 100?” Using a procedural data generation process allows us to create problems with verifiable ground-truth answers and to parameterize question complexity. For SnorkelSequences, we control complexity through the following parameters:
01 Size of the range
Larger ranges require the LLM to reason over and track a greater number of elements in each sequence.
02 Number of intermediate components
The number of simple tasks can be controlled by introducing components such as conditions and transforms on the original sequence.
03 Operator complexity
The core of each question is an “operator” that instructs the LLM to compute a numerical value over the final sequence (e.g., “count the number of digits”). We hypothesize that different families of operators pose different challenges to LLMs.
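
The three parameters above can be sketched as a small procedural generator. This is an illustrative toy, not the SnorkelSequences generator: the component list, the operator list, the question wording, and names such as `generate_question` are all hypothetical. It shows how a question and its verifiable answer can be produced together, with range size and number of components as explicit knobs.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple


def count_digits(xs: Iterable[int]) -> int:
    """Operator: total number of decimal digits across all numbers."""
    return sum(len(str(abs(x))) for x in xs)


# Hypothetical intermediate components: conditions (filters) and transforms.
COMPONENTS: List[Tuple[str, Callable[[List[int]], List[int]]]] = [
    ("keep only the even numbers", lambda xs: [x for x in xs if x % 2 == 0]),
    ("keep only the numbers divisible by 3", lambda xs: [x for x in xs if x % 3 == 0]),
    ("replace each number with its square", lambda xs: [x * x for x in xs]),
    ("add 1 to each number", lambda xs: [x + 1 for x in xs]),
]

# Hypothetical operators that reduce the final sequence to one number.
OPERATORS: Dict[str, Callable[[List[int]], int]] = {
    "count how many numbers remain": len,
    "compute the sum of the numbers": sum,
    "count the total number of digits in the numbers": count_digits,
}


def generate_question(
    rng: random.Random, range_size: int, n_components: int
) -> Tuple[str, int]:
    """Return (question text, ground-truth answer) for one sampled question."""
    if range_size < 1 or n_components < 0:
        raise ValueError("range_size must be >= 1 and n_components >= 0")
    lo = rng.randint(1, 100)
    hi = lo + range_size - 1  # parameter 01: size of the range
    seq = list(range(lo, hi + 1))
    lines = [f"Consider the integers from {lo} to {hi}, inclusive."]
    for i in range(n_components):  # parameter 02: intermediate components
        text, fn = rng.choice(COMPONENTS)
        seq = fn(seq)
        lines.append(f"{'First' if i == 0 else 'Next'}, {text}.")
    op_text, op = rng.choice(list(OPERATORS.items()))  # parameter 03: operator
    lines.append(f"Finally, {op_text}.")
    # The answer is computed by running the same steps, so it is verifiable.
    return "\n".join(lines), op(seq)


question, answer = generate_question(random.Random(0), range_size=50, n_components=2)
print(question)
print("Ground truth:", answer)

# The example question from the text: digits in the numbers from 1 to 100.
print(count_digits(range(1, 101)))  # 192
```

Because the answer comes from executing the same steps that produce the question text, each generated item carries its own ground truth, and difficulty can be varied by changing `range_size`, `n_components`, or the operator pool.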