Archived

SnorkelSequences

A procedurally generated, expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.

Overview

The rise of Reinforcement Learning with Verifiable Rewards (RLVR) as a paradigm in LLM post-training has produced a new wave of models that are highly capable in mathematics. With our new SnorkelSequences benchmark, we are interested in evaluating not only mathematical problem solving but also the compositional capabilities of LLMs: strong models must be able to solve unseen problems that are formed from a series of simpler tasks. This benchmark, part of our wider series of procedurally generated, expert-verified datasets focused on reasoning, allows both capabilities to be evaluated together.

Leaderboard

Rank	Model	Score
1	gpt-5	77.6%
2	gpt-5-mini	77.6%
3	gpt-5-nano	72.0%
4	GPT-5.4	71.6%
5	o3-mini	71.2%
6	Gemini 2.5 Flash	70.8%
7	Claude Sonnet 4	70.4%
8	Grok 4 Fast Reasoning	70.2%
9	o4 mini	68.8%
10	NVIDIA Nemotron Super 49B v1.5	66.8%
11	Gemini 2.5 Pro	66.0%
12	Claude Opus 4	65.6%
13	o3	65.2%
14	Grok 4	63.2%
15	Llama 4 Maverick	62.0%
16	Nova Premier	51.8%
17	Llama 4 Scout	48.4%
18	Claude Sonnet 3.7	47.6%
19	Magistral Medium	47.6%
20	NVIDIA Nemotron Super 49B	44.8%
21	Nova Pro	41.2%
22	Nova Lite	40.0%
23	Grok 3	39.2%
24	Llama 3.3 70B	38.8%
25	Mistral Large	38.8%
26	Codestral	38.4%
27	GPT-4.1	36.8%
28	Nvidia 70B Instruct	36.4%
29	Kimi-K2-Thinking	36.0%
30	Llama 3.1 405B	35.2%
31	Nova Micro	33.6%
32	Qwen 3 235B	28.0%

Sample task

The initial version of this benchmark includes 250 complex samples, with questions covering diverse combinations of operators, ranges, conditions, and transforms. The following is an example:

Consider the integers from 7 to 362, inclusive.

First, keep only the numbers that have common logarithm (base 10) less than 3.

Of these numbers, count how many perfect squares there are.
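
Because every question has a programmatically verifiable answer, the sample task above can be checked with a few lines of code. The following is a minimal sketch, not the benchmark's reference implementation; the function names `is_perfect_square` and `solve_sample` are hypothetical and chosen only for illustration.

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True if n is the square of a non-negative integer."""
    if n < 0:
        return False
    root = math.isqrt(n)  # exact integer square root, no float rounding
    return root * root == n


def solve_sample(lo: int = 7, hi: int = 362) -> int:
    """Solve the sample task step by step (hypothetical helper)."""
    # Step 1: the integers from lo to hi, inclusive.
    numbers = range(lo, hi + 1)
    # Step 2: keep only numbers whose common logarithm is less than 3.
    kept = [n for n in numbers if math.log10(n) < 3]
    # Step 3: count the perfect squares among the remaining numbers.
    return sum(1 for n in kept if is_perfect_square(n))


print(solve_sample())  # 17
```

Here the logarithm condition keeps every number, since all values in the range are below 1000, so the answer is the count of squares from 3² = 9 to 19² = 361, which is 17. A model has to notice that the filter is a no-op and still carry out the final count correctly.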

Methodology

Metric

Accuracy@1 across all 250 questions: the share of questions answered correctly on a single attempt.

Ground truth

Verifiable answers provided programmatically alongside each question; no LLM judge required.

Task set

250 compositional sequence reasoning questions spanning multiple operator complexity levels.

Future work

Exploring code generation as an alternative path: models that cannot answer directly may succeed by writing executable programs.
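
The metric and ground-truth setup described above can be sketched as a simple exact-match scorer. This is an illustration under stated assumptions, not the benchmark's actual evaluation harness: it assumes each answer is an integer, that a model's final answer has already been parsed out of its response (with `None` for unparseable responses), and the name `accuracy_at_1` is hypothetical.

```python
from typing import Optional, Sequence


def accuracy_at_1(
    predictions: Sequence[Optional[int]],
    answers: Sequence[int],
) -> float:
    """Fraction of questions whose single predicted answer matches exactly.

    Assumption: one prediction per question; None means no parseable answer.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


# Toy data, not real benchmark results.
answers = [17, 192, 5, 40]
predictions = [17, 190, 5, None]
print(f"{accuracy_at_1(predictions, answers):.1%}")  # 50.0%
```

Because the comparison is an exact match against a precomputed value, no LLM judge is involved at any point in scoring.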

Behind the benchmark

All questions in SnorkelSequences require the LLM to compute the outcome of a function or operator applied to a sequence of numbers. For example: “How many digits are there in the numbers between and including 1 and 100?” Using a procedural data generation process allows us to create problems with a verifiable ground-truth answer and to parameterize question complexity. For SnorkelSequences, we control complexity through the following parameters:

Size of the range

Larger ranges require the LLM to reason over and track a greater number of elements in each sequence.

Number of intermediate components

Adding components such as conditions and transforms on the original sequence increases the number of simple tasks the LLM must chain together.

Operator complexity

The core of each question is an “operator” that instructs the LLM to compute a numerical value over the final sequence (e.g., “count the number of digits”). We hypothesize that different families of operators pose different challenges to LLMs.
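
To make the three parameters concrete, here is a minimal sketch of how a procedural generator of this kind could work. It is an assumption-laden illustration, not the actual SnorkelSequences generator: the component lists, the question wording, and the names `total_digits` and `generate_question` are all hypothetical.

```python
import random

# Hypothetical component pools; the real benchmark's components are not shown here.
CONDITIONS = [
    ("keep only the even numbers", lambda n: n % 2 == 0),
    ("keep only the numbers divisible by 3", lambda n: n % 3 == 0),
]
TRANSFORMS = [
    ("square each number", lambda n: n * n),
    ("add 1 to each number", lambda n: n + 1),
]
OPERATORS = [
    ("count the total number of digits", lambda seq: sum(len(str(n)) for n in seq)),
    ("compute the sum of the numbers", sum),
]


def total_digits(lo: int, hi: int) -> int:
    """Digits in all integers from lo to hi, inclusive (positive integers)."""
    return sum(len(str(n)) for n in range(lo, hi + 1))


def generate_question(rng: random.Random, range_size: int, n_components: int):
    """Return (question_text, ground_truth) for one generated question."""
    lo = rng.randint(1, 50)
    hi = lo + range_size - 1  # range_size controls the size of the range
    seq = list(range(lo, hi + 1))
    lines = [f"Consider the integers from {lo} to {hi}, inclusive."]
    for _ in range(n_components):  # number of intermediate components
        if rng.random() < 0.5:
            text, keep = rng.choice(CONDITIONS)
            seq = [n for n in seq if keep(n)]
        else:
            text, apply = rng.choice(TRANSFORMS)
            seq = [apply(n) for n in seq]
        lines.append(f"Then, {text}.")
    op_text, op = rng.choice(OPERATORS)  # operator applied to the final sequence
    lines.append(f"Finally, {op_text}.")
    # The ground truth is computed alongside the text, so it is verifiable.
    return "\n".join(lines), op(seq)


print(total_digits(1, 100))  # 192
question, answer = generate_question(random.Random(0), range_size=20, n_components=2)
print(question)
print(answer)
```

The first printed value answers the example question from the text: 9 one-digit numbers, 90 two-digit numbers, and the three digits of 100 give 192. Seeding the random generator keeps each question and its answer reproducible.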
