SnorkelSequences

A procedurally generated, expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.

Overview

The rise of Reinforcement Learning with Verifiable Rewards (RLVR) as an LLM post-training paradigm has produced a new wave of models that are highly capable in mathematics. With SnorkelSequences, we evaluate not only mathematical problem solving but also the compositional capabilities of LLMs: strong models must be able to solve unseen problems that are composed of a series of simpler tasks. The benchmark is part of our wider series of procedurally generated, expert-verified reasoning datasets.

All questions in SnorkelSequences require the LLM to compute the outcome of a function or operator applied to a sequence of numbers. For example: "How many digits are there in the numbers between and including 1 and 100?". Procedural data generation lets us create problems with a verifiable ground-truth answer and parameterize question complexity. For SnorkelSequences, we control complexity through the following parameters:

Size of the range - Larger ranges require the LLM to reason over and track a greater number of elements in each sequence.

Number of intermediate components - The number of simple tasks can be controlled by introducing components such as conditions and transforms on the original sequence.

Operator complexity - The core of each question is an "operator" that instructs the LLM to compute a numerical value over the final sequence (e.g., "count the number of digits"). We hypothesize that different families of operators pose different challenges to LLMs.
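To make the three parameters concrete, the following is a minimal sketch of how a range, a set of intermediate components, and an operator could be composed into a problem with a computable ground truth. The benchmark's actual generator is not shown here, so every name below (SequenceProblem, count_digits, ground_truth) is hypothetical, and applying conditions before transforms is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Hypothetical structure for illustration only; not the benchmark's real generator.
@dataclass
class SequenceProblem:
    start: int                      # first integer in the range
    end: int                        # last integer in the range (inclusive)
    conditions: List[Callable[[int], bool]] = field(default_factory=list)
    transforms: List[Callable[[int], int]] = field(default_factory=list)
    operator: Callable[[List[int]], int] = len

    def ground_truth(self) -> int:
        """Compute the verifiable answer by executing each component in turn."""
        seq = list(range(self.start, self.end + 1))
        # Assumption: conditions (filters) run before transforms (maps).
        for keep in self.conditions:
            seq = [n for n in seq if keep(n)]
        for transform in self.transforms:
            seq = [transform(n) for n in seq]
        return self.operator(seq)


def count_digits(seq: List[int]) -> int:
    """Operator: total number of decimal digits across the sequence."""
    return sum(len(str(abs(n))) for n in seq)


# The example from the text: digits in the numbers from 1 to 100 inclusive.
simple = SequenceProblem(start=1, end=100, operator=count_digits)
print(simple.ground_truth())  # 192

# A harder variant with one condition and one transform added.
harder = SequenceProblem(
    start=1,
    end=100,
    conditions=[lambda n: n % 2 == 0],   # keep even numbers
    transforms=[lambda n: n * n],        # square each number
    operator=count_digits,
)
print(harder.ground_truth())
```

Widening the range, appending more conditions or transforms, or swapping the operator each raises difficulty along one of the axes listed above while the answer stays mechanically checkable.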

Data Sample

The initial version of this benchmark includes 250 complex samples, with questions covering diverse combinations of operators, ranges, conditions, and transforms. The following is an example:

question

Consider the integers from 7 to 362, inclusive. First, keep only the numbers that have common logarithm (base 10) less than 3. Of these numbers, count how many perfect squares there are.

solution

First, consider the range from 7 to 362 inclusive.

A common logarithm less than 3 corresponds to keeping numbers less than 10^3 (1,000). All numbers in the original range are less than 1,000.

The smallest perfect square (k^2) in the range is 9 (k=3) and the largest is 361 (k=19). Therefore there are 19 - 3 + 1 = 17 perfect squares in the range. The solution is 17.
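The sample answer can be checked by brute force. This short sketch is not part of the dataset; it rewrites the logarithm condition as the equivalent integer comparison n < 10^3 to avoid floating-point edge cases.

```python
import math


def is_perfect_square(n: int) -> bool:
    """Return True if n is the square of a non-negative integer."""
    root = math.isqrt(n)
    return root * root == n


numbers = range(7, 363)  # integers from 7 to 362, inclusive

# log10(n) < 3 is equivalent to n < 1000 for positive integers.
kept = [n for n in numbers if n < 10**3]

squares = [n for n in kept if is_perfect_square(n)]
print(len(squares))              # 17
print(squares[0], squares[-1])   # 9 361
```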

Evaluation Methodology

Evaluation uses the verifiable ground truth answer provided alongside each question. The following results report the accuracy@1 across all 250 questions.
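As an illustration of the metric, accuracy@1 can be computed as the fraction of questions whose single model answer exactly matches the ground truth. The sketch below assumes answers have already been parsed into integers; how answers are extracted from model output is not described here, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def accuracy_at_1(predictions: Sequence[int], ground_truths: Sequence[int]) -> float:
    """Fraction of questions where the one submitted answer equals the ground truth."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        raise ValueError("at least one question is required")
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Toy data for illustration; these are not benchmark results.
truths = [17, 192, 5, 40]
preds = [17, 190, 5, 40]
print(f"{accuracy_at_1(preds, truths):.1%}")  # 75.0%
```

With 250 questions, each correct answer is worth 0.4 percentage points, so 194 correct answers correspond to 77.6%.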

In follow-up work, we will explore the ability of LLMs to solve compositional tasks via code generation — while a model may not be able to answer a natural language question directly, perhaps it can generate executable code that has a higher likelihood of producing the correct answer.

Model                             Accuracy@1
gpt-5                             77.6%
gpt-5-mini                        77.6%
gpt-5-nano                        72.0%
o3-mini                           71.2%
Gemini 2.5 Flash                  70.8%
Claude Sonnet 4                   70.4%
Grok 4 Fast Reasoning             70.2%
o4-mini                           68.8%
NVIDIA Nemotron Super 49B v1.5    66.8%
Gemini 2.5 Pro                    66.0%
Claude Opus 4                     65.6%
(unnamed in source)               65.2%
Grok 4                            63.2%
Llama 4 Maverick                  62.0%
Nova Premier                      51.8%
Llama 4 Scout                     48.4%
Claude Sonnet 3.7                 47.6%
Magistral Medium                  47.6%
NVIDIA Nemotron Super 49B         44.8%
Nova Pro                          41.2%
Nova Lite                         40.0%
Grok 3                            39.2%
Llama 3.3 70B                     38.8%
Mistral Large                     38.8%
Codestral                         38.4%
GPT-4.1                           36.8%
Nvidia 70B Instruct               36.4%
Kimi-K2-Thinking                  36.0%
Llama 3.1 405B                    35.2%
Nova Micro                        33.6%
Qwen 3 235B                       28.0%
