SnorkelSequences
Overview
The emergence of Reinforcement Learning with Verifiable Rewards (RLVR) as a paradigm in LLM post-training has produced a new wave of models that are highly capable in mathematics. With our new SnorkelSequences benchmark, we are interested not only in evaluating mathematical problem solving but also in the compositional capabilities of LLMs: strong models must be able to solve unseen problems that are formed from a series of simpler tasks. This benchmark, part of our wider series of procedurally generated, expert-verified datasets focused on reasoning, allows both mathematical and compositional capabilities to be evaluated.
All questions in SnorkelSequences require the LLM to compute the outcome of a function or operator applied to a sequence of numbers. For example: "How many digits are there in the numbers between and including 1 and 100?" Procedural data generation lets us create problems with a verifiable ground truth answer and parameterize question complexity. For SnorkelSequences, we control complexity through the following parameters:
Size of the range - Larger ranges require the LLM to reason over and track a greater number of elements in each sequence.
Number of intermediate components - The number of simple tasks can be controlled by introducing components such as conditions and transforms on the original sequence.
Operator complexity - The core of each question is an "operator" that instructs the LLM to compute a numerical value over the final sequence (e.g., "count the number of digits"). We hypothesize that different families of operators pose different challenges to LLMs.
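To make the three parameters concrete, here is a minimal Python sketch of how a question could be composed from a range, a list of intermediate steps, and a final operator. This is an illustration only, not the generator used to build the benchmark: the class `SequenceQuestion` and the helpers `count_digits`, `keep_even`, and `square` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Step = Callable[[List[int]], List[int]]


@dataclass
class SequenceQuestion:
    """Hypothetical container for one procedurally generated question."""
    start: int
    end: int  # inclusive upper bound; (end - start) controls the range size
    steps: List[Step] = field(default_factory=list)  # conditions and transforms
    operator: Callable[[List[int]], int] = len       # final value to compute

    def answer(self) -> int:
        """Compute the ground truth by applying each step, then the operator."""
        seq = list(range(self.start, self.end + 1))
        for step in self.steps:
            seq = step(seq)
        return self.operator(seq)


def count_digits(seq: List[int]) -> int:
    """Operator: total number of decimal digits across the sequence."""
    return sum(len(str(abs(n))) for n in seq)


def keep_even(seq: List[int]) -> List[int]:
    """Condition: keep only the even numbers."""
    return [n for n in seq if n % 2 == 0]


def square(seq: List[int]) -> List[int]:
    """Transform: replace each number by its square."""
    return [n * n for n in seq]


# The example question from the text: digits in the numbers 1..100.
simple = SequenceQuestion(1, 100, operator=count_digits)
print(simple.answer())  # 192 (9 one-digit, 90 two-digit, 1 three-digit number)

# Same range and operator, with two intermediate components added.
harder = SequenceQuestion(1, 100, steps=[keep_even, square], operator=count_digits)
print(harder.answer())
```

Because the answer is computed by the same pipeline that defines the question, every generated item comes with a verifiable ground truth, and each parameter can be varied independently.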
Model Comparison
Data Sample
The initial version of this benchmark includes 250 complex samples, with questions covering diverse combinations of operators, ranges, conditions, and transforms. The following is an example:
Question:
Consider the integers from 7 to 362, inclusive. First, keep only the numbers that have common logarithm (base 10) less than 3. Of these numbers, count how many perfect squares there are.
Solution:
First, consider the range from 7 to 362 inclusive.
A common logarithm less than 3 corresponds to keeping numbers less than 10^3 (1,000). All numbers in the original range are less than 1,000.
The perfect squares k^2 in the range run from k=3 (3^2 = 9, the smallest square that is at least 7) to k=19 (19^2 = 361, the largest square that is at most 362). That gives 19 - 3 + 1 = 17 perfect squares. The solution is 17.
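The solution can be checked by brute force. The short Python sketch below follows the steps of the question directly; the variable names are chosen for this example only.

```python
import math

# Integers from 7 to 362, inclusive.
numbers = range(7, 363)

# Condition: keep numbers whose common logarithm is less than 3.
kept = [n for n in numbers if math.log10(n) < 3]

# Operator: count the perfect squares among the remaining numbers.
squares = [n for n in kept if math.isqrt(n) ** 2 == n]

print(len(kept))     # 356, the condition removes nothing
print(len(squares))  # 17, from 9 (3^2) to 361 (19^2)
```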
Evaluation Methodology
Evaluation uses the verifiable ground truth answer provided alongside each question. The following results report the accuracy@1 across all 250 questions.
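As an illustration of the metric, the following Python sketch computes accuracy@1 from one model response per question. It is a minimal sketch under stated assumptions: that each response has already been reduced to a final answer string, and that answers are integers compared by exact match. The helpers `parse_int` and `accuracy_at_1` are hypothetical names, not part of any released evaluation harness.

```python
from typing import Optional, Sequence


def parse_int(text: str) -> Optional[int]:
    """Parse a final answer string as an integer (assumed answer format)."""
    try:
        return int(text.strip().replace(",", ""))
    except ValueError:
        return None  # unparseable responses count as incorrect


def accuracy_at_1(predictions: Sequence[str], answers: Sequence[int]) -> float:
    """Fraction of questions whose single response matches the ground truth."""
    if len(predictions) != len(answers):
        raise ValueError("need exactly one prediction per question")
    if not answers:
        raise ValueError("no questions to score")
    correct = sum(parse_int(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy usage with made-up responses (not benchmark results).
score = accuracy_at_1(["17", "190", "n/a", "1,000"], [17, 192, 5, 1000])
print(score)  # 0.5
```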
In follow-up work, we will explore the ability of LLMs to solve compositional tasks via code generation — while a model may not be able to answer a natural language question directly, perhaps it can generate executable code that has a higher likelihood of producing the correct answer.