Released April 04, 2026
Archived

SnorkelGraph

A procedurally-generated and expert-verified benchmark for evaluating mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.

Overview

Many real-world applications require an LLM to perform multi-hop, long-context reasoning over an input and, in the process, uncover the explicit and implicit relationships that typically exist within unstructured text. With our new SnorkelGraph benchmark, we evaluate these capabilities on more formal structures with verifiable ground truth. The benchmark is part of our wider series of procedurally generated, expert-verified reasoning datasets, and it allows both multi-hop and mathematical reasoning capabilities to be evaluated.

Leaderboard

Rank | Model | Score
1 | GPT-5.4 | 84.5%
2 | Grok 4 Fast Reasoning | 75%
3 | o4 mini | 75%
4 | gpt-5-mini | 72.5%
5 | gpt-5 | 72%
6 | o3 | 71.5%
7 | o3-mini | 71%
8 | Claude Opus 4 | 64.5%
9 | Grok 3 | 64%
10 | GPT-4.1 | 63%
11 | gpt-5-nano | 62.5%
12 | Qwen 3 235B | 61.5%
13 | Grok 4 | 61%
14 | Claude Sonnet 4 | 58%
15 | Gemini 2.5 Pro | 58%
16 | Gemini 2.5 Flash | 55%
17 | Magistral Medium | 53.5%
18 | Claude Sonnet 3.7 | 50%
19 | Nova Premier | 34.5%
20 | Llama 4 Maverick | 34%
21 | Mistral Large | 30%
22 | Nvidia nemotron super 49B | 29%
23 | Nova Pro | 28%
24 | Llama 4 Scout | 26%
25 | Codestral | 24.5%
26 | Llama 3.3 70B | 23.5%
27 | Nvidia 70B Instruct | 22.5%
28 | Llama 3.1 405B | 20.5%
29 | Nova Lite | 19%
30 | Nova Micro | 17.5%
31 | Command R+ | 15%
32 | Command-Light | 10.5%
33 | Command | 10%

Sample task

The initial version of this benchmark contains 200 QA pairs, with questions covering diverse combinations of operators, ranges, and conditions. The following is an example:

Find the minimum vertex cover of an undirected graph with 15 nodes. Find the smallest set of vertices that cover all edges.

Graph:
Nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

Edges: [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]
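
To make the sample task concrete, the following is a minimal brute-force sketch that searches for the smallest vertex covers of the graph above. It is an illustration only, not the benchmark's own solver or validator. The helper names `is_vertex_cover` and `minimum_vertex_covers` are hypothetical, and the sketch assumes that a self-loop such as (0, 0) can only be covered by its own vertex.

```python
from itertools import combinations

# Node and edge lists copied from the sample task above.
NODES = list(range(15))
EDGES = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3),
         (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14),
         (3, 13), (4, 6), (4, 12), (8, 8)]


def is_vertex_cover(cover, edges):
    """Return True if every edge has at least one endpoint in `cover`.

    Assumption: a self-loop (v, v) is covered only when v is chosen.
    """
    chosen = set(cover)
    return all(u in chosen or v in chosen for u, v in edges)


def minimum_vertex_covers(nodes, edges):
    """Exhaustively collect all smallest vertex covers.

    Tries subsets in order of increasing size, so the first size that
    yields any cover is the minimum. Only practical for small graphs.
    """
    for size in range(len(nodes) + 1):
        found = [set(c) for c in combinations(nodes, size)
                 if is_vertex_cover(c, edges)]
        if found:
            return found
    return []


covers = minimum_vertex_covers(NODES, EDGES)
print(len(covers[0]), covers)
```

Under the self-loop assumption, vertices 0, 2, and 8 are forced into every cover, and the search finds two covers of size 5: {0, 2, 3, 4, 8} and {0, 2, 4, 8, 13}. A task like this can therefore have more than one acceptable answer, which is why a set of allowable answers is useful.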

Methodology

Metric
Accuracy@1 across all 200 questions, compared against verifiable ground-truth answer sets.
Token Budget
16,384 tokens max (default parameters). One prompt per question; full chain-of-thought plus final answer required.
Answer Parsing
Final answers are reformatted into a canonical representation by a secondary LLM (GPT-4o), then validated by a programmatic graph validator.
Task Set
200 graph reasoning problems, including combinatorial and structural tasks such as minimum vertex cover.
Note on Grok 4 Results
Grok 4’s accuracy is 61% under the default 16,384-token limit. At this setting, the model did not produce final answers in a significant portion of cases. With a higher 65,536-token budget, accuracy rose to 69.5%. We report the default result for comparability, and note this sensitivity to token limits for context.
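
The validation and scoring steps described in this section can be sketched roughly as follows for a vertex-cover task. This is a simplified illustration under stated assumptions, not the actual validator: the function names `is_valid_answer` and `accuracy_at_1` are hypothetical, answers are assumed to already be in canonical list form (the secondary-LLM reformatting step is omitted), and a missing final answer is represented as `None` and scored as incorrect.

```python
def is_valid_answer(answer, edges, optimal_size):
    """Accept any vertex cover whose size matches the known optimum.

    Checking validity rather than exact equality lets every member of
    the allowable answer set pass.
    """
    chosen = set(answer)
    covers_all = all(u in chosen or v in chosen for u, v in edges)
    return covers_all and len(chosen) == optimal_size


def accuracy_at_1(results):
    """Percentage of questions whose single final answer validated."""
    return 100.0 * sum(results) / len(results)


# Toy graph (hypothetical, not from the benchmark): the path 0-1-2-3,
# whose minimum vertex covers have size 2.
edges = [(0, 1), (1, 2), (2, 3)]

# Four hypothetical model answers; None means no final answer was produced.
answers = [[1, 2], [0, 2], [0, 1, 2], None]

results = [a is not None and is_valid_answer(a, edges, optimal_size=2)
           for a in answers]
print(accuracy_at_1(results))  # 50.0: two valid, one too large, one missing
```

With 200 questions, each question is worth 0.5 percentage points, so 61% corresponds to 122 correct answers and 69.5% to 139.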

Behind the benchmark

All questions in the SnorkelGraph dataset require the LLM to compute the outcome of a natural-language question (an operator) asked over a graph structure that is encoded in natural language as node and edge lists. For example: “Find the minimum density subgraph over …”. Procedural data generation lets us create problems with a verifiable ground-truth answer (or set of allowable answers), which is confirmed by experts, and lets us parameterize question complexity. For SnorkelGraph, we control complexity through the following parameters:

01 Size of the graph
Larger graphs (either by number of nodes or edges) require the LLM to reason over and track a greater number of elements to answer a question.
02 Operator complexity
The core of each question is an "operator" that instructs the LLM to compute a function over a graph. We hypothesize that different families of operators pose different challenges to LLMs.
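
As a rough illustration of how these two parameters could drive procedural generation, the sketch below samples a random undirected graph of a chosen size and pairs it with an operator prompt. It is not the generator used for SnorkelGraph. The function names, the `edge_prob` and `self_loop_prob` parameters, and the seeding scheme are all assumptions made for this example.

```python
import random


def generate_graph(num_nodes, edge_prob, self_loop_prob=0.0, seed=0):
    """Sample an undirected graph as (nodes, edges) lists.

    Hypothetical parameters: `num_nodes` and `edge_prob` control graph
    size; `self_loop_prob` optionally adds self-loops like (v, v).
    A fixed seed makes each generated problem reproducible.
    """
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    edges = []
    for u in nodes:
        if rng.random() < self_loop_prob:
            edges.append((u, u))
        # Only consider v > u so each undirected edge appears once.
        for v in range(u + 1, num_nodes):
            if rng.random() < edge_prob:
                edges.append((u, v))
    return nodes, edges


def render_question(operator_text, nodes, edges):
    """Encode the operator and graph as plain text, as in the sample task."""
    return f"{operator_text}\n\nGraph:\nNodes: {nodes}\n\nEdges: {edges}"


nodes, edges = generate_graph(num_nodes=15, edge_prob=0.15,
                              self_loop_prob=0.1, seed=42)
print(render_question(
    "Find the minimum vertex cover of an undirected graph with 15 nodes.",
    nodes, edges))
```

Because the graph is built programmatically, a reference solver can compute the ground truth for each generated instance before expert review, and difficulty can be adjusted by changing the size parameters or swapping the operator text.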

