SnorkelGraph
A procedurally-generated and expert-verified benchmark for evaluating mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.
Many real-world applications require an LLM to perform multi-hop, long-context reasoning over an input and, in the process, uncover the explicit and implicit relationships that typically exist within unstructured text. With our new SnorkelGraph benchmark, we evaluate these capabilities on more formal structures with verifiable ground truth. The benchmark is part of our wider series of procedurally-generated, expert-verified reasoning datasets, and it evaluates both multi-hop and mathematical reasoning capabilities.
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 84.5% |
| 2 | Grok 4 Fast Reasoning | 75% |
| 3 | o4 mini | 75% |
| 4 | gpt-5-mini | 72.5% |
| 5 | gpt-5 | 72% |
| 6 | o3 | 71.5% |
| 7 | o3-mini | 71% |
| 8 | Claude Opus 4 | 64.5% |
| 9 | Grok 3 | 64% |
| 10 | GPT-4.1 | 63% |
| 11 | gpt-5-nano | 62.5% |
| 12 | Qwen 3 235B | 61.5% |
| 13 | Grok 4 | 61% |
| 14 | Claude Sonnet 4 | 58% |
| 15 | Gemini 2.5 Pro | 58% |
| 16 | Gemini 2.5 Flash | 55% |
| 17 | Magistral Medium | 53.5% |
| 18 | Claude Sonnet 3.7 | 50% |
| 19 | Nova Premier | 34.5% |
| 20 | Llama 4 Maverick | 34% |
| 21 | Mistral Large | 30% |
| 22 | Nvidia nemotron super 49B | 29% |
| 23 | Nova Pro | 28% |
| 24 | Llama 4 Scout | 26% |
| 25 | Codestral | 24.5% |
| 26 | Llama 3.3 70B | 23.5% |
| 27 | Nvidia 70B Instruct | 22.5% |
| 28 | Llama 3.1 405B | 20.5% |
| 29 | Nova Lite | 19% |
| 30 | Nova Micro | 17.5% |
| 31 | Command R+ | 15% |
| 32 | Command-Light | 10.5% |
| 33 | Command | 10% |
Sample task


Find the minimum vertex cover of an undirected graph with 15 nodes. Find the smallest set of vertices that cover all edges.
Graph:
Nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Edges: [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]
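To make the sample task concrete, here is a minimal brute-force sketch of how a ground-truth answer for it could be computed. It is not the benchmark's own solver. The function names are hypothetical, and it assumes that a self-loop such as (0, 0) can only be covered by its own vertex.

```python
from itertools import combinations

# Graph copied from the sample task above.
nodes = list(range(15))
edges = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3),
         (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14),
         (3, 13), (4, 6), (4, 12), (8, 8)]


def is_vertex_cover(cover, edges):
    """Return True if every edge has at least one endpoint in `cover`.

    Assumption: a self-loop (v, v) is covered only when v is selected.
    """
    chosen = set(cover)
    return all(u in chosen or v in chosen for u, v in edges)


def minimum_vertex_covers(nodes, edges):
    """Enumerate subsets by increasing size and return all smallest covers.

    Exponential in the number of nodes, so only suitable for small graphs.
    """
    for size in range(len(nodes) + 1):
        found = [set(c) for c in combinations(nodes, size)
                 if is_vertex_cover(c, edges)]
        if found:
            return found
    return []


covers = minimum_vertex_covers(nodes, edges)
for cover in covers:
    print(sorted(cover))
```

Under these assumptions the search finds two covers of size 5, `[0, 2, 3, 4, 8]` and `[0, 2, 4, 8, 13]`. This shows how a single task can have a set of allowable answers rather than one unique answer.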
Methodology
Final answers are reformatted into a canonical representation by a secondary LLM (GPT-4o), then validated by a programmatic graph validator.
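As a rough illustration of the validation step, the sketch below checks a canonicalized answer to the sample vertex cover task. The canonical format (a bracketed list of integers), the function name `validate_vertex_cover`, and the `optimal_size` parameter are all assumptions made for this example; the actual validator and representation are not described here.

```python
import ast

# Graph from the sample task; optimal_size is assumed to be known
# from the procedural generation step.
nodes = list(range(15))
edges = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3),
         (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14),
         (3, 13), (4, 6), (4, 12), (8, 8)]


def validate_vertex_cover(canonical_answer, nodes, edges, optimal_size):
    """Check a canonicalized answer string such as "[0, 2, 3, 4, 8]".

    Hypothetical validator: accepts any valid cover of the optimal size,
    so every allowable answer passes, not only one reference answer.
    """
    try:
        parsed = ast.literal_eval(canonical_answer)
        cover = set(parsed)
    except (ValueError, SyntaxError, TypeError):
        return False  # not a parseable collection of hashable items
    if not isinstance(parsed, (list, tuple, set)):
        return False
    if not all(isinstance(v, int) for v in cover):
        return False
    if not cover <= set(nodes):
        return False  # mentions a vertex that is not in the graph
    if len(cover) != optimal_size:
        return False  # a valid cover, but not a minimum one
    return all(u in cover or v in cover for u, v in edges)


print(validate_vertex_cover("[0, 2, 3, 4, 8]", nodes, edges, optimal_size=5))
print(validate_vertex_cover("[0, 2, 4, 8]", nodes, edges, optimal_size=5))
```

Checking properties of the answer (validity and size) instead of comparing against a single stored solution is one way to handle tasks with several correct answers.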
Behind the benchmark
All questions in the SnorkelGraph dataset require the LLM to compute the outcome of a natural language question (an operator) asked over a graph structure encoded in natural language through node and edge lists. For example: “Find the minimum density subgraph over …”. Procedural data generation lets us create problems with a verifiable ground truth answer (or set of allowable answers), which is confirmed by experts, and lets us parameterize question complexity. For SnorkelGraph, we control complexity through the following parameters:

