Archived

SnorkelGraph

A procedurally-generated and expert-verified benchmark for evaluating mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.

Overview

Many real-world applications require an LLM to perform multi-hop, long-context reasoning over an input and, in the process, uncover the explicit and implicit relationships that typically exist within unstructured text. With our new SnorkelGraph benchmark, we evaluate these capabilities when reasoning across more formal structures with verifiable ground truth. As part of our wider series of procedurally generated, expert-verified reasoning datasets, this benchmark evaluates both multi-hop and mathematical reasoning capabilities.

Leaderboard

Rank	Model	Score
1	GPT-5.4	84.5%
2	Grok 4 Fast Reasoning	75%
3	o4 mini	75%
4	gpt-5-mini	72.5%
5	gpt-5	72%
6	o3	71.5%
7	o3-mini	71%
8	Claude Opus 4	64.5%
9	Grok 3	64%
10	GPT-4.1	63%
11	gpt-5-nano	62.5%
12	Qwen 3 235B	61.5%
13	Grok 4	61%
14	Claude Sonnet 4	58%
15	Gemini 2.5 Pro	58%
16	Gemini 2.5 Flash	55%
17	Magistral Medium	53.5%
18	Claude Sonnet 3.7	50%
19	Nova Premier	34.5%
20	Llama 4 Maverick	34%
21	Mistral Large	30%
22	Nvidia nemotron super 49B	29%
23	Nova Pro	28%
24	Llama 4 Scout	26%
25	Codestral	24.5%
26	Llama 3.3 70B	23.5%
27	Nvidia 70B Instruct	22.5%
28	Llama 3.1 405B	20.5%
29	Nova Lite	19%
30	Nova Micro	17.5%
31	Command R+	15%
32	Command-Light	10.5%
33	Command	10%

Sample task

The initial version of this benchmark contains 200 QA pairs, with questions covering diverse combinations of operators, ranges, and conditions. The following is an example:

Find the minimum vertex cover of an undirected graph with 15 nodes. Find the smallest set of vertices that cover all edges.

Graph:
Nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

Edges: [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]
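
For readers who want to check the sample task by hand, the following is a minimal brute-force sketch, not the benchmark's reference solver. The function names are illustrative, and it assumes that a self-loop such as (0, 0) can only be covered by its own vertex. Exhaustive search is practical here only because the graph has 15 nodes.

```python
from itertools import combinations

# Node and edge lists copied from the sample task above.
NODES = list(range(15))
EDGES = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3),
         (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14),
         (3, 13), (4, 6), (4, 12), (8, 8)]


def is_vertex_cover(cover, edges):
    """Return True if every edge has at least one endpoint in `cover`."""
    chosen = set(cover)
    return all(u in chosen or v in chosen for u, v in edges)


def minimum_vertex_covers(nodes, edges):
    """Exhaustively collect every vertex cover of the smallest possible size."""
    for size in range(len(nodes) + 1):
        covers = [set(candidate) for candidate in combinations(nodes, size)
                  if is_vertex_cover(candidate, edges)]
        if covers:
            # Stop at the first size that admits any cover: that size is minimal.
            return covers
    return []


if __name__ == "__main__":
    for cover in minimum_vertex_covers(NODES, EDGES):
        print(sorted(cover))
```

Under these assumptions the search finds two minimum covers of size 5, {0, 2, 3, 4, 8} and {0, 2, 4, 8, 13}: the self-loops force nodes 0, 2, and 8 into any cover, node 4 covers both of its remaining edges, and either endpoint of (3, 13) completes the set. A question like this therefore has a set of allowable answers rather than a single one.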

Methodology

Metric

Accuracy@1 across all 200 questions, compared against verifiable ground-truth answer sets.
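
As an illustration of scoring against answer sets, here is a small sketch of how accuracy@1 could be computed. The function name, the use of frozensets to represent answers, and the choice to count a missing answer (`None`) as incorrect are assumptions made for this example, not details taken from the benchmark's scoring code.

```python
def accuracy_at_1(predictions, allowed_answers):
    """Fraction of questions whose single prediction is in that question's allowed set.

    predictions: one answer per question, or None if no final answer was produced.
    allowed_answers: one set of acceptable answers per question.
    """
    if len(predictions) != len(allowed_answers):
        raise ValueError("predictions and allowed_answers must have the same length")
    if not predictions:
        return 0.0
    correct = sum(
        1
        for prediction, allowed in zip(predictions, allowed_answers)
        if prediction is not None and prediction in allowed
    )
    return correct / len(predictions)


if __name__ == "__main__":
    predictions = [frozenset({0, 2, 3, 4, 8}), None, frozenset({1, 2})]
    allowed_answers = [
        {frozenset({0, 2, 3, 4, 8}), frozenset({0, 2, 4, 8, 13})},  # two valid answers
        {frozenset({5})},
        {frozenset({1, 3})},
    ]
    print(accuracy_at_1(predictions, allowed_answers))
```

With 200 questions, each question is worth 0.5 percentage points, which is why every leaderboard score is a multiple of 0.5%.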

Token Budget

16,384 tokens max (default parameters). One prompt per question; full chain-of-thought plus final answer required.

Answer Parsing

Final answers are reformatted into a canonical representation by a secondary LLM (GPT-4o), then validated by a programmatic graph validator.
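
The internals of the graph validator are not described here, so the following is only a sketch of what a programmatic check might look like for a minimum vertex cover question. The canonical format (a bracketed list of node ids), the function name, and the `optimal_size` parameter are all assumptions made for illustration.

```python
import ast


def validate_vertex_cover_answer(canonical, nodes, edges, optimal_size):
    """Check a canonical answer string such as "[0, 2, 3, 4, 8]".

    Returns True only if the string parses to a collection of known node ids
    that covers every edge and has exactly `optimal_size` distinct vertices.
    """
    try:
        parsed = ast.literal_eval(canonical)
    except (ValueError, SyntaxError, TypeError):
        return False  # not a Python-style literal at all
    if not isinstance(parsed, (list, tuple, set)):
        return False
    if not all(isinstance(item, int) for item in parsed):
        return False
    cover = set(parsed)
    if not cover <= set(nodes):
        return False  # mentions a vertex that is not in the graph
    if len(cover) != optimal_size:
        return False  # a valid cover, perhaps, but not a minimum one
    return all(u in cover or v in cover for u, v in edges)


if __name__ == "__main__":
    nodes = list(range(15))
    edges = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3),
             (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14),
             (3, 13), (4, 6), (4, 12), (8, 8)]
    print(validate_vertex_cover_answer("[0, 2, 4, 8, 13]", nodes, edges, 5))
```

Checking validity and size, rather than comparing against a single stored answer, lets a validator of this kind accept any correct answer when several exist.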

Task Set

200 graph reasoning problems, including combinatorial and structural tasks such as minimum vertex cover.

Note on Grok 4 results

Grok 4's accuracy is 61% under the default 16,384-token limit. At this setting, the model did not produce final answers in a significant portion of cases. With a higher 65,536-token budget, accuracy rose to 69.5%. We report the default result for comparability with the other models, but note this sensitivity to token limits for context.

Behind the benchmark

All questions in the SnorkelGraph dataset require the LLM to compute the outcome of a natural language question (an operator) asked over a graph structure encoded in natural language through node and edge lists. For example: “Find the minimum density subgraph over …”. Using a procedural data generation process allows us to create problems with a verifiable ground-truth answer (or set of allowable answers), which is confirmed by experts, and to parameterize question complexity. For SnorkelGraph, we control complexity through the following parameters:

Size of the graph

Larger graphs (either by number of nodes or edges) require the LLM to reason over and track a greater number of elements to answer a question.

Operator complexity

The core of each question is an "operator" that instructs the LLM to compute a function over a graph. We hypothesize that different families of operators pose different challenges to LLMs.
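
To make the two complexity parameters concrete, here is a minimal sketch of a procedural generator, written under several assumptions: the random-graph model, the two example operators, and all function names are illustrative and are not taken from the SnorkelGraph pipeline. Unlike the sample task, this sketch never generates self-loops.

```python
import random


def generate_graph(num_nodes, edge_probability, seed):
    """Sample an undirected graph; each node pair becomes an edge with the given probability."""
    rng = random.Random(seed)  # seeded so the same parameters reproduce the same graph
    nodes = list(range(num_nodes))
    edges = [(u, v) for u in nodes for v in nodes[u + 1:]
             if rng.random() < edge_probability]
    return nodes, edges


def max_degree(nodes, edges):
    """Largest number of edges incident to any single node."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree.values(), default=0)


def count_components(nodes, edges):
    """Number of connected components, computed with a simple union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(node) for node in nodes})


# Hypothetical operator registry: phrase used in the prompt -> ground-truth function.
OPERATORS = {
    "maximum degree": max_degree,
    "number of connected components": count_components,
}


def make_question(num_nodes, edge_probability, operator_name, seed):
    """Build one prompt and its programmatically computed ground-truth answer."""
    nodes, edges = generate_graph(num_nodes, edge_probability, seed)
    prompt = (f"Find the {operator_name} of an undirected graph with "
              f"{num_nodes} nodes.\n\nGraph:\nNodes: {nodes}\n\nEdges: {edges}")
    return {"prompt": prompt, "answer": OPERATORS[operator_name](nodes, edges)}


if __name__ == "__main__":
    question = make_question(8, 0.3, "number of connected components", seed=0)
    print(question["prompt"])
    print("Ground truth:", question["answer"])
```

In a setup like this, raising `num_nodes` or `edge_probability` increases graph size, while swapping the operator changes the kind of computation required, so the two parameters can be varied independently.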
