SnorkelGraph
Overview
Many real-world applications require an LLM to perform multi-hop, long-context reasoning over an input, uncovering the explicit and implicit relationships that typically exist within unstructured text. With our new SnorkelGraph benchmark, we evaluate these capabilities when reasoning over more formal structures with verifiable ground truth. This new benchmark, part of our wider series of procedurally generated, expert-verified datasets focused on reasoning, allows both multi-hop and mathematical reasoning capabilities to be evaluated.
All questions in the SnorkelGraph dataset require the LLM to compute the answer to a natural language question (an operator) asked over a graph structure encoded in natural language through node and edge lists. For example: "Find the minimum density subgraph over …". A procedural data generation process lets us create problems with a verifiable ground-truth answer (or set of allowable answers), confirmed by experts, and lets us parameterize question complexity. For SnorkelGraph, we control complexity through the following parameters:
Size of the graph - Larger graphs (either by number of nodes or edges) require the LLM to reason over and track a greater number of elements to answer a question.
Operator complexity - The core of each question is an "operator" that instructs the LLM to compute a function over a graph. We hypothesize that different families of operators pose different challenges to LLMs.
Model Comparison
Data Sample
The initial version of this benchmark contains 200 QA pairs, with questions covering diverse combinations of operators, ranges, and conditions. The following is an example:

Question:
Find the minimum vertex cover of an undirected graph with 15 nodes. Find the smallest set of vertices that cover all edges.
Graph:
Nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Edges: [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]
Solution:
The minimum vertex cover is: [0, 2, 3, 4, 8]
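Because every answer is verifiable, the sample above can be checked programmatically. The following sketch (illustrative only, not the official validator) confirms that the claimed set touches every edge, including the self-loops, and shows by brute force over all 4-node subsets that no smaller cover exists at this graph size:

```python
from itertools import combinations

# Edges from the sample problem above; note the self-loops
# (0, 0), (2, 2), and (8, 8), which force those nodes into any cover.
edges = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2),
         (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11),
         (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]

def is_cover(cover):
    # A vertex cover must contain at least one endpoint of every edge.
    return all(u in cover or v in cover for u, v in edges)

claimed = {0, 2, 3, 4, 8}
assert is_cover(claimed)

# Minimality check: C(15, 4) = 1365 candidate sets is small enough
# to enumerate exhaustively at this problem size.
assert not any(is_cover(set(c)) for c in combinations(range(15), 4))
print("verified: minimum vertex cover of size", len(claimed))
```

Brute-force verification like this stops scaling quickly, which is exactly why the graph-size parameter controls difficulty: larger instances demand genuine reasoning rather than enumeration.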
Evaluation Methodology
The SnorkelGraph benchmark is evaluated against the provided, verifiable ground-truth answers (or complete set of allowable answers) for each of the 200 questions. Specifically, each model receives exactly one prompt with default sampling parameters and a maximum budget of 16,384 tokens, and is asked to generate its full chain of thought and a final answer. The final answer is then passed to a secondary LLM (GPT-4o), which reformats it into the canonical internal representation.
A programmatic validator built on standard graph libraries compares the parsed output against the expected answer set to determine correctness. Reported results are overall accuracy@1 across all 200 questions.
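As a minimal sketch of what that comparison step might look like (the helper name and interface here are hypothetical; the actual validator is internal), node-set answers can be canonicalized so that the order in which a model lists vertices never affects correctness:

```python
def is_correct(parsed_answer, allowed_answers):
    # Canonicalize a node-list answer as a frozenset so vertex
    # ordering is irrelevant, then test membership in the set of
    # allowable answers supplied with each question.
    canonical = frozenset(parsed_answer)
    return canonical in {frozenset(ans) for ans in allowed_answers}

# Example with the vertex-cover sample: any ordering of the same
# node set is accepted; a different set is rejected.
allowed = [[0, 2, 3, 4, 8]]
print(is_correct([8, 4, 3, 2, 0], allowed))  # True
print(is_correct([0, 1, 2, 3, 4], allowed))  # False
```

Questions with multiple valid answers (for instance, several distinct minimum vertex covers) fit naturally here, since each allowable answer becomes one entry in the comparison set.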
Note on Grok-4 Results:
Grok-4's accuracy is 61% under the default 16,384-token limit. At this setting, the model did not produce final answers in a significant portion of cases. With a higher 65,536-token budget, accuracy rose to 69.5%. We report the default result for comparability but note this sensitivity to token limits for context.