SnorkelGraph
Overview
Many real-world applications require an LLM to perform multi-hop, long-context reasoning over an input, uncovering the explicit and implicit relationships that typically exist within unstructured text. With our new SnorkelGraph benchmark, we evaluate these capabilities when reasoning over more formal structures with verifiable ground truth. This new benchmark, part of our wider series of procedurally generated, expert-verified datasets focused on reasoning, allows both multi-hop and mathematical reasoning capabilities to be evaluated.
All questions in the SnorkelGraph dataset require the LLM to compute the answer to a natural language question (an operator) asked over a graph structure encoded in natural language through node and edge lists. For example: "Find the minimum density subgraph over …". A procedural data generation process lets us create problems with a verifiable ground-truth answer (or set of allowable answers), confirmed by experts, and lets us parameterize question complexity. For SnorkelGraph, we control complexity through the following parameters:
Size of the graph - Larger graphs (either by number of nodes or edges) require the LLM to reason over and track a greater number of elements to answer a question.
Operator complexity - The core of each question is an "operator" that instructs the LLM to compute a function over a graph. We hypothesize that different families of operators pose different challenges to LLMs.
Model Comparison
Data Sample
The initial version of this benchmark contains 200 QA pairs, with questions covering diverse combinations of operators, ranges, and conditions. The following is an example:

Question:
Find the minimum vertex cover of an undirected graph with 15 nodes. Find the smallest set of vertices that cover all edges.
Graph:
Nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Edges: [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2), (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11), (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]
Solution:
The minimum vertex cover is: [0, 2, 3, 4, 8]
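Because every answer is verifiable, the sample above can be checked programmatically. The following sketch (illustrative only, not the official validator) confirms that the claimed set touches every edge, including the self-loops, and shows by brute force over all 4-node subsets that no smaller cover exists at this graph size:

```python
from itertools import combinations

# Edges from the sample problem above; note the self-loops
# (0, 0), (2, 2), and (8, 8), which force those nodes into any cover.
edges = [(0, 1), (0, 0), (0, 2), (0, 3), (0, 4), (1, 2), (2, 2),
         (2, 3), (2, 5), (2, 7), (2, 8), (2, 9), (2, 10), (2, 11),
         (2, 14), (3, 13), (4, 6), (4, 12), (8, 8)]

def is_cover(cover):
    # A vertex cover must contain at least one endpoint of every edge.
    return all(u in cover or v in cover for u, v in edges)

claimed = {0, 2, 3, 4, 8}
assert is_cover(claimed)

# Minimality check: C(15, 4) = 1365 candidate sets is small enough
# to enumerate exhaustively at this problem size.
assert not any(is_cover(set(c)) for c in combinations(range(15), 4))
print("verified: minimum vertex cover of size", len(claimed))
```

Brute-force verification like this stops scaling quickly, which is exactly why the graph-size parameter controls difficulty: larger instances demand genuine reasoning rather than enumeration.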
Evaluation Methodology
The SnorkelGraph benchmark is evaluated against the provided, verifiable ground-truth answers (or complete set of allowable answers) for each of the 200 questions. Specifically, each model receives exactly one prompt with default sampling parameters and a maximum budget of 16,384 tokens, and is asked to generate its full chain of thought and a final answer. The final answer is then passed to a secondary LLM (GPT-4o), which reformats it into the canonical internal representation.
A programmatic validator built on standard graph libraries compares the parsed output against the expected answer set to determine correctness. Reported results are overall accuracy@1 across all 200 questions.
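As a minimal sketch of what that comparison step might look like (the helper name and interface here are hypothetical; the actual validator is internal), node-set answers can be canonicalized so that the order in which a model lists vertices never affects correctness:

```python
def is_correct(parsed_answer, allowed_answers):
    # Canonicalize a node-list answer as a frozenset so vertex
    # ordering is irrelevant, then test membership in the set of
    # allowable answers supplied with each question.
    canonical = frozenset(parsed_answer)
    return canonical in {frozenset(ans) for ans in allowed_answers}

# Example with the vertex-cover sample: any ordering of the same
# node set is accepted; a different set is rejected.
allowed = [[0, 2, 3, 4, 8]]
print(is_correct([8, 4, 3, 2, 0], allowed))  # True
print(is_correct([0, 1, 2, 3, 4], allowed))  # False
```

Questions with multiple valid answers (for instance, several distinct minimum vertex covers) fit naturally here, since each allowable answer becomes one entry in the comparison set.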
Note on Grok-4 Results:
Grok-4's accuracy is 61% under the default 16,384-token limit. At this setting, the model did not produce final answers in a significant portion of cases. With a higher 65,536-token budget, accuracy rose to 69.5%. We report the default result for comparability but note this sensitivity to token limits for context.