Introducing SnorkelSpatial

A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs
Large language models (LLMs) are showing remarkable results on complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. In contrast, their spatial reasoning capabilities are far less well understood, even though such reasoning underlies many everyday tasks. We set out to test frontier models in this domain.
Why Spatial Reasoning Matters
Spatial reasoning is everywhere — from navigating a city map, to assembling furniture, to understanding how molecules interact in a chemical diagram. It’s a foundational cognitive skill that allows us to make sense of space, position, and movement.
But when it comes to LLMs, how well do they handle tasks that require tracking objects moving around in space? Can they mentally “rotate” a shape or keep track of an object as it moves and turns through a sequence of actions?
To answer these questions, we built SnorkelSpatial — a new benchmark that pushes LLMs to reason about space, movement, and orientation.
Inside the SnorkelSpatial World
SnorkelSpatial operates in a simple yet rich environment: a 2D grid world. Think of it like a small virtual board game. On this board, a few particles sit on tiles. The board and the particles can all move (forward, backward, left, or right) and rotate. Every action of the board affects the particles on it: when the board shifts or rotates, the particles shift or rotate with it.
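To make this concrete, here is a minimal sketch of how such a world state could be represented. The class and field names are illustrative only, not the benchmark's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative state for the grid world: a board with its own facing direction
# and a set of particles, each with a position and an orientation.

@dataclass
class Particle:
    name: str
    position: Tuple[int, int]   # (x, y) cell on the grid
    facing: str                 # "north", "south", "east", or "west"

@dataclass
class Board:
    size: int = 20              # 20x20 in the released benchmark
    facing: str = "north"
    particles: List[Particle] = field(default_factory=list)
```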
In this world, each problem in the benchmark contains:
- The board layout and a sequence of actions
- A query for the model to answer
- The programmatically verified answer
The model must mentally simulate the world to answer questions like:
- “Where is Particle A now?”
- “Which way is the board facing?”
- “What’s the position of Particle B relative to Particle C?”
For each problem, we generate a programmatically verified ground-truth answer, making the evaluation precise, unambiguous, and repeatable.
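Put together, a single benchmark item might look something like the record below. The schema and field names are hypothetical and the released format may differ, but the three ingredients (setup plus actions, query, verified answer) are the same.

```python
# A hypothetical problem record; the benchmark's actual file format may differ.
problem = {
    "board": {"size": 20, "facing": "north"},
    "particles": {
        "A": {"position": [2, 7], "facing": "east"},
        "B": {"position": [11, 4], "facing": "north"},
        "C": {"position": [15, 15], "facing": "south"},
    },
    "actions": [
        {"target": "A", "type": "move", "direction": "forward"},
        {"target": "board", "type": "rotate", "degrees": 90},
    ],
    "query": "Which way is the board facing?",
    "answer": "west",   # ground truth obtained by replaying the actions in a simulator
}
```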
The complexity of the problems in the benchmark is adjusted by changing parameters such as board size, number of particles, allowed actions, and the number of actions performed. For SnorkelSpatial, we fix several of these variables and control the problem complexity via the number of actions.
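A generator exposing these knobs might accept a configuration like the following sketch. The parameter names are illustrative; the concrete values match the setup described later in this post, and only the action count is varied.

```python
# Hypothetical generator knobs. In SnorkelSpatial, everything except the action
# count is held fixed; the concrete values below match the benchmark setup.
GENERATOR_CONFIG = {
    "board_size": 20,                        # fixed: 20x20 board
    "num_particles": 3,                      # fixed: three particles
    "allowed_actions": ["move", "rotate"],   # fixed
    "num_actions": [10, 20, 50, 100, 200],   # the complexity knob
}
```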
The Building Blocks: Actions and Queries
Actions: Moving and Rotating the Board and Particles
Each problem begins with a sequence of actions — movements or rotations applied to the board or its particles. These actions change the spatial configuration over time.
- Movements shift positions without changing orientation. If a movement carries a particle across the board's boundary, its position wraps around to the opposite side of the board.
- Rotations alter orientation by 0, 90, 180, 270, or 360 degrees counterclockwise. (Rotations by 0 and 360 degrees are no-ops; we keep them to test whether LLMs recognize the equivalence.)
For example, if the board is facing north and rotates 90° counterclockwise, it now faces west, and every particle on it rotates with it. The complexity arises as multiple movements and rotations accumulate, forcing the model to mentally “simulate” the world step by step.
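The sketch below spells out these action semantics under one set of assumed conventions: positions are (x, y) cells with y increasing toward the north, movement directions are interpreted relative to the mover's current facing, and rotations are taken about the board's center. These conventions are illustrative; the benchmark's internal representation may differ.

```python
# Action semantics under the assumed conventions described above. N is the
# board size; DIRS is listed counterclockwise so a 90-degree CCW turn simply
# advances the index by one.

N = 20
DIRS = ["north", "west", "south", "east"]
STEP = {"north": (0, 1), "west": (-1, 0), "south": (0, -1), "east": (1, 0)}


def move(pos, facing, direction):
    """Move one cell forward/backward/left/right relative to `facing`, wrapping at the edges."""
    turns = {"forward": 0, "left": 1, "backward": 2, "right": 3}  # CCW offsets from `facing`
    heading = DIRS[(DIRS.index(facing) + turns[direction]) % 4]
    dx, dy = STEP[heading]
    x, y = pos
    return ((x + dx) % N, (y + dy) % N)   # wrap around to the opposite side


def rotate_ccw(pos, facing, degrees):
    """Rotate a particle's position and facing with the board by 0/90/180/270/360 degrees CCW."""
    k = (degrees // 90) % 4               # 0 and 360 degrees are no-ops
    x, y = pos
    for _ in range(k):
        x, y = N - 1 - y, x               # 90-degree CCW rotation about the board center
        facing = DIRS[(DIRS.index(facing) + 1) % 4]
    return (x, y), facing


# The example from the text: facing north, rotated 90 degrees CCW, now facing west.
_, new_facing = rotate_ccw((3, 5), "north", 90)
assert new_facing == "west"
```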

Queries: Asking the Model What It Knows
After these transformations, the model must answer questions designed to test spatial reasoning from both allocentric (absolute) and egocentric (relative) frames of reference, a fundamental dichotomy drawn from cognitive psychology.
The queries fall into several categories:
- Absolute location: What are the (x, y) coordinates?
- Tile queries: Which tile is it on?
- Absolute orientation: Which direction is it facing?
- Relative location/orientation: Where is it, and which way is it facing, relative to another object?
Each query probes a slightly different type of spatial understanding — and together, they create a comprehensive test of how well LLMs can “think in space.”
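For the relative queries in particular, one plausible way to derive an egocentric answer from the absolute states is sketched below. The answer format and coordinate conventions are assumptions, not the benchmark's specification.

```python
# Deriving relative (egocentric) answers from absolute (allocentric) states.
# Conventions (y increasing northward, counterclockwise direction order) are
# assumptions, not the benchmark's specification.

DIRS = ["north", "west", "south", "east"]   # counterclockwise order


def relative_orientation(facing_a, facing_b):
    """How far A is rotated counterclockwise from B's facing, in degrees."""
    return 90 * ((DIRS.index(facing_a) - DIRS.index(facing_b)) % 4)


def relative_location(pos_a, pos_b, facing_b):
    """Position of A in B's egocentric frame: (steps to B's right, steps ahead of B)."""
    dx, dy = pos_a[0] - pos_b[0], pos_a[1] - pos_b[1]
    if facing_b == "north":
        return (dx, dy)
    if facing_b == "west":
        return (dy, -dx)
    if facing_b == "south":
        return (-dx, -dy)
    return (-dy, dx)                        # facing_b == "east"


# Example: B at (5, 5) facing east, A at (5, 6), one cell to the north.
# From B's point of view, A is one step to the left and zero steps ahead.
assert relative_location((5, 6), (5, 5), "east") == (-1, 0)
```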
How We Built and Tested It
For the benchmark, we fixed the board size (20×20), randomly placed three particles, and generated 330 problems, each with 10, 20, 50, 100, or 200 actions. Intuitively, we expect that more actions make the task harder.
Each model response is evaluated against the programmatically verified ground-truth answer. We attempt each question 10 times and report accuracy@1 across all 330 questions in the results below; when a model fails all 10 attempts, we treat the question as incorrect.
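Conceptually, the scoring loop is as simple as the sketch below. The `ask_model` call, the `normalize` helper, and the aggregation (averaging correctness over the repeated attempts to estimate accuracy@1) are placeholders and assumptions, not the actual evaluation harness.

```python
# A minimal, hypothetical scoring loop. `ask_model` and `normalize` are
# placeholders, and averaging correctness over repeated attempts is one common
# way to estimate accuracy@1; the actual harness may aggregate differently.

def normalize(answer):
    """Canonicalize answers so that '(3, 5)', '[3,5]', and '3, 5' compare equal."""
    return "".join(str(answer).lower().split()).strip("()[]")


def score(problems, ask_model, attempts=10):
    total = 0.0
    for problem in problems:
        responses = [ask_model(problem["query"]) for _ in range(attempts)]
        hits = sum(normalize(r) == normalize(problem["answer"]) for r in responses)
        total += hits / attempts            # per-question accuracy over the attempts
    return total / len(problems)            # overall accuracy across all questions
```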
Key Findings
Overall accuracies: Models specifically tuned for reasoning tasks are the strongest performers here. Only a handful of models (grok-4-fast, o3, gpt-5, and gpt-oss) exceed 50% accuracy, while several others, particularly older-generation models, fall significantly short.

Accuracy vs. number of actions: Intuitively, we expect that the longer the sequence of actions, the harder the problem. The breakdown of the top five models below bears this out: accuracy declines as the number of actions grows.

Query-Specific Insights
Different question types reveal different strengths and weaknesses:
i) Orientation queries are easiest — likely because there are only a few possible directions (north, south, east, west).
ii) Tile queries are the most challenging for the top models: compared to location and orientation queries, identifying the tile on which a particle sits requires additional reasoning. The results below show that the top five models have their lowest performance on these queries.
iii) Absolute (allocentric) vs. relative (egocentric) queries: Because relative queries depend on two objects (particles or the board), they require accurately tracking the states of both objects and are therefore harder than absolute queries. The following figure shows that models perform worse on relative queries than on absolute queries for the same attributes.


Closing Thoughts
Spatial reasoning sits at the crossroads of language, logic, and visualization — and understanding it will be key to building more capable and trustworthy AI systems.
SnorkelSpatial provides a systematic framework for evaluating spatial reasoning in LLMs through procedurally generated, programmatically verified problems that test both allocentric and egocentric reasoning across varying complexity levels. Our results show wide variance in performance among the most popular models, with accuracy declining as problem complexity increases.
In follow-up work, we plan to explore LLMs' ability to solve these problems by generating code or through visual chain-of-thought reasoning.
If your project relies on high-quality, expert-verified data or you’re building models that need to reason across complex domains, we’d love to collaborate. Come talk to us — and let’s push the boundaries of reasoning together.
Harit Vishwakarma is a Research Intern at Snorkel AI, focusing on evaluating and improving the reasoning capabilities of large language models. He recently completed his PhD in Computer Science at the University of Wisconsin–Madison. His research centers on studying and developing methods for reliable inference and leveraging them for automated data labeling and enhancing performance at test time. Next, he is off to the University of Oxford for a postdoc.