
Introducing SnorkelSpatial

October 24, 2025

A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs

Large language models (LLMs) are showing remarkable results on solving complex reasoning problems across domains—from mathematical proofs and logical puzzles to graduate-level science and engineering questions. On the other hand, their spatial reasoning capabilities are less understood, even though such reasoning underlies many everyday tasks. We set out to test frontier models in this domain.

Why Spatial Reasoning Matters

Spatial reasoning is everywhere — from navigating a city map, to assembling furniture, to understanding how molecules interact in a chemical diagram. It’s a foundational cognitive skill that allows us to make sense of space, position, and movement.

But when it comes to LLMs, how well do they handle tasks that require tracking objects moving around in space? Can they mentally “rotate” a shape or keep track of an object as it moves and turns through a sequence of actions?

To answer these questions, we built SnorkelSpatial — a new benchmark that pushes LLMs to reason about space, movement, and orientation.

Inside the SnorkelSpatial World

SnorkelSpatial operates in a simple yet rich environment: a 2D grid world. Think of it like a small virtual board game. On this board, a few particles sit on tiles. The board and the particles can all move (forward, backward, left, or right) and rotate. Every action of the board affects the particles on it: when the board shifts or rotates, the particles shift or rotate with it.

In this world, each problem in the benchmark contains:

  • The board layout and a sequence of actions
  • A query for the model to answer
  • The programmatically verified answer

The model must mentally simulate the world to answer questions like:

  • “Where is Particle A now?”
  • “Which way is the board facing?”
  • “What’s the position of Particle B relative to Particle C?”

For each problem, we generate a programmatically verified ground truth answer, making the evaluation precise, unambiguous, and repeatable.

The complexity of the problems in the benchmark can be adjusted by changing parameters such as board size, number of particles, allowed actions, and the number of actions performed. For SnorkelSpatial, we fix several of these variables and control the problem complexity via the number of actions.

The Building Blocks: Actions and Queries

Actions: Moving and Rotating the Board and Particles

Each problem begins with a sequence of actions — movements or rotations applied to the board or its particles. These actions change the spatial configuration over time.

  • Movements shift positions without changing orientation. If a movement carries a particle across the board’s boundary, the particle wraps around and re-enters from the opposite side of the board.
  • Rotations alter orientation by 0, 90, 180, 270, or 360 degrees counterclockwise. (Note: rotations by 0 and 360 degrees are both no-ops, but we keep these variations to test whether LLMs recognize this equivalence.)

For example, if the board facing north rotates 90° counterclockwise, it now faces west — and every particle on it rotates too. The complexity arises as multiple movements and rotations accumulate, forcing the model to mentally “simulate” the world step by step.
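
The two action types can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's implementation: the function names, the coordinate convention (x grows east, y grows north), and the use of compass directions for movement are all assumptions made for the example.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not the benchmark's actual code.
BOARD_SIZE = 20  # the post uses a 20x20 board

# Headings listed in counterclockwise order, so each 90-degree
# counterclockwise turn advances one position in this list.
HEADINGS = ["north", "west", "south", "east"]

# Assumed convention: x grows to the east, y grows to the north.
STEP = {"north": (0, 1), "west": (-1, 0), "south": (0, -1), "east": (1, 0)}


def rotate(heading: str, degrees: int) -> str:
    """Rotate a heading counterclockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    turns = (degrees // 90) % 4  # 0 and 360 both reduce to zero turns
    return HEADINGS[(HEADINGS.index(heading) + turns) % 4]


def move(position: tuple[int, int], direction: str, steps: int = 1) -> tuple[int, int]:
    """Shift a position along a compass direction, wrapping at the board edge."""
    dx, dy = STEP[direction]
    x, y = position
    return ((x + dx * steps) % BOARD_SIZE, (y + dy * steps) % BOARD_SIZE)


print(rotate("north", 90))    # north -> west, matching the example above
print(rotate("north", 360))   # full turn, heading unchanged
print(move((19, 5), "east"))  # crosses the east edge, re-enters at x = 0
```

The modulo operation handles both the wrap-around at the boundary and the equivalence of 0 and 360 degree rotations, which is why a correct simulator never needs to special-case either one.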

Queries: Asking the Model What It Knows

After these transformations, the model must answer questions designed to test spatial reasoning from both allocentric (absolute) and egocentric (relative) frames of reference, a fundamental distinction drawn from cognitive psychology.

The queries fall into several categories:

  • Absolute location: What are the (x, y) coordinates?
  • Tile queries: Which tile is it on?
  • Absolute orientation: Which direction is it facing?
  • Relative location/orientation: What is one object’s location or orientation relative to another’s?

Each query probes a slightly different type of spatial understanding — and together, they create a comprehensive test of how well LLMs can “think in space.”
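
To make the absolute versus relative distinction concrete, here is a minimal sketch of how such queries could be answered from a final world state. The state layout and function names are hypothetical. The sketch also assumes that a relative location is a plain coordinate offset in board axes and that a relative orientation is a counterclockwise angle difference; the benchmark may define these differently.

```python
# Illustrative sketch; the state layout and query definitions are assumptions.
# Headings are expressed as counterclockwise degrees from north.
HEADING_DEGREES = {"north": 0, "west": 90, "south": 180, "east": 270}

# A hypothetical final state after all actions have been applied.
state = {
    "A": {"pos": (3, 4), "heading": "north"},
    "B": {"pos": (1, 1), "heading": "west"},
}


def absolute_location(state: dict, name: str) -> tuple[int, int]:
    """Absolute query: the object's own (x, y) coordinates."""
    return state[name]["pos"]


def absolute_orientation(state: dict, name: str) -> str:
    """Absolute query: the direction the object is facing."""
    return state[name]["heading"]


def relative_location(state: dict, target: str, reference: str) -> tuple[int, int]:
    """Relative query: offset of target from reference, in board axes."""
    tx, ty = state[target]["pos"]
    rx, ry = state[reference]["pos"]
    return (tx - rx, ty - ry)


def relative_orientation(state: dict, target: str, reference: str) -> int:
    """Relative query: counterclockwise degrees from reference's heading to target's."""
    t = HEADING_DEGREES[state[target]["heading"]]
    r = HEADING_DEGREES[state[reference]["heading"]]
    return (t - r) % 360


print(absolute_location(state, "A"))
print(relative_location(state, "A", "B"))
print(relative_orientation(state, "A", "B"))
```

Note that each relative query reads the state of two objects, so an error in tracking either one produces a wrong answer.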

How We Built and Tested It

For the benchmark, we fixed the board size (20×20), randomly placed three particles, and generated 330 problems, each with 10, 20, 50, 100, or 200 actions. Intuitively, we expect that more actions make the task harder.
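
A procedural generator for this setup could look roughly like the sketch below. Only the board size, the particle count, the action counts, and the rotation angles come from the description above. Everything else is an assumption made for illustration: the `ProblemConfig` and `generate_problem` names, the even split between moves and rotations, and the uniform sampling of targets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemConfig:
    """Hypothetical container for the knobs described in the post."""
    board_size: int = 20
    num_particles: int = 3
    num_actions: int = 10


ACTION_COUNTS = (10, 20, 50, 100, 200)  # complexity levels from the post
MOVES = ("forward", "backward", "left", "right")
ROTATIONS = (0, 90, 180, 270, 360)  # degrees counterclockwise


def generate_problem(config: ProblemConfig, rng: random.Random) -> dict:
    """Sample random particle placements and a random action sequence."""
    particles = [f"particle_{i}" for i in range(config.num_particles)]
    placements = {
        name: (rng.randrange(config.board_size), rng.randrange(config.board_size))
        for name in particles
    }
    targets = ["board"] + particles
    actions = []
    for _ in range(config.num_actions):
        target = rng.choice(targets)
        # Assumed 50/50 split between movements and rotations.
        if rng.random() < 0.5:
            actions.append((target, "move", rng.choice(MOVES)))
        else:
            actions.append((target, "rotate", rng.choice(ROTATIONS)))
    return {"placements": placements, "actions": actions}


# A seeded generator makes every problem reproducible.
problem = generate_problem(ProblemConfig(num_actions=50), random.Random(0))
print(len(problem["actions"]), "actions;", problem["placements"])
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps generation deterministic for a given seed, which is what allows the ground truth to be recomputed and verified later.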

Each model response is evaluated against the verifiable ground truth answer. We attempt each question 10 times, and report accuracy@1 across all 330 questions in the results below. When a model fails all 10 attempts, we treat it as incorrect.
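
Because every answer is verifiable, grading reduces to an exact comparison against the ground truth. The sketch below shows one way to compute accuracy overall or per group, for example per action count or per query type. The record fields and the `accuracy_by_group` helper are hypothetical, the three records are made-up placeholders rather than benchmark results, and the sketch grades a single response per question because it does not model how the ten attempts are combined.

```python
from collections import defaultdict


def accuracy_by_group(records: list[dict], key: str) -> dict:
    """Exact-match accuracy, grouped by the value of `key` in each record."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        # A response counts only if it matches the ground truth exactly.
        correct[group] += int(record["response"] == record["answer"])
    return {group: correct[group] / totals[group] for group in totals}


# Placeholder records for illustration only, not real model outputs.
records = [
    {"num_actions": 10, "answer": "west", "response": "west"},
    {"num_actions": 10, "answer": "(3, 4)", "response": "(3, 5)"},
    {"num_actions": 200, "answer": "north", "response": "south"},
]

print(accuracy_by_group(records, "num_actions"))
```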

Key Findings

Overall accuracies: Models tuned specifically for reasoning tasks are the strongest performers here. Only a handful of models (grok-4-fast, o3, gpt-5, and gpt-oss) exceed 50% accuracy, while several others, particularly older-generation models, fall significantly short.

Accuracy vs. number of actions: Intuitively, we expect that the longer the list of actions, the harder the problem. The breakdown for the top 5 models matches this expectation.

Chart of accuracy scores of top models as the number of actions varies

Query-Specific Insights

Different question types reveal different strengths and weaknesses:

i) Orientation queries are easiest — likely because there are only a few possible directions (north, south, east, west).

ii) Tile queries are the most challenging for the top models: compared to the location and orientation queries, finding the tile on which a particle is located involves additional reasoning. The following results show that the top 5 models have their lowest performance on this type of query.

iii) Absolute (allocentric) vs. relative (egocentric) queries: Since relative queries depend on two objects (particles or the board), they require an accurate assessment of the states of both objects and are thus more difficult than absolute queries. The following figure shows that models perform worse on relative queries than on absolute queries for the same attributes.

Closing Thoughts

Spatial reasoning sits at the crossroads of language, logic, and visualization — and understanding it will be key to building more capable and trustworthy AI systems.

SnorkelSpatial provides a systematic framework for evaluating spatial reasoning in LLMs through procedurally generated, programmatically verified problems that test both allocentric and egocentric reasoning across varying complexity levels. Our results show wide variance in performance among the most popular models available, with accuracy declining as problem complexity increases.

In follow-up work, we plan to explore LLM capabilities in solving these problems by generating code or doing visual chain-of-thought reasoning.

If your project relies on high-quality, expert-verified data or you’re building models that need to reason across complex domains, we’d love to collaborate. Come talk to us — and let’s push the boundaries of reasoning together.

Harit Vishwakarma
Research Intern

Harit Vishwakarma is a Research Intern at Snorkel AI, focusing on evaluating and improving the reasoning capabilities of large language models. He recently completed his PhD in Computer Science at the University of Wisconsin–Madison. His research centers on studying and developing methods for reliable inference and leveraging them for automated data labeling and enhancing performance at test time. Next, he is off to the University of Oxford for a postdoc.
