Archived

SnorkelSpatial

A procedurally-generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.

Overview

Large language models (LLMs) achieve strong results on complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. However, their spatial reasoning capabilities are less well understood, even though such reasoning underlies many everyday and scientific tasks involving geometry, diagrams, and spatial relations. To better understand the spatial reasoning capabilities of LLMs, we design a simple benchmark with a variety of problems based on a 2D grid world.

Leaderboard

Rank	Model	Score
1	GPT-5.4	99%
2	Grok 4 Fast Reasoning	84.85%
3	o3	76.67%
4	gpt-5	73.94%
5	gpt-oss-120b	52.73%
6	gpt-5-mini	45.45%
7	Claude Opus 4.1	45.15%
8	Magistral Medium 1.2	44.24%
9	Claude Opus 4	40.3%
10	o3-mini	37.88%
11	Claude Sonnet 4	33.33%
12	gpt-5-nano	26.67%
13	Claude Sonnet 3.7	21.52%
14	Gemini 2.5 Flash	18.79%
15	Llama 4 Scout	15.45%
16	Gemini 2.5 Pro	15.15%
17	gpt-5-chat	14.85%
18	Mistral Large	14.85%
19	o4 mini	14.85%
20	GPT-4.1	14.55%
21	Llama 3.3 70B	14.55%
22	Mistral Medium 3.1	14.55%
23	Nova Micro	14.55%
24	Command R+	14.24%
25	Nova Premier	14.24%
26	Qwen 3 235B	13.94%
27	Codestral	13.64%
28	Nova Lite	13.33%
29	Grok 3	12.73%
30	Magistral Medium	12.42%
31	Llama 4 Maverick	12.12%
32	Nova Pro	12.12%
33	Command R	11.82%

Types of actions

We define a set of primitive actions that are applied to the board and the particles to change their states.

Movements

The board and particles can move FORWARD, BACKWARD, LEFT, or RIGHT one step with respect to their current orientation. Movement does not change the orientation of the board or the particles. When the board moves in any direction, all the particles on it move with it in the same direction. If a particle's movement causes it to cross the board's boundary, its position wraps around to the opposite side of the board.
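The movement rule can be sketched in a few lines of Python. This is only an illustration, not the benchmark's implementation: the coordinate convention (east = +x, north = +y, board spanning 0 to board_size on both axes), the reading of LEFT as a quarter turn counterclockwise from the facing direction, and the names move, HEADINGS, STEP and TURNS are all assumptions made here.

```python
# Hypothetical sketch of relative movement with wrap-around.
# Assumed conventions: east = +x, north = +y, board spans [0, board_size).
HEADINGS = ["north", "west", "south", "east"]  # counterclockwise order
STEP = {"north": (0, 1), "west": (-1, 0), "south": (0, -1), "east": (1, 0)}
# Counterclockwise quarter turns from the facing direction to the travel direction.
TURNS = {"FORWARD": 0, "LEFT": 1, "BACKWARD": 2, "RIGHT": 3}


def move(position, facing, direction, board_size, steps=1):
    """Return the new (x, y) after moving relative to `facing`.

    The orientation itself is left untouched; coordinates wrap at the edges.
    """
    travel = HEADINGS[(HEADINGS.index(facing) + TURNS[direction]) % 4]
    dx, dy = STEP[travel]
    x, y = position
    return ((x + dx * steps) % board_size, (y + dy * steps) % board_size)


# A particle facing south that moves BACKWARD travels north.
print(move((1.5, 2.5), "south", "BACKWARD", board_size=5))  # (1.5, 3.5)
# A particle at the east edge that moves FORWARD wraps to the west edge.
print(move((4.5, 2.5), "east", "FORWARD", board_size=5))  # (0.5, 2.5)
```

Using the modulo operator for the wrap keeps the function free of boundary special cases, including steps that move off the low edge of the board.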

Rotations

The board and particles can rotate by 0, 90, 180, 270, or 360 degrees counterclockwise with respect to their current orientation. When the board is rotated by a certain angle, all the particles on it are also rotated by the same angle.
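A minimal sketch of the orientation update follows, assuming orientations are the four compass headings. The function names rotate and rotate_board are hypothetical. The text above does not say how a board rotation affects particle positions, so this sketch tracks orientations only.

```python
# Hypothetical sketch: counterclockwise rotation over compass headings.
HEADINGS = ["north", "west", "south", "east"]  # counterclockwise order


def rotate(facing, degrees):
    """Rotate a heading counterclockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return HEADINGS[(HEADINGS.index(facing) + degrees // 90) % 4]


def rotate_board(board_facing, particle_facings, degrees):
    """Rotate the board and every particle on it by the same angle."""
    rotated = {name: rotate(f, degrees) for name, f in particle_facings.items()}
    return rotate(board_facing, degrees), rotated


print(rotate("north", 90))  # west
print(rotate_board("north", {"P1": "south", "P2": "east"}, 180))
# ('south', {'P1': 'north', 'P2': 'west'})
```

Rotations by 0 and 360 degrees leave every heading unchanged, so such actions act as distractors in a long action sequence.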

Types of queries

After performing a sequence of the above actions, we ask the LLM several types of queries about the states of the board and particles to test its spatial reasoning capabilities. The absolute and relative queries are designed to test allocentric (absolute) and egocentric (relative) spatial reasoning, a fundamental distinction from cognitive psychology. Questions cover absolute location, tile queries (i.e., the tile on which a given particle is currently located), absolute orientation, relative location, and relative orientation.

Dataset

We evaluate models on a dataset of 330 samples, varying the number of actions selected from the set {10, 20, 50, 100, 200}. Intuitively, we expect the problems with a larger number of actions to be harder.

Sample task

The following abridged version of a sample is based on the above figure. The problems used for the evaluation use a larger board and have more actions. Note that only a textual description is provided to the models; the figure above is for illustration only.

# Initial States
Two particles P1, P2 on board B1 (5×5, tiles 1–25, zigzag pattern). Board at (2.5, 2.5), facing north.
P1 at (1.5, 2.5) facing south.
P2 at (4.5, 2.5) facing east.

# Actions
1. Rotate P2 by 90 degrees
2. Move P1 BACKWARD 1 unit
3. Rotate P2 by 0 degrees
4. Rotate board B1 by 180 degrees

# Question
What is the orientation of particle P1 after all the actions?

# Response format: JSON
{ "particle_P1_orientation": "..." }
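For illustration, the sample above can be traced with a short script. It relies only on the stated rules (movement never changes orientation, and a board rotation rotates every particle on it). The tuple encoding of the actions and the lowercase answer string are assumptions, since the sample shows the response value only as "...".

```python
import json

# Hypothetical trace of the sample; only orientations are tracked.
HEADINGS = ["north", "west", "south", "east"]  # counterclockwise order


def turn(facing, degrees):
    """Rotate a heading counterclockwise by a multiple of 90 degrees."""
    return HEADINGS[(HEADINGS.index(facing) + degrees // 90) % 4]


facing = {"B1": "north", "P1": "south", "P2": "east"}
particles = ["P1", "P2"]

actions = [
    ("rotate", "P2", 90),
    ("move", "P1", "BACKWARD"),
    ("rotate", "P2", 0),
    ("rotate", "B1", 180),
]

for kind, target, arg in actions:
    if kind != "rotate":
        continue  # movement does not change any orientation
    # Rotating the board also rotates every particle on it.
    targets = [target] + (particles if target == "B1" else [])
    for name in targets:
        facing[name] = turn(facing[name], arg)

answer = json.dumps({"particle_P1_orientation": facing["P1"]})
print(answer)  # {"particle_P1_orientation": "north"}
```

Only the final board rotation affects P1: it starts facing south and a 180 degree turn leaves it facing north, whichever rotation direction is used.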

Methodology

Metric

Accuracy@1 over the full dataset, computed against programmatically obtained ground truth.

Input

Text-only description of initial board setup and action sequence. No visual input provided.

Output Format

Structured JSON. Each question is attempted up to 10 times; missing or invalid output in all trials is scored as incorrect.
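The scoring procedure might look like the sketch below. It is one plausible reading rather than the actual harness: it assumes the first attempt that yields valid JSON with the expected key is the one scored, which the description above does not state explicitly. The names first_valid_answer and accuracy, and the layout of the sample records, are hypothetical.

```python
import json


def first_valid_answer(raw_outputs, key, max_attempts=10):
    """Return the value under `key` from the first parseable attempt, else None."""
    for raw in raw_outputs[:max_attempts]:
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # missing or malformed output: try the next attempt
        if isinstance(parsed, dict) and key in parsed:
            return parsed[key]
    return None


def accuracy(samples):
    """Fraction of samples whose scored answer equals the ground truth."""
    correct = 0
    for sample in samples:
        answer = first_valid_answer(sample["outputs"], sample["key"])
        if answer is not None and answer == sample["truth"]:
            correct += 1
    return correct / len(samples)


samples = [
    # Second attempt is the first valid one, and it is correct.
    {"key": "particle_P1_orientation", "truth": "north",
     "outputs": ["not json", '{"particle_P1_orientation": "north"}']},
    # No attempt produced valid output, so the sample counts as incorrect.
    {"key": "particle_P1_orientation", "truth": "east",
     "outputs": [None, "{broken"]},
]
print(accuracy(samples))  # 0.5
```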

Future Work

Exploring code generation and visual chain-of-thought as alternative reasoning paths.

Behind the benchmark

We design a benchmark that tests spatial reasoning through a controlled grid-based environment. The setup consists of a two-dimensional square board with a few particles placed on it. The board and the particles can move and rotate. A sequence of actions, such as rotations and translations, is applied to the board and particles. After these actions, the model is asked questions about the final positions and orientations of either the board or the particles. The complexity of the problems is controlled by changing the number of actions applied to the board and particles.
