Released April 04, 2026
Archived

SnorkelSpatial

A procedurally generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.

Overview

Large language models (LLMs) achieve remarkable results on complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. Their spatial reasoning capabilities are less well understood, even though such reasoning underlies many everyday and scientific tasks involving geometry, diagrams, and spatial relations. To better understand these capabilities, we design a simple spatial reasoning benchmark with a variety of problems based on a 2D grid world.

Leaderboard

Rank  Model                   Score
1     GPT-5.4                 99%
2     Grok 4 Fast Reasoning   84.85%
3     o3                      76.67%
4     gpt-5                   73.94%
5     gpt-oss-120b            52.73%
6     gpt-5-mini              45.45%
7     Claude Opus 4.1         45.15%
8     Magistral Medium 1.2    44.24%
9     Claude Opus 4           40.3%
10    o3-mini                 37.88%
11    Claude Sonnet 4         33.33%
12    gpt-5-nano              26.67%
13    Claude Sonnet 3.7       21.52%
14    Gemini 2.5 Flash        18.79%
15    Llama 4 Scout           15.45%
16    Gemini 2.5 Pro          15.15%
17    gpt-5-chat              14.85%
18    Mistral Large           14.85%
19    o4 mini                 14.85%
20    GPT-4.1                 14.55%
21    Llama 3.3 70B           14.55%
22    Mistral Medium 3.1      14.55%
23    Nova Micro              14.55%
24    Command R+              14.24%
25    Nova Premier            14.24%
26    Qwen 3 235B             13.94%
27    Codestral               13.64%
28    Nova Lite               13.33%
29    Grok 3                  12.73%
30    Magistral Medium        12.42%
31    Llama 4 Maverick        12.12%
32    Nova Pro                12.12%
33    Command R               11.82%

Types of actions

We define a set of primitive actions that are applied to the board and the particles to change their state.

Movements

The board and particles can each move FORWARD, BACKWARD, LEFT, or RIGHT by one step relative to their current orientation. Movement does not change the orientation of the board or the particles. When the board moves, all particles on it move with it in the same direction. If a particle's movement carries it across the board's boundary, its position wraps around to the opposite side of the board.
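The movement rule for a single particle can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: it assumes +x is east and +y is north, that LEFT means 90 degrees counterclockwise of the facing direction, and that the board is axis-aligned with one corner at the origin. The names `HEADINGS`, `step_vector`, and `move_particle` are hypothetical.

```python
# Unit vectors for the four headings (assumption: +x is east, +y is north).
HEADINGS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}


def step_vector(facing, direction):
    """Return the one-step displacement for a move relative to `facing`."""
    fx, fy = HEADINGS[facing]
    if direction == "FORWARD":
        return (fx, fy)
    if direction == "BACKWARD":
        return (-fx, -fy)
    if direction == "LEFT":   # assumption: 90 degrees counterclockwise of facing
        return (-fy, fx)
    if direction == "RIGHT":  # assumption: 90 degrees clockwise of facing
        return (fy, -fx)
    raise ValueError(f"unknown direction: {direction}")


def move_particle(pos, facing, direction, board_size=5):
    """Move one step and wrap around the board edges; facing is unchanged."""
    dx, dy = step_vector(facing, direction)
    # Assumption: the board spans [0, board_size) on both axes.
    return ((pos[0] + dx) % board_size, (pos[1] + dy) % board_size)


# A particle facing south that moves BACKWARD travels north.
print(move_particle((1.5, 2.5), "south", "BACKWARD"))
# A particle at the west edge facing west wraps to the east edge.
print(move_particle((0.5, 2.5), "west", "FORWARD"))
```

The modulo operation handles wraparound in both directions because Python's `%` returns a non-negative result for a positive modulus.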

Rotations

The board and particles can rotate by 0, 90, 180, 270, or 360 degrees counterclockwise relative to their current orientation. When the board is rotated by a given angle, all particles on it are rotated by the same angle.
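The effect of a rotation on a facing direction can be sketched as below. This is an illustrative helper under the assumption that orientations are the four compass directions; it covers orientation only and says nothing about how a board rotation affects particle positions. The names `COMPASS_CCW` and `rotate_heading` are hypothetical.

```python
# Compass directions listed in counterclockwise order.
COMPASS_CCW = ["north", "west", "south", "east"]


def rotate_heading(facing, degrees):
    """Rotate a compass heading counterclockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    steps = (degrees // 90) % 4  # 0 and 360 both leave the heading unchanged
    index = COMPASS_CCW.index(facing)
    return COMPASS_CCW[(index + steps) % 4]


print(rotate_heading("east", 90))    # east turned counterclockwise
print(rotate_heading("south", 360))  # full turn
```

Rotations by 0 and 360 degrees are no-ops on orientation, so they act as distractor steps in an action sequence.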
[Figure: four images]

Types of queries

After applying a sequence of the above actions, we ask the LLM queries about the resulting states of the board and particles. Absolute queries test allocentric spatial reasoning and relative queries test egocentric spatial reasoning, a fundamental dichotomy from cognitive psychology. The query types are absolute location, tile (the tile on which a given particle is currently located), absolute orientation, relative location, and relative orientation.
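To make the allocentric and egocentric distinction concrete, the sketch below converts an absolute offset between two particles into the observer's own frame. The benchmark's exact definition of relative location is not given here, so this is only one plausible reading: it assumes +x is east and +y is north, ignores wraparound, and reports how far the target lies ahead of and to the right of the observer. The function name `egocentric_offset` is hypothetical.

```python
# Unit vectors for the four headings (assumption: +x is east, +y is north).
HEADINGS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}


def egocentric_offset(observer_pos, observer_facing, target_pos):
    """Return (forward, right) components of the target in the observer's frame."""
    fx, fy = HEADINGS[observer_facing]
    dx = target_pos[0] - observer_pos[0]
    dy = target_pos[1] - observer_pos[1]
    forward = dx * fx + dy * fy   # projection onto the facing direction
    right = dx * fy - dy * fx     # projection onto the observer's right-hand side
    return forward, right


# Absolute view: the target is 3 units east of the observer.
# Egocentric view for an observer facing south: 3 units to its left.
forward, right = egocentric_offset((1.5, 2.5), "south", (4.5, 2.5))
print(forward, right)
```

A negative `right` value means the target is on the observer's left, and a negative `forward` value means it is behind the observer.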

Dataset

We evaluate models on a dataset of 330 samples, with the number of actions per sample drawn from the set {10, 20, 50, 100, 200}. Intuitively, we expect problems with more actions to be harder.
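A procedural generator for action sequences of a chosen length might look like the sketch below. The actual generation procedure is not described here, so everything in it is an assumption: the uniform sampling, the even split between moves and rotations, the wording of board moves, and the names `sample_actions`, `DIRECTIONS`, and `ANGLES`.

```python
import random

DIRECTIONS = ["FORWARD", "BACKWARD", "LEFT", "RIGHT"]
ANGLES = [0, 90, 180, 270, 360]


def sample_actions(n_actions, particles, board="B1", seed=0):
    """Draw a reproducible random action sequence (hypothetical sampling scheme)."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    targets = [f"board {board}"] + list(particles)
    actions = []
    for _ in range(n_actions):
        target = rng.choice(targets)
        if rng.random() < 0.5:
            actions.append(f"Move {target} {rng.choice(DIRECTIONS)} 1 unit")
        else:
            actions.append(f"Rotate {target} by {rng.choice(ANGLES)} degrees")
    return actions


# One sequence per difficulty level named in the text.
for n in (10, 20, 50, 100, 200):
    sequence = sample_actions(n, ["P1", "P2"], seed=n)
    print(n, sequence[0])
```

Seeding the generator per sample makes each problem reproducible, which is what allows ground truth to be computed programmatically.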

Sample task

The following abridged sample is based on the figure above. The problems used in the evaluation use a larger board and more actions. Models receive only the textual description; the figure is for illustration only.

# Initial States
Two particles P1, P2 on board B1 (5×5, tiles 1–25, zigzag pattern). Board at (2.5, 2.5), facing north.
P1 at (1.5, 2.5) facing south.
P2 at (4.5, 2.5) facing east.

# Actions
1. Rotate P2 by 90 degrees
2. Move P1 BACKWARD 1 unit
3. Rotate P2 by 0 degrees
4. Rotate board B1 by 180 degrees

# Question
What is the orientation of particle P1 after all the actions?

# Response format: JSON
{ "particle_P1_orientation": "..." }
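The sample above can be traced programmatically. The sketch below tracks orientations only, using the rules stated under Movements and Rotations: moves leave orientation unchanged, and a board rotation rotates every particle by the same angle. The tuple encoding of actions and the function name `final_orientations` are hypothetical, chosen for illustration.

```python
import json

# Angles measured counterclockwise from east, in degrees.
ANGLE = {"east": 0, "north": 90, "west": 180, "south": 270}
NAME = {angle: name for name, angle in ANGLE.items()}


def final_orientations(initial, actions, board="B1"):
    """Apply actions and return the final facing of the board and each particle."""
    state = {name: ANGLE[facing] for name, facing in initial.items()}
    for kind, target, value in actions:
        if kind != "rotate":
            continue  # moves do not change orientation
        # Rotating the board also rotates every particle on it.
        targets = list(state) if target == board else [target]
        for name in targets:
            state[name] = (state[name] + value) % 360
    return {name: NAME[angle] for name, angle in state.items()}


initial = {"B1": "north", "P1": "south", "P2": "east"}
actions = [
    ("rotate", "P2", 90),
    ("move", "P1", "BACKWARD"),
    ("rotate", "P2", 0),
    ("rotate", "B1", 180),
]
result = final_orientations(initial, actions)
print(json.dumps({"particle_P1_orientation": result["P1"]}))
```

Under these rules only the final board rotation affects P1, turning it from south to north; the two P2 rotations and the P1 move are irrelevant to the question asked.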

Methodology

Metric
Accuracy@1 over the full dataset, computed against programmatically obtained ground truth.
Input
Text-only description of initial board setup and action sequence. No visual input provided.
Output Format
Structured JSON. Each question is attempted up to 10 times; missing or invalid output in all trials is scored as incorrect.
Future Work
Exploring code generation and visual chain-of-thought as alternative reasoning paths.
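The scoring rule described under Output Format can be sketched as follows. This is not the benchmark's harness: it assumes that the first parseable answer is the one scored, that a valid answer is a JSON object holding a string under the expected key, and that `ask_model` is a hypothetical zero-argument callable returning the model's raw text.

```python
import json


def parse_answer(raw, key):
    """Return the answer string under `key`, or None if the output is invalid."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # missing output or malformed JSON
        return None
    if not isinstance(data, dict):
        return None
    value = data.get(key)
    return value if isinstance(value, str) else None


def score_question(ask_model, key, truth, max_trials=10):
    """Retry until a valid answer appears; all-invalid trials count as incorrect."""
    for _ in range(max_trials):
        answer = parse_answer(ask_model(), key)
        if answer is not None:
            # Assumption: the first valid answer is final, right or wrong.
            return answer == truth
    return False


def accuracy(results):
    """Fraction of questions scored as correct."""
    return sum(results) / len(results)


# Stand-in model: two invalid outputs, then a valid one.
replies = iter(["not json", None, '{"particle_P1_orientation": "north"}'])
print(score_question(lambda: next(replies), "particle_P1_orientation", "north"))
```

Retrying only on invalid output keeps the metric at accuracy@1: a model gets extra attempts to produce well-formed JSON, not extra attempts at the answer.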

Behind the benchmark

We design a benchmark that tests spatial reasoning through a controlled grid-based environment. The setup consists of a two-dimensional square board with a few particles placed on it. The board and the particles can move and rotate. A sequence of actions, such as rotations and translations, is applied to the board and particles. After these actions, the model is asked questions about the final positions and orientations of either the board or the particles. Problem complexity is controlled by varying the number of actions applied to the board and particles.
