Released April 04, 2026
Archived

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.
Overview

The Snorkel Wordle Benchmark is a diagnostic task that tests language models on their ability to solve the popular word game Wordle on a set of 100 low-frequency English words. The game requires the model to guess a hidden five-letter word within six attempts, receiving feedback after each guess in the form of partial matches (correct letter, wrong position) and exact matches (correct letter, correct position).
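
The feedback rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own implementation: the function name `score_guess`, the `G`/`Y`/`-` encoding, and the handling of repeated letters (which follows the usual Wordle convention) are all assumptions.

```python
from collections import Counter


def score_guess(guess: str, target: str) -> str:
    """Return one feedback character per letter of the guess.

    'G' = exact match (correct letter, correct position)
    'Y' = partial match (correct letter, wrong position)
    '-' = letter not available in the target
    The encoding is an assumption made for this sketch.
    """
    guess, target = guess.lower(), target.lower()
    feedback = ["-"] * len(guess)

    # Target letters not consumed by an exact match stay available
    # for partial matches (assumed duplicate-letter convention).
    remaining = Counter(t for g, t in zip(guess, target) if g != t)

    # First pass: mark exact matches.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"

    # Second pass: mark partial matches while letters remain.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)
```

For example, `score_guess("slate", "stale")` returns `"GYGYG"`: S, A, and E are exact matches, while L and T are present but misplaced.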

This is a constrained linguistic reasoning problem in which success depends on memory, planning, linguistic knowledge (such as orthography and common word structures), and the ability to incorporate feedback iteratively.

Leaderboard

Rank  Model                  Score
1     gpt-5                  94%
2     Grok 4                 93%
3     o3                     92.9%
4     o4 mini                91.9%
5     Gemini 3 Pro           91%
6     gpt-5-mini             91%
7     o3-mini                90%
8     Grok 4 Fast Reasoning  88%
9     Claude Opus 4          85.6%
10    Kimi-K2-Thinking       85%
11    Claude Sonnet 4        83%
12    gpt-oss-120b           81.6%
13    gpt-5-nano             79%
14    Gemini 2.5 Pro         74%
15    Grok 3                 71%
16    Claude Sonnet 3.7      68%
17    gpt-oss-20b            65.9%
18    GPT-4.1                62%
19    Gemini 2.5 Flash       61.9%
20    Kimi-K2                54%
21    Llama 3.3 70B          10.2%

Data sample

Model: claude-opus-4-20250514
Solved in: 4 turns
Cost of this completion: $0.23

<think>Since this is my first guess, I don't have any feedback yet to work with. I should start with a word that contains common letters and vowels to maximize the information I can gather. A good starting word should ideally contain:
- Common vowels (like A, E, I, O)
- Common consonants (like R, S, T, L, N)
- No repeated letters (to test more letters at once)

Some popular starting words are "ADIEU" (contains 4 vowels), "SLATE" (common letters), "CRANE" (good mix), or "STARE" (very common letters).

I'll go with "SLATE" as it contains very common letters: S, L, A, T, E, which appear frequently in English words. This will help me identify or eliminate several common letters in one guess.</think>

<answer>slate</answer>

S L A T E

Methodology

Primary metric: Win Rate, the percentage of games where the model successfully identifies the target word within six guesses.
Word set: 100 low-frequency English words selected using the Google Books Ngram Corpus to avoid trivial or overfit targets.
Diagnostics: Common first guesses and linguistic plausibility (invalid guess rate, measured against the NLTK corpus and PyEnchant).
Grading: Regex-based pattern matching on both intermediate guesses and final answers for consistent automated scoring.
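
As a rough sketch of how such grading could work, the snippet below extracts guesses from `<answer>` tags (the format visible in the data sample) and computes the win rate and invalid guess rate. The exact regex, the function names, and the use of a plain Python set in place of the NLTK and PyEnchant lookups are assumptions made for illustration; the benchmark's actual grader is not shown in the source.

```python
import re
from typing import List, Optional, Set, Tuple

# Assumed pattern: a five-letter word inside <answer>...</answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*([A-Za-z]{5})\s*</answer>")
MAX_GUESSES = 6  # the game allows six attempts


def extract_guess(completion: str) -> Optional[str]:
    """Return the last tagged five-letter guess, lowercased, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].lower() if matches else None


def is_win(guesses: List[str], target: str) -> bool:
    """A game is won if the target appears within the first six guesses."""
    return target in guesses[:MAX_GUESSES]


def win_rate(games: List[Tuple[List[str], str]]) -> float:
    """Percentage of (guesses, target) games that were won."""
    wins = sum(is_win(guesses, target) for guesses, target in games)
    return 100.0 * wins / len(games)


def invalid_guess_rate(guesses: List[str], dictionary: Set[str]) -> float:
    """Percentage of guesses not found in the reference word list.

    `dictionary` is a stand-in for a real lexicon lookup.
    """
    invalid = sum(guess not in dictionary for guess in guesses)
    return 100.0 * invalid / len(guesses)
```

Taking the last match rather than the first is a design choice for this sketch: it avoids picking up an `<answer>` tag that a model might mention inside its reasoning before giving its actual guess.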

Behind the benchmark

This benchmark is designed to probe:
01 Linguistic reasoning: Does the model understand how English words are formed and constrained?
02 Iterative instruction following: Can the model refine its strategy based on feedback after each guess?
03 Search space navigation: How efficiently can a model prune and explore word candidates?

Through this benchmark, we gain insight into how well models handle multi-turn reasoning tasks with intermediate feedback and limited action spaces, and we observe crucial differences in capabilities between reasoning and non-reasoning models.
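
To make the idea of pruning word candidates concrete, here is a minimal sketch of a feedback-consistency filter: a candidate survives only if it would have produced the observed feedback for the guess that was played. The helper names, the `G`/`Y`/`-` feedback encoding, and the tiny candidate list are hypothetical; this illustrates the search problem, not how any evaluated model actually reasons.

```python
from collections import Counter
from typing import List


def score_guess(guess: str, target: str) -> str:
    """Compact Wordle scorer: 'G' exact, 'Y' partial, '-' absent (assumed encoding)."""
    # Letters of the target not consumed by exact matches.
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    out = []
    for g, t in zip(guess, target):
        if g == t:
            out.append("G")
        elif remaining[g] > 0:
            remaining[g] -= 1
            out.append("Y")
        else:
            out.append("-")
    return "".join(out)


def prune(candidates: List[str], guess: str, feedback: str) -> List[str]:
    """Keep only candidates consistent with the observed feedback."""
    return [word for word in candidates if score_guess(guess, word) == feedback]


# Hypothetical candidate pool, for illustration only.
candidates = ["stale", "least", "steal", "crane", "vixen"]
survivors = prune(candidates, "slate", "GYGYG")
```

With this pool, `survivors` is `["stale"]`: a single guess eliminates every other candidate. How quickly the pool shrinks per guess is one way to think about search efficiency.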

