SnorkelWordle
Overview
The Snorkel Wordle Benchmark is a diagnostic task that tests language models on their ability to solve the popular word game Wordle over 100 low-occurrence English words. The game requires LLMs to guess a hidden five-letter word within six attempts, receiving feedback after each guess in the form of partial matches (correct letter, wrong position) and exact matches (correct letter, correct position). This is a constrained linguistic reasoning problem where success depends on memory, planning, linguistic knowledge (such as orthography and common word structures), and the ability to incorporate feedback iteratively.
This benchmark is designed to probe:
• Linguistic Reasoning: Does the model understand how English words are formed and constrained?
• Iterative Instruction Following: Can the model refine its strategy based on feedback after each guess?
• Search Space Navigation: How efficiently can a model prune and explore word candidates?
Through this benchmark, we gain insight into how well models handle multi-turn reasoning tasks with intermediate feedback and limited action spaces, and into crucial capability differences between reasoning and non-reasoning models.
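The feedback rule described above can be sketched as a scoring function that marks each position G (exact match), Y (partial match), or B (absent), as in the sample transcript below. This is an illustrative reimplementation, not the benchmark harness itself:

```python
from collections import Counter

def score_guess(guess: str, target: str) -> str:
    """Return Wordle feedback as a string of G (exact), Y (partial), B (absent).

    Handles repeated letters the way Wordle does: exact matches are
    consumed first, and leftover letter counts cap the number of Y marks.
    """
    feedback = ["B"] * 5
    remaining = Counter()
    # First pass: mark exact matches and count unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
        else:
            remaining[t] += 1
    # Second pass: mark partial matches, limited by leftover counts.
    for i, g in enumerate(guess):
        if feedback[i] == "B" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)
```

For example, scoring the transcript's opening guess "slate" against the target "artsy" yields "YBYYB", matching the feedback the model analyzes in Turn 2.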
Model Comparison
Data Sample
Model: claude-opus-4-20250514 · Cost of this completion: $0.23
Solved in: 4 Turns
Turn 1
<think>Since this is my first guess, I don't have any feedback yet to work with. I should start with a word that contains common letters and vowels to maximize the information I can gather. A good starting word should ideally contain:
- Common vowels (like A, E, I, O)
- Common consonants (like R, S, T, L, N)
- No repeated letters (to test more letters at once)
Some popular starting words are "ADIEU" (contains 4 vowels), "SLATE" (common letters), "CRANE" (good mix), or "STARE" (very common letters).
I'll go with "SLATE" as it contains very common letters: S, L, A, T, E, which appear frequently in English words. This will help me identify or eliminate several common letters in one guess.</think>
<answer>slate</answer>

Turn 2
<think>Let me analyze the feedback for "slate" → YBYYB:
- S (position 1) → Y: S is in the word but not in position 1
- L (position 2) → B: L is not in the word at all
- A (position 3) → Y: A is in the word but not in position 3
- T (position 4) → Y: T is in the word but not in position 4
- E (position 5) → B: E is not in the word at all
So I know …

…
Turn 4
…
Let me think of 5-letter words that match A R T S _:
ARTSY - that works!
Let me verify:
- A in position 1 ✓
- R in position 2 ✓
- T in position 3 ✓
- S in position 4 ✓
- Y in position 5 (new letter) ✓</think>
<answer>artsy</answer>
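The constraint tracking the model performs across turns — greens pinned to positions, yellows required but barred from certain positions, blacks eliminated — can be sketched as a filter over a candidate list. The function and argument names are illustrative, and the sketch ignores the duplicate-letter subtlety where one guess can mark the same letter both yellow and black:

```python
def prune(candidates, greens, yellows, blacks):
    """Filter candidate words against accumulated Wordle constraints.

    greens:  dict mapping position -> required letter
    yellows: dict mapping letter -> set of forbidden positions
             (the letter must still appear somewhere in the word)
    blacks:  set of letters known to be absent
    """
    kept = []
    for word in candidates:
        # Green letters must sit exactly where they were confirmed.
        if any(word[i] != c for i, c in greens.items()):
            continue
        # Yellow letters must appear, but not at any excluded position.
        if any(c not in word or any(word[i] == c for i in pos)
               for c, pos in yellows.items()):
            continue
        # Black letters must not appear at all.
        if any(c in blacks for c in word):
            continue
        kept.append(word)
    return kept
```

Applying the Turn 2 constraints (S not at position 0, A not at position 2, T not at position 3, no L or E) to a toy list like ["artsy", "slate", "tarsi"] keeps only "artsy" and "tarsi".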

Evaluation Methodology
Model performance is primarily measured using Win Rate — the percentage of games in which the model successfully identifies the target word within six guesses.
In addition to this core metric, we track several diagnostic trends to better understand model behavior:
• Common First Guesses: Analyzing the distribution of initial guesses across models helps identify reliance on heuristics or pre-learned strategies.
• Linguistic Plausibility: We measure the rate of invalid guesses, defined as words that are either not five letters long or not found in standard English dictionaries (the NLTK corpus and PyEnchant). This provides insight into the model's grasp of English orthography and vocabulary constraints.
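A minimal version of this validity check, with a plain word set standing in for the NLTK corpus and PyEnchant dictionaries, might look like:

```python
import re

def is_valid_guess(guess: str, dictionary: set) -> bool:
    """A guess is valid iff it is exactly five ASCII letters and is a
    recognized English word.

    `dictionary` stands in for the union of the NLTK words corpus and
    PyEnchant's word list used in the actual benchmark.
    """
    word = guess.strip().lower()
    return bool(re.fullmatch(r"[a-z]{5}", word)) and word in dictionary
```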
Each model is evaluated on a fixed set of 100 target words. These words are selected from the lower-frequency end of the English lexicon, based on average usage data from the Google Books Ngram Corpus, to avoid overly common or trivial targets.
Both intermediate guesses and final answers are evaluated using regex-based pattern matching to ensure consistency and accuracy in automated grading.
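The exact patterns used by the grader are not shown here, but an extraction step matching the `<answer>` format in the sample transcript could look like:

```python
import re

# Illustrative pattern: a five-letter word wrapped in <answer> tags,
# tolerating surrounding whitespace and mixed case.
ANSWER_RE = re.compile(r"<answer>\s*([a-zA-Z]{5})\s*</answer>", re.IGNORECASE)

def extract_answer(completion: str):
    """Pull the model's guess from its completion; None if no well-formed tag."""
    m = ANSWER_RE.search(completion)
    return m.group(1).lower() if m else None
```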