Archived

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.

Overview

The Snorkel Wordle Benchmark is a diagnostic task that tests language models on their ability to solve the popular word game Wordle on a set of 100 low-frequency English words. In each game, the model must guess a hidden five-letter word within six attempts. After each guess it receives feedback in the form of exact matches (correct letter, correct position) and partial matches (correct letter, wrong position).
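The feedback rule described above can be sketched in a few lines of Python. This is an illustration, not the benchmark's own implementation: the function name `score_guess`, the `G`/`Y`/`-` symbols, and the handling of repeated letters (which follows the usual Wordle convention) are all assumptions.

```python
from collections import Counter


def score_guess(guess: str, target: str) -> str:
    """Return one symbol per letter: G = exact match, Y = partial match, - = absent.

    The symbols are an assumption made for this sketch; the benchmark's
    actual feedback format is not specified here.
    """
    guess, target = guess.lower(), target.lower()
    feedback = ["-"] * len(guess)

    # Target letters that were not consumed by an exact match.
    remaining = Counter(t for g, t in zip(guess, target) if g != t)

    # First pass: mark exact matches.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"

    # Second pass: mark partial matches, respecting letter counts.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)


print(score_guess("slate", "stale"))
```

The two-pass structure matters for repeated letters: a guess with three copies of a letter that appears once in the target should receive credit for only one of them.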

This is a constrained reasoning problem in linguistics where success depends on memory, planning, linguistic knowledge (such as orthography and common word structures), and the ability to incorporate feedback iteratively.

Leaderboard

Rank	Model	Score
1	gpt-5	94%
2	Grok 4	93%
3	o3	92.9%
4	o4 mini	91.9%
5	Gemini 3 Pro	91%
6	gpt-5-mini	91%
7	o3-mini	90%
8	Grok 4 Fast Reasoning	88%
9	Claude Opus 4	85.6%
10	Kimi-K2-Thinking	85%
11	Claude Sonnet 4	83%
12	gpt-oss-120b	81.6%
13	gpt-5-nano	79%
14	Gemini 2.5 Pro	74%
15	Grok 3	71%
16	Claude Sonnet 3.7	68%
17	gpt-oss-20b	65.9%
18	GPT-4.1	62%
19	Gemini 2.5 Flash	61.9%
20	Kimi-K2	54%
21	Llama 3.3 70B	10.2%

Data sample

MODEL

claude-opus-4-20250514

solved in

4 Turns

cost of this completion

$0.23

<think>Since this is my first guess, I don't have any feedback yet to work with. I should start with a word that contains common letters and vowels to maximize the information I can gather. A good starting word should ideally contain:
- Common vowels (like A, E, I, O)
- Common consonants (like R, S, T, L, N)
- No repeated letters (to test more letters at once)

Some popular starting words are "ADIEU" (contains 4 vowels), "SLATE" (common letters), "CRANE" (good mix), or "STARE" (very common letters).

I'll go with "SLATE" as it contains very common letters: S, L, A, T, E, which appear frequently in English words. This will help me identify or eliminate several common letters in one guess.</think>

<answer>slate</answer>

Methodology

primary metric

Win Rate: percentage of games where the model successfully identifies the target word within six guesses.
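As a rough sketch of how this metric could be computed, assume each game is recorded as the number of guesses used, or `None` when the word was never found. The record format and the function name `win_rate` are hypothetical.

```python
from typing import Optional, Sequence


def win_rate(guess_counts: Sequence[Optional[int]], max_guesses: int = 6) -> float:
    """Percentage of games solved within max_guesses.

    guess_counts holds, per game, the number of guesses used to find the
    target, or None if the target was never found (assumed record format).
    """
    if not guess_counts:
        raise ValueError("no games to score")
    wins = sum(1 for n in guess_counts if n is not None and n <= max_guesses)
    return 100.0 * wins / len(guess_counts)


# Four games: solved in 4, unsolved, solved in 6, and "solved" too late.
print(win_rate([4, None, 6, 7]))
```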

word set

100 low-frequency English words selected using the Google Books Ngram Corpus to avoid trivial or overfit targets.

diagnostics

Common first guesses and linguistic plausibility (the rate of invalid guesses, checked against an NLTK corpus and PyEnchant).
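One way to picture the invalid-guess diagnostic is shown below. The sketch takes the reference vocabulary as a plain set of words so that it stays self-contained; in practice that set would presumably come from the dictionaries named above. The function name, the five-letter check, and the toy vocabulary are assumptions.

```python
import re

# A guess must be exactly five ASCII letters to count as well-formed.
FIVE_LETTERS = re.compile(r"[a-z]{5}")


def invalid_guess_rate(guesses, vocabulary):
    """Fraction of guesses that are malformed or missing from the vocabulary."""
    if not guesses:
        raise ValueError("no guesses to evaluate")
    vocab = {w.lower() for w in vocabulary}
    invalid = sum(
        1
        for g in guesses
        if not FIVE_LETTERS.fullmatch(g.lower()) or g.lower() not in vocab
    )
    return invalid / len(guesses)


# Toy vocabulary for illustration only.
print(invalid_guess_rate(["slate", "xqzvv", "crane", "toolong"], {"slate", "crane"}))
```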

grading

Regex-based pattern matching on both intermediate guesses and final answers for consistent automated scoring.
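A minimal sketch of such a grader is given below, assuming completions wrap the guess in `<answer>` tags as in the data sample. The exact pattern, the choice to take the last tagged answer, and the name `extract_guess` are assumptions rather than the benchmark's actual rules.

```python
import re
from typing import Optional

# Matches a five-letter word inside <answer>...</answer>, ignoring case
# and surrounding whitespace (assumed format, based on the data sample).
ANSWER_RE = re.compile(r"<answer>\s*([A-Za-z]{5})\s*</answer>", re.IGNORECASE)


def extract_guess(completion: str) -> Optional[str]:
    """Return the last well-formed tagged guess in lowercase, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].lower() if matches else None


sample = "<think>I'll go with \"SLATE\".</think>\n\n<answer>slate</answer>"
print(extract_guess(sample))
```

Words mentioned only inside the reasoning text are ignored, so a model that discusses several candidates is still graded on the one it commits to.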

Behind the benchmark

This benchmark is designed to probe:

Linguistic reasoning

Does the model understand how English words are formed and constrained?

Iterative instruction following

Can the model refine its strategy based on feedback after each guess?

Search space navigation

How efficiently can a model prune and explore word candidates?
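To make the pruning idea concrete, here is a small sketch of feedback-based filtering: a candidate survives only if it would have produced the same feedback as the one actually observed. The helper names, the `G`/`Y`/`-` symbols, and the toy candidate list are assumptions, and nothing here claims that models solve the task this way.

```python
from collections import Counter


def score_guess(guess: str, target: str) -> str:
    """G = exact match, Y = partial match, - = absent (symbols assumed)."""
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    out = []
    for g, t in zip(guess, target):
        if g == t:
            out.append("G")
        elif remaining[g] > 0:
            out.append("Y")
            remaining[g] -= 1
        else:
            out.append("-")
    return "".join(out)


def prune(candidates, guess, feedback):
    """Keep candidates that are consistent with the observed feedback."""
    return [w for w in candidates if score_guess(guess, w) == feedback]


candidates = ["stale", "slate", "crane", "least", "steal"]
observed = score_guess("slate", "stale")  # feedback if the hidden word is "stale"
print(prune(candidates, "slate", observed))
```

Each guess then acts as a filter on the remaining search space, and a good guess is one that splits the candidates into many small groups.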

Through this benchmark, we gain insights into how well models handle multi-turn reasoning tasks with intermediate feedback and limited action spaces. It also surfaces crucial differences in capability between reasoning and non-reasoning models.
