SnorkelWordle
Overview
The Snorkel Wordle Benchmark is a diagnostic task that tests language models on their ability to solve the popular word game Wordle over 100 low-occurrence English words. The game requires LLMs to guess a hidden five-letter word within six attempts, receiving feedback after each guess in the form of partial matches (correct letter, wrong position) and exact matches (correct letter, correct position). This is a constrained linguistic reasoning problem where success depends on memory, planning, linguistic knowledge (such as orthography and common word structures), and the ability to incorporate feedback iteratively.
This benchmark is designed to probe:
• Linguistic Reasoning: Does the model understand how English words are formed and constrained?
• Iterative Instruction Following: Can the model refine its strategy based on feedback after each guess?
• Search Space Navigation: How efficiently can a model prune and explore word candidates?
Through this benchmark, we gain insight into how well models handle multi-turn reasoning tasks with intermediate feedback and limited action spaces, and into crucial capability differences between reasoning and non-reasoning models.
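The feedback rule described above can be sketched as a scoring function that marks each position G (exact match), Y (partial match), or B (absent), as in the sample transcript below. This is an illustrative reimplementation, not the benchmark harness itself:

```python
from collections import Counter

def score_guess(guess: str, target: str) -> str:
    """Return Wordle feedback as a string of G (exact), Y (partial), B (absent).

    Handles repeated letters the way Wordle does: exact matches are
    consumed first, and leftover letter counts cap the number of Y marks.
    """
    feedback = ["B"] * 5
    remaining = Counter()
    # First pass: mark exact matches and count unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
        else:
            remaining[t] += 1
    # Second pass: mark partial matches, limited by leftover counts.
    for i, g in enumerate(guess):
        if feedback[i] == "B" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)
```

For example, scoring the transcript's opening guess "slate" against the target "artsy" yields "YBYYB", matching the feedback the model analyzes in Turn 2.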
Model Comparison
Data Sample
Model: claude-opus-4-20250514 · Cost of this completion: $0.23
Solved in: 4 Turns
Turn 1
<think>Since this is my first guess, I don't have any feedback yet to work with. I should start with a word that contains common letters and vowels to maximize the information I can gather. A good starting word should ideally contain:
- Common vowels (like A, E, I, O)
- Common consonants (like R, S, T, L, N)
- No repeated letters (to test more letters at once)
Some popular starting words are "ADIEU" (contains 4 vowels), "SLATE" (common letters), "CRANE" (good mix), or "STARE" (very common letters).
I'll go with "SLATE" as it contains very common letters: S, L, A, T, E, which appear frequently in English words. This will help me identify or eliminate several common letters in one guess.</think>
<answer>slate</answer>

Turn 2
<think>Let me analyze the feedback for "slate" → YBYYB:
- S (position 1) → Y: S is in the word but not in position 1
- L (position 2) → B: L is not in the word at all
- A (position 3) → Y: A is in the word but not in position 3
- T (position 4) → Y: T is in the word but not in position 4
- E (position 5) → B: E is not in the word at all
So I know …

…
Turn 4
…
Let me think of 5-letter words that match A R T S _:
ARTSY - that works!
Let me verify:
- A in position 1 ✓
- R in position 2 ✓
- T in position 3 ✓
- S in position 4 ✓
- Y in position 5 (new letter) ✓</think>
<answer>artsy</answer>
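The constraint tracking the model performs across turns — greens pinned to positions, yellows required but barred from certain positions, blacks eliminated — can be sketched as a filter over a candidate list. The function and argument names are illustrative, and the sketch ignores the duplicate-letter subtlety where one guess can mark the same letter both yellow and black:

```python
def prune(candidates, greens, yellows, blacks):
    """Filter candidate words against accumulated Wordle constraints.

    greens:  dict mapping position -> required letter
    yellows: dict mapping letter -> set of forbidden positions
             (the letter must still appear somewhere in the word)
    blacks:  set of letters known to be absent
    """
    kept = []
    for word in candidates:
        # Green letters must sit exactly where they were confirmed.
        if any(word[i] != c for i, c in greens.items()):
            continue
        # Yellow letters must appear, but not at any excluded position.
        if any(c not in word or any(word[i] == c for i in pos)
               for c, pos in yellows.items()):
            continue
        # Black letters must not appear at all.
        if any(c in blacks for c in word):
            continue
        kept.append(word)
    return kept
```

Applying the Turn 2 constraints (S not at position 0, A not at position 2, T not at position 3, no L or E) to a toy list like ["artsy", "slate", "tarsi"] keeps only "artsy" and "tarsi".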

Evaluation Methodology
Model performance is primarily measured using Win Rate — the percentage of games in which the model successfully identifies the target word within six guesses.
In addition to this core metric, we track several diagnostic trends to better understand model behavior:
• Common First Guesses: Analyzing the distribution of initial guesses across models helps identify reliance on heuristics or pre-learned strategies.
• Linguistic Plausibility: We measure the rate of invalid guesses, defined as words that are either not five letters long or not found in standard English dictionaries (the NLTK corpus and PyEnchant). This provides insight into the model's grasp of English orthography and vocabulary constraints.
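A minimal version of this validity check, with a plain word set standing in for the NLTK corpus and PyEnchant dictionaries, might look like:

```python
import re

def is_valid_guess(guess: str, dictionary: set) -> bool:
    """A guess is valid iff it is exactly five ASCII letters and is a
    recognized English word.

    `dictionary` stands in for the union of the NLTK words corpus and
    PyEnchant's word list used in the actual benchmark.
    """
    word = guess.strip().lower()
    return bool(re.fullmatch(r"[a-z]{5}", word)) and word in dictionary
```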
Each model is evaluated on a fixed set of 100 target words. These words are selected from the lower-frequency end of the English lexicon, based on average usage data from the Google Books Ngram Corpus, to avoid overly common or trivial targets.
Both intermediate guesses and final answers are evaluated using regex-based pattern matching to ensure consistency and accuracy in automated grading.
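The exact patterns used by the grader are not shown here, but an extraction step matching the `<answer>` format in the sample transcript could look like:

```python
import re

# Illustrative pattern: a five-letter word wrapped in <answer> tags,
# tolerating surrounding whitespace and mixed case.
ANSWER_RE = re.compile(r"<answer>\s*([a-zA-Z]{5})\s*</answer>", re.IGNORECASE)

def extract_answer(completion: str):
    """Pull the model's guess from its completion; None if no well-formed tag."""
    m = ANSWER_RE.search(completion)
    return m.group(1).lower() if m else None
```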