SnorkelWordle
The Snorkel Wordle Benchmark is a diagnostic task that tests language models on their ability to solve the popular word game Wordle on 100 low-occurrence English words. The game requires the model to guess a hidden five-letter word within six attempts, receiving feedback after each guess in the form of partial matches (correct letter, wrong position) and exact matches (correct letter, correct position).
This is a constrained reasoning problem in linguistics where success depends on memory, planning, linguistic knowledge (such as orthography and common word structures), and the ability to incorporate feedback iteratively.
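The feedback rule and its iterative use can be illustrated with a minimal sketch. It is not the benchmark's actual harness. The function names `score_guess` and `filter_candidates`, the `G`/`Y`/`-` encoding, and the handling of repeated letters (following the standard Wordle convention) are all assumptions made for illustration.

```python
from collections import Counter

def score_guess(guess: str, target: str) -> str:
    """Return a 5-character feedback string (assumed encoding):
    'G' = exact match, 'Y' = partial match, '-' = letter not available."""
    guess, target = guess.lower(), target.lower()
    if len(guess) != 5 or len(target) != 5:
        raise ValueError("both words must have exactly five letters")

    marks = ["-"] * 5
    # Count target letters that were NOT matched exactly; only these
    # can still produce a partial match (assumed duplicate-letter rule).
    remaining = Counter(t for g, t in zip(guess, target) if g != t)

    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            marks[i] = "G"
    for i, g in enumerate(guess):
        if marks[i] == "-" and remaining[g] > 0:
            marks[i] = "Y"
            remaining[g] -= 1
    return "".join(marks)

def filter_candidates(candidates, guess, feedback):
    """Keep only words that would have produced the observed feedback."""
    return [w for w in candidates if score_guess(guess, w) == feedback]

# Hypothetical candidate list, for illustration only.
words = ["stale", "slate", "steal", "least", "crane"]
print(score_guess("slate", "stale"))
print(filter_candidates(words, "slate", "GYGYG"))
```

Filtering candidates after every guess is one simple way to incorporate feedback; a model solving the task has to do the equivalent implicitly from its own vocabulary.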
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | gpt-5 | 94% |
| 2 | Grok 4 | 93% |
| 3 | o3 | 92.9% |
| 4 | o4 mini | 91.9% |
| 5 | Gemini 3 Pro | 91% |
| 6 | gpt-5-mini | 91% |
| 7 | o3-mini | 90% |
| 8 | Grok 4 Fast Reasoning | 88% |
| 9 | Claude Opus 4 | 85.6% |
| 10 | Kimi-K2-Thinking | 85% |
| 11 | Claude Sonnet 4 | 83% |
| 12 | gpt-oss-120b | 81.6% |
| 13 | gpt-5-nano | 79% |
| 14 | Gemini 2.5 Pro | 74% |
| 15 | Grok 3 | 71% |
| 16 | Claude Sonnet 3.7 | 68% |
| 17 | gpt-oss-20b | 65.9% |
| 18 | GPT-4.1 | 62% |
| 19 | Gemini 2.5 Flash | 61.9% |
| 20 | Kimi-K2 | 54% |
| 21 | Llama 3.3 70B | 10.2% |
Data sample
<think>Since this is my first guess, I don't have any feedback yet to work with. I should start with a word that contains common letters and vowels to maximize the information I can gather. A good starting word should ideally contain:
- Common vowels (like A, E, I, O)
- Common consonants (like R, S, T, L, N)
- No repeated letters (to test more letters at once)
Some popular starting words are "ADIEU" (contains 4 vowels), "SLATE" (common letters), "CRANE" (good mix), or "STARE" (very common letters).
I'll go with "SLATE" as it contains very common letters: S, L, A, T, E, which appear frequently in English words. This will help me identify or eliminate several common letters in one guess.</think>
<answer>slate</answer>
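A response in this shape can be turned into a guess with a small parser. The sketch below assumes the `<answer>` tag format shown in the sample. The function name `extract_guess` and the choice to take the last `<answer>` tag are illustrative assumptions, not a description of the benchmark's own parsing.

```python
import re
from typing import Optional

# Matches the contents of an <answer>...</answer> tag (letters only).
ANSWER_RE = re.compile(r"<answer>\s*([A-Za-z]+)\s*</answer>", re.IGNORECASE)

def extract_guess(response: str) -> Optional[str]:
    """Return the lowercase five-letter guess, or None if none is found."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    guess = matches[-1].lower()  # assumption: the last answer tag wins
    return guess if len(guess) == 5 else None

sample = "<think>Start with common letters.</think>\n<answer>slate</answer>"
print(extract_guess(sample))
```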
Methodology
Reported analyses: common first guesses and linguistic plausibility (invalid guess rate, measured against the NLTK corpus and PyEnchant).
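As a rough illustration of the invalid guess rate, the sketch below computes the fraction of guesses rejected by a dictionary check. The source does not say how the NLTK and PyEnchant checks are combined, so the check is passed in as a callable. `invalid_guess_rate` and `TOY_VOCAB` are hypothetical names, and the toy vocabulary stands in for a real word list.

```python
from typing import Callable, Iterable

def invalid_guess_rate(guesses: Iterable[str],
                       is_word: Callable[[str], bool]) -> float:
    """Fraction of guesses that are not valid five-letter words."""
    guesses = list(guesses)
    if not guesses:
        return 0.0
    invalid = sum(
        1 for g in guesses
        if len(g) != 5 or not g.isalpha() or not is_word(g.lower())
    )
    return invalid / len(guesses)

# Toy stand-in vocabulary (assumption). With the real libraries one could
# instead pass set(nltk.corpus.words.words()).__contains__ or
# enchant.Dict("en_US").check as the is_word callable.
TOY_VOCAB = {"slate", "crane", "stare", "adieu"}

rate = invalid_guess_rate(["slate", "crane", "xqzzv", "adieu"],
                          TOY_VOCAB.__contains__)
print(f"{rate:.2%}")
```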

