LEADERBOARDS
Benchmarks for what frontier AI hasn't solved
Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.
Agentic Coding
A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.
Top models
1. Claude Opus 4.6: 65.2%
2. Claude Opus 4.5: 58%
3. Claude Sonnet 4.5: 57.6%
gpt-5: 45.2%
Kimi-K2-Thinking: 36.8%
Devstral 2: 33.2%
Grok 4.1 Fast: 25.2%
Qwen 3 Coder 480B: 18.8%
Mistral Large 3: 13.8%
SnorkelUnderwrite
An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
Top models
1. GPT-5.4: 91%
2. Claude Opus 4.1: 86.3%
3. gpt-5: 83.33%
Grok 3: 78%
o4 mini: 78%
Claude Opus 4: 77%
o3: 77%
Claude Sonnet 3.7: 74.6%
Claude Sonnet 4: 72.3%
gpt-5-mini: 71.67%
Kimi-K2-Thinking: 71.3%
GPT-4.1: 70.6%
Gemini 2.5 Flash: 61%
Nova Premier: 57%
Gemini 2.5 Pro: 56.3%
Nova Pro: 52.3%
gpt-5-nano: 47%
Llama 3.3 70B: 46.3%
Llama 4 Maverick: 46.3%
Llama 4 Scout: 44.3%
o3-mini: 44.3%
Nova Lite: 40%
Mistral Large: 38.3%
Codestral: 34%
Nova Micro: 31%
gpt-oss-120b: 30%
Magistral Medium: 29.3%
Command R+: 25.7%
Qwen 3 235B: 21.3%
Llama 3.1 405B: 20%
Command R: 15.3%
Finance Reasoning
A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions through tool calling and planning.
Top models
1. GPT-5.4: 52%
2. Grok 4: 53.1%
3. Claude Sonnet 3.7: 51.89%
Claude Opus 4: 48.1%
Gemini 3 Pro: 46.84%
gpt-5-mini: 46.8%
o4 mini: 45.57%
Claude Opus 4.1: 45.56%
GPT-4.1: 44.3%
o3: 43.04%
Grok 3: 41.8%
Grok 4 Fast Reasoning: 40.51%
NVIDIA Nemotron Super 49B v1.5: 35.443%
Kimi-K2-Thinking: 35%
Gemini 2.5 Pro: 34.6%
Nova Premier: 34.17%
Gemini 2.5 Flash: 32%
gpt-oss-120b: 31.6%
o3-mini: 30.37%
gpt-5-nano: 26.6%
Qwen 3 235B: 17.7%
Magistral Medium: 13.92%
Nova Pro: 12.65%
Mistral Large: 10.12%
SnorkelSequences
A procedurally generated, expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
Top models
1. gpt-5: 77.6%
2. gpt-5-mini: 77.6%
3. gpt-5-nano: 72%
Gemini 2.5 Flash: 70.8%
Claude Sonnet 4: 70.4%
Grok 4 Fast Reasoning: 70.2%
o4 mini: 68.8%
NVIDIA Nemotron Super 49B v1.5: 66.8%
Gemini 2.5 Pro: 66%
Claude Opus 4: 65.6%
o3: 65.2%
Grok 4: 63.2%
Llama 4 Maverick: 62%
Nova Premier: 51.8%
Llama 4 Scout: 48.4%
Claude Sonnet 3.7: 47.6%
Magistral Medium: 47.6%
Nvidia nemotron super 49B: 44.8%
Nova Pro: 41.2%
Nova Lite: 40%
Grok 3: 39.2%
Llama 3.3 70B: 38.8%
Mistral Large: 38.8%
Codestral: 38.4%
GPT-4.1: 36.8%
Nvidia 70B Instruct: 36.4%
Kimi-K2-Thinking: 36%
Llama 3.1 405B: 35.2%
Nova Micro: 33.6%
Qwen 3 235B: 28%
SnorkelSpatial
A procedurally generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.
Top models
1. GPT-5.4: 99%
2. Grok 4 Fast Reasoning: 84.85%
3. o3: 76.67%
gpt-5-mini: 45.45%
Claude Opus 4.1: 45.15%
Magistral Medium 1.2: 44.24%
Claude Opus 4: 40.3%
o3-mini: 37.88%
Claude Sonnet 4: 33.33%
gpt-5-nano: 26.67%
Claude Sonnet 3.7: 21.52%
Gemini 2.5 Flash: 18.79%
Llama 4 Scout: 15.45%
Gemini 2.5 Pro: 15.15%
gpt-5-chat: 14.85%
Mistral Large: 14.85%
o4 mini: 14.85%
GPT-4.1: 14.55%
Llama 3.3 70B: 14.55%
Mistral Medium 3.1: 14.55%
Nova Micro: 14.55%
Command R+: 14.24%
Nova Premier: 14.24%
Qwen 3 235B: 13.94%
Codestral: 13.64%
Nova Lite: 13.33%
Grok 3: 12.73%
Magistral Medium: 12.42%
Llama 4 Maverick: 12.12%
Nova Pro: 12.12%
Command R: 11.82%
SnorkelWordle
A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.
Top models
1. gpt-5: 94%
2. Grok 4: 93%
3. o3: 92.9%
gpt-5-mini: 91%
o3-mini: 90%
Grok 4 Fast Reasoning: 88%
Claude Opus 4: 85.6%
Kimi-K2-Thinking: 85%
Claude Sonnet 4: 83%
gpt-oss-120b: 81.6%
gpt-5-nano: 79%
Gemini 2.5 Pro: 74%
Grok 3: 71%
Claude Sonnet 3.7: 68%
gpt-oss-20b: 65.9%
GPT-4.1: 62%
Gemini 2.5 Flash: 61.9%
Kimi-K2: 54%
Llama 3.3 70B: 10.2%
SnorkelGraph
A procedurally generated, expert-verified benchmark for evaluating the mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.
Top models
1. GPT-5.4: 84.5%
2. Grok 4 Fast Reasoning: 75%
3. o4 mini: 75%
o3: 71.5%
o3-mini: 71%
Claude Opus 4: 64.5%
Grok 3: 64%
GPT-4.1: 63%
gpt-5-nano: 62.5%
Qwen 3 235B: 61.5%
Grok 4: 61%
Claude Sonnet 4: 58%
Gemini 2.5 Pro: 58%
Gemini 2.5 Flash: 55%
Magistral Medium: 53.5%
Claude Sonnet 3.7: 50%
Nova Premier: 34.5%
Llama 4 Maverick: 34%
Mistral Large: 30%
Nvidia nemotron super 49B: 29%
Nova Pro: 28%
Llama 4 Scout: 26%
Codestral: 24.5%
Llama 3.3 70B: 23.5%
Nvidia 70B Instruct: 22.5%
Llama 3.1 405B: 20.5%
Nova Lite: 19%
Nova Micro: 17.5%
Command R+: 15%
Command-Light: 10.5%
Command: 10%
SnorkelFinance
A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.
Top models
1. gpt-5: 81%
2. o3: 81%
3. Gemini 3 Pro: 80.34%
Claude Opus 4: 78.3%
Claude Sonnet 3.7: 77.9%
Claude Sonnet 4: 76.6%
o4 mini: 76.6%
Grok 4: 74.04%
Grok 4 Fast Reasoning: 73.45%
Kimi-K2-Thinking: 71.7%
gpt-oss-120b: 66.6%
Grok 3: 65.86%
o3-mini: 63.79%
GPT-4.1: 62.7%
Nova Premier: 62.06%
Gemini 2.5 Pro: 60.6%
Gemini 2.5 Flash: 53.1%
Qwen 3 235B: 51.37%
gpt-5-nano: 50%
NVIDIA Nemotron Super 49B v1.5: 44%
Nova Pro: 40.34%
Codestral: 27.6%
Nova Lite: 16.89%
Magistral Medium: 16.2%
Nova Micro: 14.48%
Mistral Large: 13.4%
IN DEVELOPMENT
Open Benchmarks Grants
Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.
Looking ahead
Three core dimensions where today's benchmarks fall short
Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.
01 Environment complexity
How dynamic is the operating environment? Real systems are far more complex than today's benchmarks.
02 Autonomy horizon
How independently can the agent operate before reliability breaks down?
03 Output complexity
How sophisticated is the deliverable agents must produce?



