LEADERBOARDS

Benchmarks for what frontier AI hasn't solved

Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.

Open Benchmarks Grants

Agents’ Last Exam

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks drawn from a corpus of 1,500+ tasks, sourced and validated by 300+ industry experts.

Top Submissions (Pass Rate)

1. Codex · GPT-5.5: 24%
2. ALE Claw · GPT-5.5: 23%
3. Claude Code · Claude-Fable-5: 22%

Open Benchmarks Grants

Continual Learning Bench

Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.

Top Systems (Agg. Reward)

1. ICL · Claude Sonnet 4.6: +0.223
2. ICL · GPT-5.4: +0.201
3. Claude Code · Sonnet 4.6: +0.190

Open Benchmarks Grants

SlopCode Bench

Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.

Top Models by Iso Solve

1. GPT-5.5: 28.06%
2. GPT-5.3-Codex: 26.02%
3. GPT-5.4: 23.47%

Open Benchmarks Grants

Terminal-Bench 2.1

Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.

Top Submissions

1. Codex CLI · GPT-5.5: 83.4%
2. Claude Code · Claude 5 Fable: 83.1%
3. Terminus 2 · Claude 5 Fable: 80.4%

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Top Models

1. Claude Opus 4.6: 65.2%
2. Claude Opus 4.5: 58.0%
3. Claude Sonnet 4.5: 57.6%

View 8 archived benchmarks

IN DEVELOPMENT

Open Benchmarks Grants

Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.

Featured Collaborations

Computer tasks

terminal-bench 3.0

Real terminal tasks that expose where today's coding agents fail.

Natural sciences

terminal-bench-science

Generic code evals miss sloppy code. This measures what they ignore.

Legal agents

Harvey’s Long Horizon Legal Agent Benchmark

Built to evaluate and improve agent capabilities for supporting legal work.

Evaluation methods

JudgmentBench

Compares rubric-based and preference-based evaluation for judging output quality.

Looking ahead

Three core dimensions where today's benchmarks fall short

Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.

01. Environment complexity

How dynamic is the operating environment? Real systems are far more complex than today's benchmarks.

02. Autonomy horizon

How independently can the agent operate before reliability breaks down?

03. Output complexity

How sophisticated is the deliverable the agent must produce?

For models that need to be right. Not just good enough.

Terminal-Bench 2.0

Finance Reasoning

SnorkelSequences

SnorkelFinance

SnorkelGraph

SnorkelUnderwrite

SnorkelWordle

SnorkelSpatial

How do you want to work with Snorkel?