LEADERBOARDS

Benchmarks for what frontier AI hasn't solved

Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.

Senior SWE-bench

A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that follows existing codebase conventions.

Tasteful Solve Rate

Even the top-performing frontier models fail, more than 70% of the time, to complete tasks with senior-level correctness and taste.

Open Benchmarks Grants

OSWorld 2.0

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

By binary accuracy (300 steps)

Claude Opus 4.7 · max: 13%
3. Claude Sonnet 4.6 · medium: 8.3%

Open Benchmarks Grants

Agents’ Last Exam

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

Top Submissions (Pass Rate)

1. Codex · GPT 5.6-Sol · XHigh: 30.6%
2. Codex · GPT 5.6-Sol · High: 30.6%
3. Codex · GPT 5.6-Sol · Max: 29.6%

Open Benchmarks Grants

Continual Learning Bench

Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.

Top Systems (Agg. Reward)

1. ICL · Claude Sonnet 4.6 (Claude Code · Sonnet 4.6): +0.190

Open Benchmarks Grants

SlopCode Bench

Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.

Top Models by Iso Solve

Open Benchmarks Grants

Terminal-Bench 2.1

Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.

Claude Code · Claude 5 Fable: 83.1%
3. Terminus 2 · Claude 5 Fable: 80.4%

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

View 8 archived benchmarks

IN DEVELOPMENT

Open Benchmarks Grants

Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.

Featured collaborations

Computer tasks

terminal-bench 3.0

Real terminal tasks — exposing where today's coding agents fail.

Natural sciences

terminal-bench-science

Generic code evals miss sloppy code. This measures what they ignore.

Legal agents

Harvey’s Long Horizon Legal Agent Benchmark

Built to evaluate and improve agent capabilities for supporting legal work.

Evaluation methods

JudgmentBench

Compares rubric-based and preference-based evaluation for judging output quality.

Looking ahead

Three core dimensions where today's benchmarks fall short

Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.

01

Environment complexity

How dynamic is the operating environment? Real systems are far more complex than today's benchmarks.

02

Autonomy horizon

How independently can the agent operate before reliability breaks down?

03

Output complexity

How sophisticated is the deliverable agents must produce?

For models that need to be right. Not just good enough.

Terminal-Bench 2.0

Finance Reasoning

SnorkelSequences

SnorkelFinance

SnorkelGraph

SnorkelUnderwrite

SnorkelWordle

SnorkelSpatial
