LEADERBOARDS

Benchmarks for what frontier AI hasn't solved

Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.
Open Benchmarks Grants

Agents’ Last Exam

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks drawn from a corpus of 1,500+ tasks, sourced and validated by 300+ industry experts.

Top Submissions (Pass Rate)
1. Codex · GPT-5.5: 24%
2. ALE Claw · GPT-5.5: 23%
3. Claude Code · Claude-Fable-5: 22%
Open Benchmarks Grants

Continual Learning Bench

Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.

Top Systems (Agg. Reward)
1. ICL · Claude Sonnet 4.6: +0.223
2. ICL · GPT-5.4: +0.201
3. Claude Code · Sonnet 4.6: +0.190
Open Benchmarks Grants

SlopCode Bench

Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.

Top Models by Iso Solve
1. GPT-5.5: 28.06%
2. GPT-5.3-Codex: 26.02%
3. GPT-5.4: 23.47%
Open Benchmarks Grants

Terminal-Bench 2.1

Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.

Top Submissions
1. Codex CLI · GPT-5.5: 83.4%
2. Claude Code · Claude 5 Fable: 83.1%
3. Terminus 2 · Claude 5 Fable: 80.4%

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Top Models
1. Claude Opus 4.6: 65.2%
2. Claude Opus 4.5: 58.0%
3. Claude Sonnet 4.5: 57.6%

IN DEVELOPMENT

Open Benchmarks Grants

Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.


Looking ahead

Three core dimensions where today's benchmarks fall short

Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.
01. Environment complexity
How dynamic is the operating environment? Real systems are far more complex than today's benchmarks.

02. Autonomy horizon
How independently can the agent operate before reliability breaks down?

03. Output complexity
How sophisticated is the deliverable agents must produce?

For models that need to be right. Not just good enough.