Released November 07, 2025
Archived

Terminal-Bench 2.0

A benchmark for terminal agents featuring 89 hard, human-verified tasks in containerized environments. Agents are scored by task resolution rate, and the tasks were contributed by a community of nearly 100 developers.
Built with Harbor
Overview

Terminal-Bench is a joint project between Stanford University and Laude Institute. The original benchmark passed 1,000 GitHub stars and drew contributions from nearly 100 developers worldwide. The 2.0 release raised the bar with 89 carefully curated tasks designed to keep frontier-model performance under a 50% ceiling.

Each task runs in a unique Docker container with a human-written oracle solution and tests that verify the final container state. The 2.0 release dropped easier tasks (like the original "Hello World" debugger), eliminated unreproducible items (like the YouTube-download task affected by changing anti-bot protections), and tightened specifications so that near-100% performance is attainable for sufficiently capable agents.

Inside the leaderboard

All 142 published submissions on Terminal-Bench 2.0 at a glance — how scores distribute, where the median sits, and which providers cluster at the top.

Published entries: 142
Median score: 44%
Share of entries scoring below 30%: 28.2%
Share of entries scoring 70% or above: 15.5%
Score distribution: count of submissions per 10-point bucket of task resolution rate

Score bucket (%)   Submissions
0-10               6
10-20              13
20-30              21
30-40              17
40-50              20
50-60              21
60-70              22
70-80              15
80-90              7
The 50–60% and 60–70% buckets together hold 43 of the 142 submissions (about 30%), the densest 20-point band; 7 submissions score 80% or higher.
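The headline figures can be re-derived from the bucket counts alone. The following is a minimal sketch, assuming each bucket is half-open (a score of exactly 30% falls in 30-40); the page does not state how boundary scores are assigned, and the variable and function names are illustrative.

```python
# Bucket counts copied from the score distribution, keyed by each
# bucket's lower bound in percent. Assumption: buckets are [lo, lo + 10).
bucket_counts = {0: 6, 10: 13, 20: 21, 30: 17, 40: 20, 50: 21, 60: 22, 70: 15, 80: 7}

total = sum(bucket_counts.values())


def share(lo, hi):
    """Percentage of submissions in buckets whose lower bound is in [lo, hi)."""
    n = sum(count for bound, count in bucket_counts.items() if lo <= bound < hi)
    return round(100 * n / total, 1)


def median_bucket():
    """Lower bound of the bucket that contains the median submission."""
    midpoint = (total + 1) / 2  # position of the median in sorted order
    running = 0
    for bound in sorted(bucket_counts):
        running += bucket_counts[bound]
        if running >= midpoint:
            return bound


below_30 = share(0, 30)
at_or_above_70 = share(70, 100)
at_or_above_80 = bucket_counts[80]

print(total, below_30, at_or_above_70, median_bucket(), at_or_above_80)
```

The histogram only locates the median within a bucket (40-50), which is consistent with the reported 44% but cannot reproduce it exactly.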
Provider concentration in top 30 (tagged by backbone model, not by agent harness)

Provider    Entries in top 30   Share
OpenAI      13                  43.3%
Anthropic   9                   30.0%
Multiple    4                   13.3%
Google      4                   13.3%

How tight is the top 10?

Each entry on Terminal-Bench is reported with a 95% confidence interval. Visualized on a single axis, the top 10 windows overlap heavily — the rank order is real, but the gaps are smaller than they look.

Rank  Agent         Model            Accuracy  95% CI
1     NexAU-AHE     GPT-5.5          84.7%     ±2.1
2     LemonHarness  Multiple         84.5%     ±2.6
3     Capy          GPT-5.5          83.1%     ±2.1
4     Codex CLI     GPT-5.5          82.2%     ±2.2
5     Polaris       Multiple         82.2%     ±2.8
6     WOZCODE       Claude Opus 4.7  80.2%     ±2.1
7     TongAgents    Gemini 3.1 Pro   80.2%     ±2.6
8     LemonHarness  Multiple         79.9%     ±3
9     SageAgent     GPT-5.3-Codex    78.4%     ±2.2
10    Droid         GPT-5.3-Codex    77.3%     ±2.2
The top 10 fall within about 7.4 percentage points (77.3% to 84.7%), and the 95% confidence intervals of neighboring ranks overlap heavily.
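The overlap can be checked directly from the reported accuracies and interval half-widths. This is a rough sketch using the top-10 figures as published; the helper names are illustrative, and overlapping intervals are only an informal signal, not a formal significance test.

```python
# (rank, agent, model, accuracy %, 95% CI half-width in points),
# copied from the top-10 listing.
top10 = [
    (1, "NexAU-AHE", "GPT-5.5", 84.7, 2.1),
    (2, "LemonHarness", "Multiple", 84.5, 2.6),
    (3, "Capy", "GPT-5.5", 83.1, 2.1),
    (4, "Codex CLI", "GPT-5.5", 82.2, 2.2),
    (5, "Polaris", "Multiple", 82.2, 2.8),
    (6, "WOZCODE", "Claude Opus 4.7", 80.2, 2.1),
    (7, "TongAgents", "Gemini 3.1 Pro", 80.2, 2.6),
    (8, "LemonHarness", "Multiple", 79.9, 3.0),
    (9, "SageAgent", "GPT-5.3-Codex", 78.4, 2.2),
    (10, "Droid", "GPT-5.3-Codex", 77.3, 2.2),
]


def interval(row):
    """Return the (low, high) bounds of a row's confidence interval."""
    _, _, _, accuracy, half_width = row
    return accuracy - half_width, accuracy + half_width


def overlaps(a, b):
    """True if the confidence intervals of rows a and b intersect."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Does each rank's interval overlap the next rank's?
adjacent_overlap = [overlaps(top10[i], top10[i + 1]) for i in range(len(top10) - 1)]

# Which lower-ranked entries have intervals that reach the leader's?
rivals_of_first = [row[0] for row in top10[1:] if overlaps(top10[0], row)]

# Gap between rank 1 and rank 10, in percentage points.
spread = round(top10[0][3] - top10[-1][3], 1)

print(all(adjacent_overlap), rivals_of_first, spread)
```

Every pair of neighboring ranks overlaps, and the leader's interval intersects those of six other entries, which is why small differences in rank should be read cautiously.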

Problem catalog · All 89 problems

All 89 tasks span 16 categories and three difficulty tiers. Each task runs in its own Docker container with a human-written oracle solution.

Category               Tasks
software-engineering   26
system-administration  9
scientific-computing   8
security               8
data-science           8
debugging              5
file-operations        5
model-training         4
mathematics            4
data-processing        4
machine-learning       3
games                  1
personal-assistant     1
optimization           1
data-querying          1
video-processing       1

Difficulty  Tasks
easy        4
medium      55
hard        30

Task                        Category              Difficulty  Author
adaptive-rejection-sampler  scientific-computing  medium      jvpoulos
bn-fit-modify               scientific-computing  hard        Gabriel Dreiman
break-filter-js-from-html   security              medium      Nicholas Carlini
build-cython-ext            debugging             medium      Zizhao Chen
build-pmars                 software-engineering  medium      Jeong Shin
build-pov-ray               software-engineering  medium      Jeong Shin

(First 6 of 89 tasks shown.)

Methodology

Metric
Task resolution rate, reported per submission with a 95% confidence interval. Tests verify only the final container state, not agent commands or intermediate steps.
Environment
Fully containerized Docker environment. Each task includes a unique image with relevant packages and files pre-installed, plus a time limit.
Verification
Every task in 2.0 was reviewed for reproducibility, specification quality, and solvability. Tasks that were unreproducible (like the original YouTube-download task) or gated by arbitrary thresholds were removed.
Agents
Submissions pair a backbone model with an agent scaffold (Codex CLI, Terminus 2, Mini-SWE-Agent, Claude Code, and others). Each Agent + Model combination is its own leaderboard row.
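To make the metric concrete, here is one way a resolution rate and a 95% interval could be computed from pass/fail outcomes. This is a sketch under stated assumptions, not the benchmark's documented procedure: it treats trials as independent and uses the normal (Wald) approximation, and the sample of 5 attempts per task with 356 passes is invented purely for illustration.

```python
import math


def resolution_rate(outcomes, z=1.96):
    """Return (rate %, 95% CI half-width in points) for pass/fail outcomes.

    Assumption: independent trials and a normal-approximation interval;
    the page does not say how its intervals are actually computed.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one trial outcome is required")
    p = sum(outcomes) / n  # fraction of trials whose final-state tests passed
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * half_width


# Hypothetical data: 89 tasks x 5 attempts = 445 trials, 356 of them resolved.
outcomes = [True] * 356 + [False] * 89
rate, half_width = resolution_rate(outcomes)
print(f"{rate:.1f}% ± {half_width:.1f}")
```

Under these assumptions the interval narrows with the square root of the number of trials, so repeated attempts per task tighten the reported band.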

From the blog

Terminal-Bench 2.0: Raising the bar for AI agent evaluation

Terminal-Bench 2.0 launches today, marking a major leap in AI agent evaluation. Snorkel AI contributed key research and task design to this release.
November 7, 2025

Acknowledgments

Led by Stanford University and Laude Institute, with contributions from a community of nearly 100 developers. Snorkel AI was one of the top external contributors; its team contributed in three areas: a systematic difficulty assessment applied across all contributed tasks, an extended failure-mode analysis with traces collected from frontier models, and tasks added to the registry.
