Archived

Terminal-Bench 2.0

A benchmark for terminal agents featuring 89 hard, human-verified tasks in containerized environments. Agents are scored by task resolution rate, and the tasks were contributed by a community of nearly 100 developers.

Overview

Terminal-Bench is a joint project between Stanford University and Laude Institute. The original benchmark passed 1,000 GitHub stars and drew contributions from nearly 100 developers worldwide. The 2.0 release then raised the bar with 89 carefully curated tasks designed to keep frontier-model performance under 50%.

Each task runs in a unique Docker container with a human-written oracle solution and tests that verify the final container state. The 2.0 release dropped easier tasks (like the original "Hello World" debugger), eliminated unreproducible items (like the YouTube-download task affected by changing anti-bot protections), and tightened specifications so that near-100% performance is attainable for sufficiently capable agents.

Leaderboard

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy

Inside the leaderboard

All 142 published submissions on Terminal-Bench 2.0 at a glance — how scores distribute, where the median sits, and which providers cluster at the top.

Published entries	142
Median score	44%
Below 30%	28.2%
70% or above	15.5%

Score distribution · count of submissions per 10-point bucket

Task resolution rate (%)	Submissions
0-10	6
10-20	13
20-30	21
30-40	17
40-50	20
50-60	21
60-70	22
70-80	15
80-90	7

The 50–70% band is the densest, with 43 of 142 submissions, but counts are nearly flat from 20% to 70%; only 7 submissions score 80% or above.
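The headline figures can be re-derived from the bucket counts alone. The sketch below is illustrative only; the names `BUCKETS`, `share`, and `median_bucket` are made up for this example and are not part of any Terminal-Bench tooling.

```python
# Submission counts per 10-point bucket, copied from the distribution above.
BUCKETS = {
    (0, 10): 6, (10, 20): 13, (20, 30): 21, (30, 40): 17, (40, 50): 20,
    (50, 60): 21, (60, 70): 22, (70, 80): 15, (80, 90): 7,
}


def share(counts, lo, hi):
    """Fraction of submissions whose bucket lies entirely within [lo, hi)."""
    total = sum(counts.values())
    inside = sum(n for (a, b), n in counts.items() if a >= lo and b <= hi)
    return inside / total


def median_bucket(counts):
    """Return the first bucket at which the running count reaches half the total."""
    total = sum(counts.values())
    running = 0
    for bucket in sorted(counts):
        running += counts[bucket]
        if running * 2 >= total:
            return bucket
    raise ValueError("empty histogram")


total = sum(BUCKETS.values())
below_30 = round(100 * share(BUCKETS, 0, 30), 1)
at_least_70 = round(100 * share(BUCKETS, 70, 100), 1)

print(total)                   # 142
print(below_30)                # 28.2
print(at_least_70)             # 15.5
print(median_bucket(BUCKETS))  # (40, 50)
```

The median bucket of 40-50 is consistent with the reported 44% median score.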

Provider concentration in top 30

Tagged by backbone model, not by agent harness

Provider	Entries in top 30	Share
OpenAI	13	43.3%
Anthropic	9	30%
Multiple	4	13.3%
Google	4	13.3%
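The shares are each provider's count divided by the 30 top entries. A minimal sketch of that arithmetic, using a hypothetical helper named `provider_shares`:

```python
# Top-30 entry counts by backbone-model provider, copied from the figures above.
TOP_30_BY_PROVIDER = {"OpenAI": 13, "Anthropic": 9, "Multiple": 4, "Google": 4}


def provider_shares(counts):
    """Map each provider to its percentage share, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = provider_shares(TOP_30_BY_PROVIDER)
for name, pct in shares.items():
    print(f"{name}: {TOP_30_BY_PROVIDER[name]} / 30 = {pct}%")
```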

How tight is the top 10?

Each entry on Terminal-Bench is reported with a 95% confidence interval. Visualized on a single axis, the top 10 windows overlap heavily — the rank order is real, but the gaps are smaller than they look.

Rank	Agent	Model	Accuracy
1	NexAU-AHE	GPT-5.5	84.7% ±2.1
2	LemonHarness	Multiple	84.5% ±2.6
3	Capy	GPT-5.5	83.1% ±2.1
4	Codex CLI	GPT-5.5	82.2% ±2.2
5	Polaris	Multiple	82.2% ±2.8
6	WOZCODE	Claude Opus 4.7	80.2% ±2.1
7	TongAgents	Gemini 3.1 Pro	80.2% ±2.6
8	LemonHarness	Multiple	79.9% ±3
9	SageAgent	GPT-5.3-Codex	78.4% ±2.2
10	Droid	GPT-5.3-Codex	77.3% ±2.2

The top 10 span about 7 percentage points (77.3% to 84.7%), and every pair of adjacent ranks has overlapping 95% confidence intervals.
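One way to make the overlap concrete is to treat each entry as the interval accuracy ± half-width and test pairs for intersection. This is a rough heuristic sketch, not a formal significance test, and `intervals_overlap` is a hypothetical helper rather than anything the leaderboard provides.

```python
from itertools import combinations

# (rank, agent, accuracy %, 95% CI half-width), copied from the top-10 table.
TOP_10 = [
    (1, "NexAU-AHE", 84.7, 2.1),
    (2, "LemonHarness", 84.5, 2.6),
    (3, "Capy", 83.1, 2.1),
    (4, "Codex CLI", 82.2, 2.2),
    (5, "Polaris", 82.2, 2.8),
    (6, "WOZCODE", 80.2, 2.1),
    (7, "TongAgents", 80.2, 2.6),
    (8, "LemonHarness", 79.9, 3.0),
    (9, "SageAgent", 78.4, 2.2),
    (10, "Droid", 77.3, 2.2),
]


def intervals_overlap(a, b):
    """True when the two [accuracy - hw, accuracy + hw] intervals intersect."""
    _, _, acc_a, hw_a = a
    _, _, acc_b, hw_b = b
    return abs(acc_a - acc_b) <= hw_a + hw_b


# Do neighbouring ranks always overlap?
adjacent_all_overlap = all(
    intervals_overlap(x, y) for x, y in zip(TOP_10, TOP_10[1:])
)

# Which pairs of ranks are separated by more than their combined half-widths?
separated = [
    (a[0], b[0]) for a, b in combinations(TOP_10, 2)
    if not intervals_overlap(a, b)
]

print(adjacent_all_overlap)  # True
print(separated)
```

Under this check only 8 of the 45 pairs are clearly separated, all of them pairing a rank in the top four with rank 6, 9, or 10.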

Problem catalog · All 89 problems

All 89 tasks span 16 categories and three difficulty tiers. Each task runs in its own Docker container with a human-written oracle solution.

Category	Tasks
software-engineering	26
system-administration	9
scientific-computing	8
security	8
data-science	8
debugging	5
file-operations	5
model-training	4
mathematics	4
data-processing	4
machine-learning	3
games	1
personal-assistant	1
optimization	1
data-querying	1
video-processing	1

Difficulty	Tasks
easy	4
medium	55
hard	30

Example tasks: adaptive-rejection-sampler, break-filter-js-from-html.

Methodology

Metric

Task resolution rate, reported per submission with a 95% confidence interval. Tests verify the final container state only, not agent commands or intermediate steps.
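This page does not say how the confidence interval is computed. As a sketch under the assumption of independent pass/fail trials and a normal-approximation (Wald) interval, a resolution rate and its 95% half-width could be computed as follows. The function name and the trial counts are hypothetical.

```python
import math


def resolution_rate_with_ci(outcomes, z=1.96):
    """Return (rate, half_width) for a list of True/False trial outcomes.

    Assumes independent trials and uses the normal approximation
    rate +/- z * sqrt(rate * (1 - rate) / n); z=1.96 gives a 95% interval.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one trial")
    rate = sum(outcomes) / n
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return rate, half_width


# Hypothetical run: 445 trials, of which 356 were resolved.
outcomes = [True] * 356 + [False] * 89
rate, half_width = resolution_rate_with_ci(outcomes)
print(f"{100 * rate:.1f}% ±{100 * half_width:.1f}")  # 80.0% ±3.7
```

The normal approximation becomes unreliable when the rate is near 0% or 100% or when there are few trials; a Wilson interval is a common alternative in those cases.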

Environment

Fully containerized Docker environment. Each task includes a unique image with relevant packages and files pre-installed, plus a time limit.
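Because tests look only at the final container state, a task check can be written as a pure function of the filesystem. The sketch below illustrates the idea with an invented task (leave `result.txt` containing `42`) and a temporary directory standing in for the container; it is not taken from the benchmark's actual test suite.

```python
import tempfile
from pathlib import Path


def check_final_state(root: Path) -> bool:
    """Pass if the expected artifact exists with the expected content.

    Hypothetical task: leave `result.txt` containing the single line `42`.
    How the file got there is never inspected.
    """
    target = root / "result.txt"
    return target.is_file() and target.read_text().strip() == "42"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = check_final_state(root)          # nothing written yet
    (root / "result.txt").write_text("42\n")  # stand-in for the agent's work
    after = check_final_state(root)

print(before, after)  # False True
```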

Verification

Every task in 2.0 was reviewed for reproducibility, specification quality, and solvability. Tasks that were unreproducible (like the original YouTube-download task) or gated by arbitrary thresholds were removed.

Agents

Submissions pair a backbone model with an agent scaffold (Codex CLI, Terminus 2, Mini-SWE-Agent, Claude Code, and others). Each Agent + Model combination is its own leaderboard row.

From the blog

Data development

Research

Terminal-Bench 2.0: Raising the bar for AI agent evaluation

Terminal-Bench 2.0 launches today, marking a major leap in AI agent evaluation. Snorkel AI contributed key research and task design to this release.

Kobie Crawford

November 7, 2025

Resources

GitHub

Website

Blog

Acknowledgments

Led by Stanford University and Laude Institute, with contributions from a community of nearly 100 developers including Snorkel AI as one of the top external contributors. Snorkel's team contributed in three areas: a systematic difficulty assessment applied across all contributed tasks, an extended failure-mode analysis with traces collected from frontier models, and tasks added to the registry.
