Terminal-Bench 2.0
Terminal-Bench is a joint project between Stanford University and Laude Institute. The original benchmark passed 1,000 GitHub stars and drew contributions from nearly 100 developers worldwide. The 2.0 release then raised the bar with 89 carefully curated tasks designed to keep frontier-model performance under the 50% ceiling.
Each task runs in a unique Docker container with a human-written oracle solution and tests that verify the final container state. The 2.0 release dropped easier tasks (like the original "Hello World" debugger), eliminated unreproducible items (like the YouTube-download task affected by changing anti-bot protections), and tightened specifications so that near-100% performance is attainable for sufficiently capable agents.
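To make the idea of outcome-based verification concrete, here is a minimal sketch of a check that inspects only the final state of a filesystem, not the steps an agent took to get there. It is not the Terminal-Bench test harness: the function names, the `output/result.txt` path, and the expected content `42` are all hypothetical, and a temporary directory stands in for the container.

```python
import tempfile
from pathlib import Path


def verify_final_state(root: Path) -> list[str]:
    """Return failure messages; an empty list means the end state is accepted.

    The path and expected content are hypothetical placeholders.
    """
    failures = []
    target = root / "output" / "result.txt"
    if not target.is_file():
        failures.append(f"missing file: {target}")
    elif target.read_text().strip() != "42":
        failures.append(f"unexpected content in {target}")
    return failures


def oracle_solution(root: Path) -> None:
    """Stand-in for a human-written oracle: produce the expected end state."""
    out_dir = root / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.txt").write_text("42\n")


if __name__ == "__main__":
    # A temporary directory plays the role of the container filesystem.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(verify_final_state(root))  # non-empty: nothing has been done yet
        oracle_solution(root)
        print(verify_final_state(root))  # empty: oracle reaches the target state
```

Running the oracle against the same check is one way to confirm that a task is solvable and that its tests accept a known-good end state.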
Leaderboard
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
Inside the leaderboard
All 142 published submissions on Terminal-Bench 2.0 at a glance — how scores distribute, where the median sits, and which providers cluster at the top.
How tight is the top 10?
Each entry on Terminal-Bench is reported with a 95% confidence interval. Visualized on a single axis, the top 10 windows overlap heavily — the rank order is real, but the gaps are smaller than they look.
| Rank | Agent | Model | Accuracy | 95% CI |
|---|---|---|---|---|
| 1 | NexAU-AHE | GPT-5.5 | 84.7% | ±2.1 |
| 2 | LemonHarness | Multiple | 84.5% | ±2.6 |
| 3 | Capy | GPT-5.5 | 83.1% | ±2.1 |
| 4 | Codex CLI | GPT-5.5 | 82.2% | ±2.2 |
| 5 | Polaris | Multiple | 82.2% | ±2.8 |
| 6 | WOZCODE | Claude Opus 4.7 | 80.2% | ±2.1 |
| 7 | TongAgents | Gemini 3.1 Pro | 80.2% | ±2.6 |
| 8 | LemonHarness | Multiple | 79.9% | ±3 |
| 9 | SageAgent | GPT-5.3-Codex | 78.4% | ±2.2 |
| 10 | Droid | GPT-5.3-Codex | 77.3% | ±2.2 |
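The overlap claim can be checked with simple arithmetic on the top-10 figures listed above. The sketch below assumes each reported ± value is a symmetric half-width in percentage points around the accuracy; the page does not say how the intervals were computed, and overlapping intervals are only a rough visual heuristic, not a formal significance test. The function names are illustrative.

```python
# Top-10 rows copied from the figures above:
# (rank, agent, model, accuracy in %, CI half-width in percentage points).
TOP10 = [
    (1, "NexAU-AHE", "GPT-5.5", 84.7, 2.1),
    (2, "LemonHarness", "Multiple", 84.5, 2.6),
    (3, "Capy", "GPT-5.5", 83.1, 2.1),
    (4, "Codex CLI", "GPT-5.5", 82.2, 2.2),
    (5, "Polaris", "Multiple", 82.2, 2.8),
    (6, "WOZCODE", "Claude Opus 4.7", 80.2, 2.1),
    (7, "TongAgents", "Gemini 3.1 Pro", 80.2, 2.6),
    (8, "LemonHarness", "Multiple", 79.9, 3.0),
    (9, "SageAgent", "GPT-5.3-Codex", 78.4, 2.2),
    (10, "Droid", "GPT-5.3-Codex", 77.3, 2.2),
]


def interval(row):
    """Return (low, high) assuming a symmetric interval around the accuracy."""
    _, _, _, accuracy, half_width = row
    return accuracy - half_width, accuracy + half_width


def overlaps(a, b):
    """True if the two rows' intervals share at least one point."""
    low_a, high_a = interval(a)
    low_b, high_b = interval(b)
    return low_a <= high_b and low_b <= high_a


def ranks_overlapping_leader(rows):
    """Ranks whose interval overlaps the interval of the first row."""
    leader = rows[0]
    return [row[0] for row in rows[1:] if overlaps(leader, row)]


if __name__ == "__main__":
    print(ranks_overlapping_leader(TOP10))  # [2, 3, 4, 5, 7, 8]
```

Under these assumptions, six of the nine other entries have intervals that overlap the leader's, while ranks 6, 9, and 10 fall just outside it.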
Problem catalog · All 89 problems
Methodology
Submissions pair a backbone model with an agent scaffold (Codex CLI, Terminus 2, Mini-SWE-Agent, Claude Code, and others). Each Agent + Model combination is its own leaderboard row.
Acknowledgments
Terminal-Bench is led by Stanford University and Laude Institute, with contributions from a community of nearly 100 developers, including Snorkel AI as one of the top external contributors. Snorkel's team contributed in three areas: a systematic difficulty assessment applied across all contributed tasks, an extended failure-mode analysis with traces collected from frontier models, and tasks added to the registry.