Terminal-Bench 2.0: Raising the bar for AI agent evaluation

Terminal-Bench, a joint project between Stanford University and Laude Institute, has quickly become the gold standard for evaluating AI coding agents’ capabilities in command-line environments. Since its launch earlier this year, the benchmark has garnered over 1,000 GitHub stars and attracted contributions from nearly 100 developers worldwide. At Snorkel AI, we’re proud to be one of the top external contributors to this project, and we’re thrilled to see the release of Terminal-Bench 2.0: a significant leap forward in both difficulty and quality.
Why Terminal-Bench 2.0 matters
From the beginning, Terminal-Bench was designed as a living benchmark that would evolve alongside AI capabilities. As frontier models improved, with performance climbing over 50% on the original benchmark, the team knew it was time to raise the stakes. Terminal-Bench 2.0 delivers on this vision with 89 carefully curated tasks that push the boundaries of what AI agents can accomplish in terminal environments. Each task underwent rigorous verification, with the development team meticulously reviewing every challenge to ensure it meets the highest standards.
What’s new in 2.0
Increased difficulty: Terminal-Bench 2.0 is substantially more challenging than its predecessor. Tasks now better represent the frontier challenges that distinguish truly capable agents from those that can only handle routine operations. This keeps the benchmark relevant as AI capabilities advance, holding state-of-the-art scores below roughly 50% so there is clear headroom for improvement while the benchmark still delivers meaningful signal for evaluation.
Enhanced verification: One of the most significant improvements in 2.0 is the dramatic increase in task quality and reproducibility. The original benchmark included several problematic tasks—some were unsolvable for artificial reasons, others set arbitrary thresholds, and a few lacked robustness (like the YouTube download task affected by changing anti-bot protections). Terminal-Bench 2.0 eliminates these problems. Every task is now reproducible, properly specified, and genuinely solvable, with the team confident that near-100% performance is attainable for sufficiently capable agents.
Real-world impact: Terminal-Bench 2.0 removes tasks that don’t reflect valuable real-world work. The easy “Hello World” debugging task is gone, along with other challenges that, while academically interesting, provide little signal about an agent’s economic impact or its ability to do the kind of work engineers actually get paid for.
Snorkel AI’s contributions
Snorkel’s researchers joined other contributors to Terminal-Bench to help the project achieve its goal of delivering a thoroughly vetted, challenging dataset. The team’s main contributions centered on three areas:
Reliable difficulty assessment: Our research team developed a systematic rubric for assessing the difficulty of Terminal-Bench 2.0 tasks, providing consistent criteria that could be applied across all contributed tasks.
Extended failure mode analysis: We developed a failure taxonomy and collected agent traces to better understand where LLMs fail when executing Terminal-Bench tasks (a simplified sketch of this kind of annotation appears after this list). These insights support better task design and point to concrete ways to improve the agents and models under test.
Tasks: The Snorkel team is a top contributor to the Terminal-Bench registry, and we’re pleased to have contributed tasks to the Terminal-Bench 2.0 dataset.
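To make the difficulty and failure-mode work above concrete, here is a minimal sketch of how such annotations might be represented in code. The category names, difficulty scale, and fields are simplified assumptions for illustration, not the exact schema our team used.

```python
# Illustrative sketch only: the categories and fields below are simplified
# stand-ins, not the exact schema used in our Terminal-Bench contributions.
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    """Coarse failure categories an annotator might assign to an agent trace."""
    MISREAD_TASK = "misread_task"            # agent misunderstood the instructions
    WRONG_TOOL_USE = "wrong_tool_use"        # incorrect command or flags
    ENVIRONMENT_SETUP = "environment_setup"  # missing dependency, bad path, etc.
    PREMATURE_EXIT = "premature_exit"        # agent stopped before finishing
    VERIFIER_MISMATCH = "verifier_mismatch"  # plausible solution fails the checks


@dataclass
class TraceAnnotation:
    """One reviewed agent trace, labeled with difficulty and failure mode."""
    task_id: str
    difficulty: int                    # e.g., 1 (routine) to 5 (frontier)
    failure_mode: FailureMode | None   # None if the agent solved the task
    notes: str = ""


def summarize(annotations: list[TraceAnnotation]) -> dict[str, int]:
    """Count how often each failure mode appears across reviewed traces."""
    counts: dict[str, int] = {}
    for ann in annotations:
        if ann.failure_mode is not None:
            key = ann.failure_mode.value
            counts[key] = counts.get(key, 0) + 1
    return counts
```

A summary like this makes it easy to see which failure modes dominate across a model's traces and which tasks deserve a closer look.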
The next level: Introducing Harbor
In addition to Terminal-Bench 2.0, the team has just announced an exciting new project named Harbor. Harbor represents a significant evolution in how developers can scale up containerized AI agent environments. Born from observing how the community actually used Terminal-Bench, Harbor abstracts away the work of orchestrating container-based rollouts, a problem that seems simple but quickly grows complicated at scale.
The framework emerged after the Terminal-Bench team noticed users deploying the benchmark in unexpected ways, for example as CI/CD tests for agents, for reinforcement learning with synthetic tasks, and for prompt optimization. All of these use cases share the same underlying abstraction: containerized environments performing rollouts that return tokens and rewards (sketched below). Harbor makes this pattern accessible with minimal code, allowing developers to scale from local testing to thousands of parallel containers across neocloud providers like Daytona, E2B, and Modal, or on self-managed Kubernetes clusters.
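The sketch below illustrates that abstraction in plain Python: each task runs as a rollout in an isolated environment and comes back as tokens plus a reward. This is not Harbor's API; the names (Rollout, run_in_container, run_rollouts) are hypothetical, and the container lifecycle is stubbed out.

```python
# Conceptual sketch of the containerized-rollout pattern, not Harbor's actual API.
# All names here are hypothetical.
import concurrent.futures
from dataclasses import dataclass


@dataclass
class Rollout:
    """Result of one agent episode inside a container."""
    task_id: str
    tokens: list[str]  # the agent's emitted output, tokenized
    reward: float      # score assigned by the task's verifier


def run_in_container(task_id: str) -> Rollout:
    """Placeholder for the real work: start an isolated container, let the
    agent act in it, run the task's verifier, then tear the container down."""
    # Real implementation elided; return a stub result so the sketch runs.
    return Rollout(task_id=task_id, tokens=[], reward=0.0)


def run_rollouts(task_ids: list[str], max_workers: int = 8) -> list[Rollout]:
    """Fan rollouts out across parallel workers and collect the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_in_container, task_ids))
```

Harbor's value lies in the parts this sketch elides: provisioning the containers, fanning rollouts out from a local machine to hosted providers or Kubernetes, and collecting the resulting tokens and rewards at scale.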
Better together: the benchmark and the framework
Terminal-Bench 2.0 represents more than just a version update—it’s a commitment to maintaining the highest quality evaluation infrastructure as AI agent capabilities increase. By prioritizing rigorous verification and meaningful difficulty over arbitrary metrics, the benchmark ensures that progress on Terminal-Bench translates to genuine improvements in AI agents’ ability to perform complex, real-world tasks.
And now, with Harbor, practitioners can parallelize the execution of entire datasets for faster iteration and greater efficiency. For teams building the next generation of AI coding assistants, Terminal-Bench 2.0 and Harbor provide an evaluation framework with the necessary robustness and reproducibility to measure true progress in this rapidly evolving space.
At Snorkel, we enthusiastically support the Terminal-Bench community and are confident that reproducible, containerized environments will accelerate the development of accurate and reliable agentic AI systems. If you need to build or improve your agents with expert-verified data in an RL environment, come talk to us!
Terminal-Bench is an open-source project led by Stanford University and Laude Institute, with contributions from a vibrant community of individuals and organizations, including Snorkel AI. To learn more about the benchmark, contribute tasks, or evaluate your own agents, visit tbench.ai or join the project’s Discord community. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.
Kobie Crawford is a Developer Advocate at Snorkel AI, with a focus on engaging AI research and development communities. He comes to Snorkel after a successful journey with MosaicML and Databricks, the latter acquiring the former in 2023.