Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel's Role in Building the Next-Generation Benchmark

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold-standard benchmark for evaluating AI agent capabilities in a command-line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we're excited to share that we're one of the top collaborators on the upcoming Terminal-Bench 2.0, bringing the expertise we've gained building our own benchmarks to help Terminal-Bench continue to challenge the latest agents powered by the best frontier models.
Why Terminal-Bench Matters for the Future of Coding Assistants
The terminal has quietly become the backbone of modern coding agent interactions. Whether it’s Claude Code, Codex, Cursor, Devin, or any of the other great tools out there, today’s most powerful coding assistants increasingly rely on command-line interfaces to perform complex tasks. This isn’t coincidental—the terminal represents a perfect convergence of power, flexibility, and the text-based modality where language models excel.
Through the power and concise syntax of CLI applications, terminal environments offer AI agents unprecedented control over computing resources, from launching cloud instances to managing complex data pipelines. But, as any sudoer knows, with great power comes great responsibility. Commands like rm -rf ~ serve as stark reminders that we need robust benchmarks to understand the limits and capabilities of terminal-based agents before deploying them in production environments.
Advancing the Frontier with Complex Environments
Terminal-Bench is composed of a collection of hand-crafted and human-verified tasks for agents in the terminal, and a framework for reliable, repeatable execution of each task. Each task comes with a dedicated Docker environment, human-verified solution, and set of test cases to check the agent’s solution. The benchmark covers diverse scenarios including scientific workflows, network configuration, cybersecurity vulnerabilities, and data analysis pipelines.
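To make this structure concrete, the sketch below shows what a task's test suite could look like in pytest. The file paths, expected contents, and health endpoint are hypothetical illustrations of the idea, not Terminal-Bench's actual harness or file layout.

# Hypothetical sketch of a task's test cases, loosely following the structure
# described above (Docker environment + human-verified solution + tests).
# Paths, filenames, and the health endpoint are illustrative assumptions,
# not Terminal-Bench's actual layout or API.
import subprocess
from pathlib import Path

OUTPUT_FILE = Path("/app/results/summary.csv")  # artifact the task asks the agent to produce

def test_output_file_exists():
    # The agent should have produced this file inside the task's container.
    assert OUTPUT_FILE.exists(), "agent did not produce the expected output file"

def test_output_contents():
    # Verify the pipeline wrote a header plus at least one data row.
    lines = OUTPUT_FILE.read_text().strip().splitlines()
    assert lines[0] == "date,total", "unexpected CSV header"
    assert len(lines) > 1, "no data rows were written"

def test_service_responds():
    # For tasks that stand up a service, a simple liveness check via curl.
    result = subprocess.run(["curl", "-sf", "http://localhost:8080/health"],
                            capture_output=True, text=True)
    assert result.returncode == 0, "service is not responding on the expected port"

Because every task ships with checks like these and runs inside its own container, an agent's work can be graded automatically and reproducibly rather than by manual inspection.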
The benchmark is challenging, even for the most advanced agents and models: OpenAI's Codex, powered by the gpt-5-codex model, has a verified score of 42.8%. Its tasks expose significant limitations in chaining multiple terminal commands together, reasoning over long command outputs, and executing tasks safely within sensible limits. These results highlight both the benchmark's rigorous standards and the substantial room for improvement in current agent capabilities.
What makes Terminal-Bench particularly valuable is its focus on real-world complexity. Unlike traditional coding benchmarks that test isolated functions or algorithms, Terminal-Bench evaluates agents on complete, end-to-end tasks that mirror the challenges faced by actual software engineers and system administrators. These are the kinds of tasks that would take experienced developers hours or even days to solve, ranging from compiling exotic code from source and fixing broken Python environments to implementing end-to-end data processing pipelines.
Terminal-Bench’s Growing Industry Recognition
The benchmark's influence has grown rapidly since its launch. Terminal-Bench has garnered 800 stars on GitHub and attracted contributions from nearly 100 developers, with discussions revealing appreciation for the benchmark's emphasis on real-world, practical scenarios rather than isolated code snippets. More importantly, it's being cited as one of the most important benchmarks for AI coding assistants across the industry; for example, Terminal-Bench scores are now reported on the model cards for DeepSeek-V3.1-Terminus and Qwen3-Coder, and are included in the Claude Sonnet 4.5 release announcement.
Snorkel’s Role in Terminal-Bench 2.0 Development
As a leader in providing expert-verified datasets to frontier AI labs, Snorkel brings unique capabilities to the Terminal-Bench 2.0 development process. Our involvement extends beyond simply contributing tasks: we're working closely with the Terminal-Bench team to explore how best to calibrate the difficulty of contributed tasks and to provide more sophisticated analysis of model performance.
At Snorkel, we’re proud to contribute to this effort. The benchmark’s emphasis on complete task execution rather than isolated code snippets, combined with its focus on system architecture, dependency management, and environment configuration, captures skills that separate experienced engineers from junior developers. As Terminal-Bench 2.0 expands its suite of tasks and evaluation techniques, it will provide an increasingly comprehensive assessment of agentic capabilities.
Terminal-Bench is an open-source project led by Stanford University and Laude Institute in collaboration with external contributors including Snorkel AI. To learn more about the benchmark, contribute tasks, or evaluate your own agents, visit tbench.ai or join the project’s Discord community. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.
Kobie Crawford is a Developer Advocate at Snorkel AI, with a focus on engaging AI research and development communities. He comes to Snorkel after a successful journey with MosaicML and Databricks, which acquired MosaicML in 2023.
Jeong Shin completed her tenure as a Research Intern at Snorkel AI in September 2025; her internship focused on building agentic evaluations. Before Snorkel, Jeong completed a master's degree at Stanford with a focus on computer science.
Tom Walshe is a Staff Research Scientist at Snorkel AI. Before Snorkel, Tom worked in LegalTech and financial services, where he focused on building end-to-end AI systems and researching data-centric AI. Prior to industry, Tom completed a PhD in Computer Science at the University of Oxford.