
Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

September 30, 2025

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command-line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing to the development of the upcoming Terminal-Bench 2.0. We bring the expertise gained through developing our own benchmarks to help Terminal-Bench continue to challenge the latest agents powered by the best frontier models.

Why Terminal-Bench Matters for the Future of Coding Assistants

The terminal has quietly become the backbone of modern coding agent interactions. Whether it’s Claude Code, Codex, Cursor, Devin, or any of the other great tools out there, today’s most powerful coding assistants increasingly rely on command-line interfaces to perform complex tasks. This isn’t coincidental: the terminal represents a perfect convergence of power, flexibility, and the text-based modality where language models excel.

Through the power and concise syntax of CLI applications, terminal environments offer AI agents unprecedented control over computing resources, from launching cloud instances to managing complex data pipelines. But, as any sudoer knows, with great power comes great responsibility. Commands like rm -rf ~ serve as stark reminders that we need robust benchmarks to understand the limits and capabilities of terminal-based agents before deploying them in production environments.

Advancing the Frontier with Complex Environments

Terminal-Bench consists of a collection of hand-crafted, human-verified tasks for agents in the terminal, and a framework for reliable, repeatable execution of each task. Each task comes with a dedicated Docker environment, a human-verified solution, and a set of test cases to check the agent’s solution. The benchmark covers diverse scenarios including scientific workflows, network configuration, cybersecurity vulnerabilities, and data analysis pipelines.
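
To make the task anatomy described above concrete, here is a minimal Python sketch. It is not the Terminal-Bench schema or harness: the Task fields, the in-memory dictionary standing in for a Docker container, the is_resolved and resolved_rate helpers, and the image name are all hypothetical, and the "all tests must pass" scoring rule is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a container: maps file paths to file contents.
FileSystem = Dict[str, str]
Action = Callable[[FileSystem], None]   # something that acts on the environment
Check = Callable[[FileSystem], bool]    # one test case over the final state


@dataclass
class Task:
    """Illustrative task record; field names are assumptions, not a real schema."""
    task_id: str
    docker_image: str            # environment the task would run in (unused here)
    initial_files: FileSystem    # starting state of the environment
    reference_solution: Action   # the human-verified solution
    tests: List[Check]           # test cases that check the final state


def is_resolved(task: Task, agent: Action) -> bool:
    """Run the agent on a fresh copy of the environment and apply every test."""
    fs = dict(task.initial_files)  # fresh copy so each run is repeatable
    agent(fs)
    return all(check(fs) for check in task.tests)


def resolved_rate(tasks: List[Task], agent: Action) -> float:
    """Fraction of tasks whose tests all pass (assumed scoring rule)."""
    if not tasks:
        return 0.0
    return sum(is_resolved(task, agent) for task in tasks) / len(tasks)


# A toy task: repair a config file whose port value is not a number.
def fix_port(fs: FileSystem) -> None:
    fs["app.conf"] = "port=8080\n"


def port_is_numeric(fs: FileSystem) -> bool:
    key, _, value = fs.get("app.conf", "").strip().partition("=")
    return key == "port" and value.isdigit()


toy_task = Task(
    task_id="fix-config",
    docker_image="example/ubuntu:latest",  # hypothetical image name
    initial_files={"app.conf": "port=80a\n"},
    reference_solution=fix_port,
    tests=[port_is_numeric],
)


def do_nothing(fs: FileSystem) -> None:
    """An agent that takes no action, used as a baseline."""


if __name__ == "__main__":
    print(is_resolved(toy_task, toy_task.reference_solution))
    print(is_resolved(toy_task, do_nothing))
```

Running the reference solution through the same checks as the agent mirrors the human-verification idea: a task is only meaningful if its own solution passes its tests, while an agent that does nothing should fail them.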

The benchmark is challenging, even for the most advanced agents and models: OpenAI’s Codex, powered by the gpt-5-codex model, has a verified score of 42.8%. The results reveal significant limitations in chaining multiple terminal commands together, reasoning over long command outputs, and executing tasks safely within sensible limits. They highlight both the benchmark’s rigorous standards and the substantial room for improvement in current agent capabilities.

What makes Terminal-Bench particularly valuable is its focus on real-world complexity. Unlike traditional coding benchmarks that test isolated functions or algorithms, Terminal-Bench evaluates agents on complete, end-to-end tasks that mirror the challenges faced by actual software engineers and system administrators. These are the kinds of tasks that experienced developers would take hours or days to solve. They range from compiling exotic code from source and fixing broken Python environments to implementing end-to-end data processing pipelines.

Terminal-Bench’s Growing Industry Recognition

The benchmark’s influence has grown rapidly since its launch. Terminal-Bench has garnered 800 stars on GitHub and attracted contributions from nearly 100 developers, with discussions revealing appreciation for the benchmark’s emphasis on real-world, practical scenarios rather than isolated code snippets. More importantly, it’s being cited as one of the most important benchmarks for AI coding assistants across the industry. For example, Terminal-Bench scores are now reported on the model cards for DeepSeek-V3.1-Terminus and Qwen3-Coder, and are included in the Claude Sonnet 4.5 release announcement.

Snorkel’s Role in Terminal-Bench 2.0 Development

As a leader in providing expert-verified datasets to frontier AI labs, Snorkel brings unique capabilities to the Terminal-Bench 2.0 development process. Our involvement with the benchmark extends beyond simply contributing tasks. We’re working closely with the Terminal-Bench team to explore how best to calibrate the difficulty of contributed tasks and how to provide more sophisticated analysis of model performance.

At Snorkel, we’re proud to contribute to this effort. The benchmark’s emphasis on complete task execution rather than isolated code snippets, combined with its focus on system architecture, dependency management, and environment configuration, captures skills that separate experienced engineers from junior developers. As Terminal-Bench 2.0 expands its suite of tasks and evaluation techniques, it will provide an increasingly comprehensive assessment of agentic capabilities.


Terminal-Bench is an open-source project led by Stanford University and Laude Institute in collaboration with external contributors including Snorkel AI. To learn more about the benchmark, contribute tasks, or evaluate your own agents, visit tbench.ai or join the project’s Discord community. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.

Jeong Shin
Research Engineer (Intern)

Jeong Shin completed her tenure as Research Intern at Snorkel AI in September 2025; her internship focused on building agentic evaluations. Before Snorkel, Jeong completed a master’s degree with a focus on computer science at Stanford.

Tom Walshe
Staff Research Scientist

Tom Walshe is a Staff Research Scientist at Snorkel AI. Before Snorkel, Tom worked in LegalTech and financial services, where he focussed on building end-to-end AI systems and researching data-centric AI. Prior to industry, Tom completed a PhD in Computer Science at the University of Oxford.
