Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel's Role in Building the Next-Generation Benchmark

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold-standard benchmark for evaluating AI agent capabilities in a command-line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we're excited to share that we're one of the top collaborators on the upcoming Terminal-Bench 2.0, bringing the expertise we've gained building our own benchmarks to help Terminal-Bench continue to challenge the latest agents powered by the best frontier models.
Why Terminal-Bench Matters for the Future of Coding Assistants
The terminal has quietly become the backbone of modern coding agent interactions. Whether it’s Claude Code, Codex, Cursor, Devin, or any of the other great tools out there, today’s most powerful coding assistants increasingly rely on command-line interfaces to perform complex tasks. This isn’t coincidental—the terminal represents a perfect convergence of power, flexibility, and the text-based modality where language models excel.
Through the power and concise syntax of CLI applications, terminal environments offer AI agents unprecedented control over computing resources, from launching cloud instances to managing complex data pipelines. But, as any sudoer knows, with great power comes great responsibility. Commands like rm -rf ~ serve as stark reminders that we need robust benchmarks to understand the limits and capabilities of terminal-based agents before deploying them in production environments.
Advancing the Frontier with Complex Environments
Terminal-Bench is composed of a collection of hand-crafted and human-verified tasks for agents in the terminal, and a framework for reliable, repeatable execution of each task. Each task comes with a dedicated Docker environment, human-verified solution, and set of test cases to check the agent’s solution. The benchmark covers diverse scenarios including scientific workflows, network configuration, cybersecurity vulnerabilities, and data analysis pipelines.
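To make this structure concrete, the sketch below shows what a task's test suite could look like in pytest. The file paths, expected contents, and health endpoint are hypothetical illustrations of the idea, not Terminal-Bench's actual harness or file layout.

# Hypothetical sketch of a task's test cases, loosely following the structure
# described above (Docker environment + human-verified solution + tests).
# Paths, filenames, and the health endpoint are illustrative assumptions,
# not Terminal-Bench's actual layout or API.
import subprocess
from pathlib import Path

OUTPUT_FILE = Path("/app/results/summary.csv")  # artifact the task asks the agent to produce

def test_output_file_exists():
    # The agent should have produced this file inside the task's container.
    assert OUTPUT_FILE.exists(), "agent did not produce the expected output file"

def test_output_contents():
    # Verify the pipeline wrote a header plus at least one data row.
    lines = OUTPUT_FILE.read_text().strip().splitlines()
    assert lines[0] == "date,total", "unexpected CSV header"
    assert len(lines) > 1, "no data rows were written"

def test_service_responds():
    # For tasks that stand up a service, a simple liveness check via curl.
    result = subprocess.run(["curl", "-sf", "http://localhost:8080/health"],
                            capture_output=True, text=True)
    assert result.returncode == 0, "service is not responding on the expected port"

Because every task ships with checks like these and runs inside its own container, an agent's work can be graded automatically and reproducibly rather than by manual inspection.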
The benchmark is challenging, even for the most advanced agents and models: OpenAI's Codex, powered by the gpt-5-codex model, has a verified score of 42.8%. Its tasks expose significant limitations in chaining multiple terminal commands together, reasoning over long command outputs, and executing tasks safely within sensible limits. These results highlight both the benchmark's rigorous standards and the substantial room for improvement in current agent capabilities.
What makes Terminal-Bench particularly valuable is its focus on real-world complexity. Unlike traditional coding benchmarks that test isolated functions or algorithms, Terminal-Bench evaluates agents on complete, end-to-end tasks that mirror the challenges faced by actual software engineers and system administrators. These are the kinds of tasks that would take experienced developers hours or even days to solve, ranging from compiling exotic code from source and fixing broken Python environments to implementing end-to-end data processing pipelines.
Terminal-Bench’s Growing Industry Recognition
The benchmark's influence has grown rapidly since its launch. Terminal-Bench has garnered 800 stars on GitHub and attracted contributions from nearly 100 developers, with discussions revealing appreciation for the benchmark's emphasis on real-world, practical scenarios rather than isolated code snippets. More importantly, it's being cited as one of the most important benchmarks for AI coding assistants across the industry; for example, Terminal-Bench scores are now reported on the model cards for DeepSeek-V3.1-Terminus and Qwen3-Coder, and are included in the Claude Sonnet 4.5 release announcement.
Snorkel’s Role in Terminal-Bench 2.0 Development
As a leader in providing expert-verified datasets to frontier AI labs, Snorkel brings unique capabilities to the Terminal-Bench 2.0 development process. Our involvement extends beyond simply contributing tasks: we're working closely with the Terminal-Bench team to explore how best to calibrate the difficulty of contributed tasks and to provide more sophisticated analysis of model performance.
At Snorkel, we’re proud to contribute to this effort. The benchmark’s emphasis on complete task execution rather than isolated code snippets, combined with its focus on system architecture, dependency management, and environment configuration, captures skills that separate experienced engineers from junior developers. As Terminal-Bench 2.0 expands its suite of tasks and evaluation techniques, it will provide an increasingly comprehensive assessment of agentic capabilities.
Terminal-Bench is an open-source project led by Stanford University and Laude Institute in collaboration with external contributors including Snorkel AI. To learn more about the benchmark, contribute tasks, or evaluate your own agents, visit tbench.ai or join the project’s Discord community. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.
Kobie Crawford is a Developer Advocate at Snorkel AI, with a focus on engaging AI research and development communities. He comes to Snorkel after a successful journey with MosaicML and Databricks, which acquired MosaicML in 2023.
Jeong Shin completed her tenure as a Research Intern at Snorkel AI in September 2025; her internship focused on building agentic evaluations. Before Snorkel, Jeong completed a master's degree at Stanford with a focus on computer science.
Tom Walshe is a Staff Research Scientist at Snorkel AI. Before Snorkel, Tom worked in LegalTech and financial services, where he focused on building end-to-end AI systems and researching data-centric AI. Prior to industry, Tom completed a PhD in Computer Science at the University of Oxford.