Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Overview

The Snorkel Agentic Coding benchmark comprises 100 multi-step coding tasks, evenly distributed across four difficulty tiers, designed to evaluate models across a diverse range of capabilities germane to real-world software engineering work.

Drawing on insights from our contributions to the Terminal-Bench project, our Agentic Coding tasks evaluate agents in fully sandboxed execution environments. Each task is paired with a human-validated reference solution, comprehensive unit tests, and scoring rubrics that assess both final outputs and the agent's trajectory. The current version of the benchmark spans a wide range of task categories, from typical software engineering tasks to advanced ML and data analytics, as well as build and dependency management. It tests agents on long-horizon planning, tracking tasks, evaluating and executing their own solutions, and recovering from errors and incorrect previous steps.
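
To make this concrete, the sketch below shows one way a sandboxed verifier could combine unit tests with trajectory-level rubric checks. The directory layout, file names, and rubric format here are hypothetical illustrations, not the benchmark's actual task format or the Harbor harness's internals.

import json
import subprocess
from pathlib import Path

# Hypothetical task layout (illustrative only):
#   task/
#     tests/            unit tests run against the agent's workspace
#     rubric.json       criteria scored against the agent's trajectory
#     trajectory.jsonl  action log produced by the evaluation harness

def run_unit_tests(workspace: Path, timeout_s: int = 1800) -> bool:
    """Run the task's unit tests inside the sandbox; True means all tests pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "tests", "-q"],
        cwd=workspace,
        timeout=timeout_s,
        capture_output=True,
    )
    return result.returncode == 0

def score_rubric(rubric_path: Path, trajectory_path: Path) -> dict:
    """Score each rubric criterion against the recorded trajectory (simplified)."""
    rubric = json.loads(rubric_path.read_text())
    trajectory = trajectory_path.read_text()
    # Each criterion lists markers expected in the trajectory, e.g. evidence
    # that the agent ran its own tests before declaring the task complete.
    return {
        criterion["name"]: all(marker in trajectory for marker in criterion["markers"])
        for criterion in rubric["criteria"]
    }

if __name__ == "__main__":
    task_dir = Path("task")
    report = {
        "tests_passed": run_unit_tests(task_dir),
        "rubric": score_rubric(task_dir / "rubric.json", task_dir / "trajectory.jsonl"),
    }
    print(json.dumps(report, indent=2))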

Our benchmark is built to challenge even the most advanced frontier models. Tasks are constructed with experts in the loop, who confirm that every challenge is solvable in the environment in which it runs and verify the reliability of all dependencies. We have calibrated the tasks to span a range of difficulties, providing meaningful feedback for agents and models across the cost/performance spectrum, from those pursuing Pareto-optimal results to those delivering truly frontier-level capabilities.

Model Comparison

Evaluation Methodology

Models are evaluated on the Pass@5 metric through the Harbor evaluation harness. Each task has its own timeout, capped at an absolute maximum of 30 minutes for both the agent and the verifier.
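
For reference, pass@k is conventionally computed with the unbiased estimator popularized by HumanEval: given n sampled attempts per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). The sketch below assumes n = k = 5 independent runs per task (in which case the estimator reduces to whether any attempt passed); it illustrates the general metric, not the Harbor harness's internal implementation, and the task names and counts are made up.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled attempts
    passes, given that c of n attempts passed overall."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: per-task pass counts out of 5 attempts (illustrative numbers).
results = {"task_01": 3, "task_02": 0, "task_03": 5}
scores = [pass_at_k(n=5, c=c, k=5) for c in results.values()]
print(sum(scores) / len(scores))  # benchmark score = mean pass@5 over tasks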

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.