Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Overview

The Snorkel Agentic Coding benchmark comprises 100 multi-step coding tasks, evenly distributed across four difficulty tiers, designed to evaluate models across a diverse range of capabilities germane to real-world software engineering work.

Drawing on insights from our contributions to the Terminal-Bench project, our Agentic Coding tasks evaluate agents in fully sandboxed execution environments. Each task is paired with a human-validated reference solution, comprehensive unit tests, and scoring rubrics that assess both final outputs and the agent's trajectory. The current version of the benchmark spans a wide range of task categories, from typical software engineering tasks to advanced ML and data analytics, as well as build and dependency management. It tests agents on long-horizon planning, tracking tasks, evaluating and executing their own solutions, and recovering from errors and incorrect previous steps.
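
To make this concrete, the sketch below shows one way a sandboxed verifier could combine unit tests with trajectory-level rubric checks. The directory layout, file names, and rubric format here are hypothetical illustrations, not the benchmark's actual task format or the Harbor harness's internals.

import json
import subprocess
from pathlib import Path

# Hypothetical task layout (illustrative only):
#   task/
#     tests/            unit tests run against the agent's workspace
#     rubric.json       criteria scored against the agent's trajectory
#     trajectory.jsonl  action log produced by the evaluation harness

def run_unit_tests(workspace: Path, timeout_s: int = 1800) -> bool:
    """Run the task's unit tests inside the sandbox; True means all tests pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "tests", "-q"],
        cwd=workspace,
        timeout=timeout_s,
        capture_output=True,
    )
    return result.returncode == 0

def score_rubric(rubric_path: Path, trajectory_path: Path) -> dict:
    """Score each rubric criterion against the recorded trajectory (simplified)."""
    rubric = json.loads(rubric_path.read_text())
    trajectory = trajectory_path.read_text()
    # Each criterion lists markers expected in the trajectory, e.g. evidence
    # that the agent ran its own tests before declaring the task complete.
    return {
        criterion["name"]: all(marker in trajectory for marker in criterion["markers"])
        for criterion in rubric["criteria"]
    }

if __name__ == "__main__":
    task_dir = Path("task")
    report = {
        "tests_passed": run_unit_tests(task_dir),
        "rubric": score_rubric(task_dir / "rubric.json", task_dir / "trajectory.jsonl"),
    }
    print(json.dumps(report, indent=2))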

Our benchmark is built to challenge even the most advanced frontier models. Tasks are constructed with experts in the loop, who confirm that every challenge is solvable in the environment in which it runs and verify the reliability of all dependencies. We have calibrated the tasks to span a range of difficulties, providing meaningful feedback for agents and models across the cost/performance spectrum, from those pursuing Pareto-optimal results to those delivering truly frontier-level capabilities.

Model Comparison

Evaluation Methodology

Models are evaluated on the Pass@5 metric through the Harbor evaluation harness. Each task has its own timeout, capped at an absolute maximum of 30 minutes for both the agent and the verifier.
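
For reference, pass@k is conventionally computed with the unbiased estimator popularized by HumanEval: given n sampled attempts per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). The sketch below assumes n = k = 5 independent runs per task (in which case the estimator reduces to whether any attempt passed); it illustrates the general metric, not the Harbor harness's internal implementation, and the task names and counts are made up.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled attempts
    passes, given that c of n attempts passed overall."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: per-task pass counts out of 5 attempts (illustrative numbers).
results = {"task_01": 3, "task_02": 0, "task_03": 5}
scores = [pass_at_k(n=5, c=c, k=5) for c in results.values()]
print(sum(scores) / len(scores))  # benchmark score = mean pass@5 over tasks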

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.