Coding agents don’t need to be perfect, they need to recover

February 13, 2026

Error analysis of 8 models on Agentic Coding tasks

Successful completion of complex tasks doesn’t come from models always being right. It comes from models being resilient when things go wrong. To better understand model behavior in agentic environments, our team analyzed every error found in the full traces of tasks from our Agentic Coding benchmark, as completed by eight models. The analysis breaks down how models fail in agentic scenarios, revealing six key insights and highlighting recurring themes for continued research. For a primer on our Agentic Coding benchmark, please check out this blog post and our leaderboard.

(Note: As of this writing, our analysis of Opus 4.6 is underway, and we are still looking forward to GPT 5.3-Codex API access. We can’t wait to share those results as well, given the significant improvements in coding ability achieved by both models!)

Setup

  • Data: ~4,000 classified errors from 1,805 task runs across 99 unique tasks
  • Models: 8 frontier models evaluated (Claude Opus/Sonnet, GPT-5.2, Gemini 3 Pro, Grok 4.1, Kimi K2, Nemotron 3 Nano, Qwen 3 Coder)
  • Classification: Each error labeled by type, category, fatality, and recovery status using LLM-based extraction from agent trajectories
  • Taxonomy: 11 error categories and ~80 error types (e.g., command_not_found, dns_resolution_failure)
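To make the classification concrete, here is a minimal sketch of what one classified error record could look like. The `ErrorRecord` name, its field names, and the example values are assumptions for illustration only; the post does not publish its actual schema, and the category assigned to `command_not_found` below is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    """One classified error from an agent trajectory (hypothetical schema)."""

    task_id: str       # which benchmark task the run belongs to
    model: str         # which model produced the trajectory
    error_type: str    # one of the ~80 fine-grained types
    category: str      # one of the 11 coarse categories
    fatal: bool        # did this error end the run?
    recovered: bool    # did the agent work past the error?
    task_passed: bool  # final outcome of the task run


# Illustrative values only; this is not a row from the real dataset.
example = ErrorRecord(
    task_id="example-task",
    model="model-a",
    error_type="command_not_found",
    category="CLI & Invocation",  # assumed category assignment
    fatal=False,
    recovered=True,
    task_passed=True,
)
```

Every statistic in the insights below can be computed as a simple aggregate over records shaped like this one.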

Insight #1: Recovery differentiates passed and failed tasks

Recovery ability, not error avoidance, is the key differentiator between passed and failed tasks.

Passed and failed tasks encounter similar numbers of errors (2.09 vs 2.71 per task). The difference lies in what happens next: passed tasks recover from 95.0% of errors, while failed tasks only recover from 73.5% — a gap of 21.5 percentage points.
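The comparison above comes down to a recovery rate computed separately for errors in passed and failed runs. The sketch below shows that arithmetic on made-up toy data; the `recovery_rate` helper and the toy lists are assumptions, not the pipeline used for this analysis.

```python
def recovery_rate(recovered_flags):
    """Fraction of errors that were recovered from; None if there are no errors."""
    flags = list(recovered_flags)
    return sum(flags) / len(flags) if flags else None


# Toy data (not the real dataset): one boolean per error, True if recovered.
passed_errors = [True] * 19 + [False]      # 19 of 20 recovered
failed_errors = [True, True, True, False]  # 3 of 4 recovered

# Gap in percentage points between the two groups for the toy data.
toy_gap_pp = round((recovery_rate(passed_errors) - recovery_rate(failed_errors)) * 100, 1)

# The same subtraction applied to the figures reported above.
reported_gap_pp = 95.0 - 73.5

print(toy_gap_pp, reported_gap_pp)
```

Note that the gap is a difference in percentage points, not a relative change.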

Insight #2: Four model profile archetypes emerge

When we plot each model’s error frequency against recovery rate, distinct patterns emerge across models.

Key observations:

  • Claude Opus 4.5 achieves the best profile: fewest errors (2.09/task) with highest recovery (87.0%)
  • Qwen 3 Coder encounters the most errors (3.04/task) but maintains strong recovery (83.5%)—persistence pays off
  • Nemotron 3 Nano has a concerning profile: 42.0% of its errors are fatal, the highest among all models
  • Nemotron 3 Nano struggles on both dimensions: high error rate (2.97/task) with low recovery (70.2%)
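Each model's position in such a plot is defined by two numbers: errors per task and recovery rate. The sketch below computes both from toy data. It assumes "errors per task" means total errors divided by the number of task runs, including runs with no errors; the `model_profiles` function and the model names are hypothetical.

```python
from collections import defaultdict


def model_profiles(errors, runs_per_model):
    """errors: iterable of (model, recovered) pairs, one per classified error.
    runs_per_model: mapping of model name to its number of task runs."""
    totals = defaultdict(lambda: [0, 0])  # model -> [error count, recovered count]
    for model, recovered in errors:
        totals[model][0] += 1
        totals[model][1] += int(recovered)

    profiles = {}
    for model, runs in runs_per_model.items():
        n_errors, n_recovered = totals[model]
        profiles[model] = {
            "errors_per_task": n_errors / runs,
            "recovery_rate": n_recovered / n_errors if n_errors else None,
        }
    return profiles


# Toy data only; these are not results from the benchmark.
toy_errors = [("model-a", True), ("model-a", True), ("model-a", False), ("model-b", True)]
toy_runs = {"model-a": 2, "model-b": 1}
profiles = model_profiles(toy_errors, toy_runs)
print(profiles)
```

Plotting `errors_per_task` on one axis and `recovery_rate` on the other gives one point per model.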

Insight #3: The error landscape — common ≠ deadly

Not all error categories are created equal. The most frequent errors are often the most recoverable.

Key observations:

  • CLI & Invocation errors dominate (1,562 occurrences, 37% of all errors) but are highly recoverable (85% recovery rate)
  • Network errors are rare but deadly—only 35% recovery rate. When a model can’t reach an endpoint, it usually can’t fix that.
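Frequency and recoverability are separate aggregates over the same records, which is why a category can be common without being deadly. A minimal sketch on toy data follows; the `category_summary` helper and the short category labels are assumptions.

```python
from collections import Counter


def category_summary(errors):
    """errors: iterable of (category, recovered) pairs.
    Returns one row per category, most frequent first."""
    errors = list(errors)
    counts = Counter(cat for cat, _ in errors)
    recovered = Counter(cat for cat, ok in errors if ok)
    total = len(errors)
    return [
        {
            "category": cat,
            "count": n,
            "share": n / total,                    # how common the category is
            "recovery_rate": recovered[cat] / n,   # how survivable it is
        }
        for cat, n in counts.most_common()
    ]


# Toy data only: a common but recoverable category and a rare, unrecovered one.
toy = [("cli", True)] * 3 + [("cli", False)] + [("network", False)]
summary = category_summary(toy)
for row in summary:
    print(row)
```

Sorting by `count` and by `recovery_rate` gives two different rankings, which is the point of this insight.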

Insight #4: Unrecoverable errors end runs

Some error types are very hard to recover from, with recovery rates well below half. Understanding these helps explain why certain tasks fail.

Examples of hard-to-recover errors:

  • dns_resolution_failure (16% recovery, n=38)
    • Example: The hostname ‘waf’ cannot be resolved….
  • connection_refused_unreachable (39% recovery, n=31)
    • Example: Failed to connect to waf port 80 after 0 ms: Couldn’t connect to server…
  • process_crash_segfault (41% recovery, n=39)
    • Example: The process crashed with a SIGSEGV due to stack misalignment when calling system()….

Why these are hard:

  • DNS and connection failures indicate infrastructure/environment issues the model cannot fix
  • Process crashes terminate execution abruptly, often leaving little opportunity to recover
  • Missing output from pipelines leaves the model with nothing to work with

Insight #5: Recoverable errors have known fixes

On the other end, some errors have well-known solutions that models consistently apply.

Examples of successful recovery patterns:

  • externally_managed_environment (93% recovery)
    • Typical fix: Used --break-system-packages flag to install pipenv.
  • unknown_option_subcommand (95% recovery)
    • Typical fix: Rewrote the CLI to properly handle subcommands in Typer.
  • permission_denied_execute (96% recovery)
    • Typical fix: The agent added execute permissions using ‘chmod +x heapedit’.
  • dependency_resolution_unsatisfied (100% recovery)
    • Typical fix: The agent updated the Redis role to install redis-server directly instead of the redis metapackage.
  • service_manager_unavailable (100% recovery)
    • Typical fix: Modified the approach to start Redis directly using redis-server command instead of systemd.

Why these are easy:

  • Clear error messages that indicate the exact problem
  • Standard solutions (install package, use different flag, switch approach)
  • The error doesn’t corrupt state—the model can simply retry with a fix
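The pattern behind these recoveries is a detect, fix, retry loop: the error identifies the problem, a standard fix is applied, and the same step is attempted again. The sketch below simulates that loop in plain Python. The `run_with_recovery` helper, the in-memory `state` dict, and the mapping from exception type to fix are assumptions for illustration; real agents reason over tool output rather than Python exceptions.

```python
def run_with_recovery(action, fixes, max_attempts=3):
    """Call action(); on a known exception type, apply its fix and retry.
    Returns (result, number of attempts used). Unknown errors propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action(), attempt
        except Exception as exc:
            fix = fixes.get(type(exc))
            if fix is None or attempt == max_attempts:
                raise  # no known fix, or out of attempts
            fix()


# Simulated environment: a tool that lacks its execute bit until fixed.
state = {"executable": False}


def run_tool():
    if not state["executable"]:
        raise PermissionError("heapedit: Permission denied")
    return "ok"


def make_executable():
    # Stands in for a fix such as `chmod +x heapedit`.
    state["executable"] = True


result, attempts = run_with_recovery(run_tool, {PermissionError: make_executable})
print(result, attempts)
```

The loop only works because the failed attempt leaves state intact, which matches the last bullet above.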

Insight #6: Many models struggle on the hardest tasks

Some tasks generate errors for every model that attempts them.

heap-ctf leads with 217 total errors across 8 models. Tasks like this often involve:

  • Complex multi-step procedures (exploitation, reverse engineering)
  • Unfamiliar or specialized tooling
  • Environmental setup that differs from training data

Key Takeaways and Conclusion

These insights matter from two perspectives. For model builders, there is a great deal of opportunity in hill-climbing on error recovery as a core skill, teaching models more ways to adapt to adverse conditions. For benchmark task creators, it is critical to develop complex environments that generate realistic errors, simulating as thoroughly as possible the systems and infrastructure in which agents operate.

The scale and sophistication of what agents can do will be driven by how we challenge them. Snorkel Research is ready to partner with you, with the datasets and realistic environments that push the frontier of AI capabilities outward. Come talk to us!

Ramya Ramakrishnan
Applied Research Scientist
