Coding agents don’t need to be perfect; they need to recover

Error analysis of 8 models on Agentic Coding tasks

Successful completion of complex tasks doesn’t come from models always being right. It comes from models being resilient when things go wrong. To better understand model behavior in agentic environments, our team analyzed every error found in the full traces of tasks from our Agentic Coding benchmark, as completed by eight models. The analysis breaks down how models fail in agentic scenarios, yields six key insights, and highlights recurring themes for continued research. For a primer on our Agentic Coding benchmark, please check out this blog post and our leaderboard.

(Note: As of this writing, our analysis of Opus 4.6 is underway, and we are still looking forward to GPT 5.3-Codex API access. We can’t wait to share those results as well, given the significant improvements in coding ability achieved by both models!)

Setup

Data: ~4,000 classified errors from 1,805 task runs across 99 unique tasks
Models: 8 frontier models evaluated (Claude Opus/Sonnet, GPT-5.2, Gemini 3 Pro, Grok 4.1, Kimi K2, Nemotron 3 Nano, Qwen 3 Coder)
Classification: Each error labeled by type, category, fatality, and recovery status using LLM-based extraction from agent trajectories
Taxonomy: 11 error categories and ~80 error types (e.g., command_not_found, dns_resolution_failure)
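
To make the labeling scheme concrete, here is a minimal sketch of what one classified error record could look like. The field names, the enum values, and the `parse_record` helper are illustrative assumptions, not the schema used in the analysis; only the four labeled dimensions (type, category, fatality, recovery status) come from the setup above.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryStatus(Enum):
    # Hypothetical value names; the analysis only says recovery status is labeled.
    RECOVERED = "recovered"
    NOT_RECOVERED = "not_recovered"


@dataclass(frozen=True)
class ErrorRecord:
    task_id: str               # which of the unique tasks the run belongs to
    model: str                 # which model produced the trajectory
    error_type: str            # fine-grained type, e.g. "dns_resolution_failure"
    category: str              # coarse category the type rolls up into
    fatal: bool                # whether the error was labeled fatal
    recovery: RecoveryStatus   # whether the agent recovered from it


def parse_record(raw: dict) -> ErrorRecord:
    """Build a typed record from one extracted label dict (assumed shape)."""
    return ErrorRecord(
        task_id=raw["task_id"],
        model=raw["model"],
        error_type=raw["error_type"],
        category=raw["category"],
        fatal=bool(raw["fatal"]),
        # Raises ValueError if the label is not a known recovery status.
        recovery=RecoveryStatus(raw["recovery"]),
    )
```

Keeping the record immutable and validating the recovery label at parse time is one way to stop malformed LLM-extracted labels from silently skewing downstream rates.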

Insight #1: Recovery differentiates passed and failed tasks

Recovery ability, not error avoidance, is the key differentiator between passed and failed tasks.

Passed and failed tasks encounter similar numbers of errors (2.09 vs 2.71 per task). The difference lies in what happens next: passed tasks recover from 95.0% of errors, while failed tasks recover from only 73.5%, a gap of 21.5 percentage points.
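
The comparison above boils down to a grouped ratio. The sketch below shows one way it could be computed; the input shape (one `(task_passed, recovered)` pair per error) and the function name are assumptions, and the toy data is made up purely to exercise the function. Only the 95.0% and 73.5% figures in the final lines come from the analysis.

```python
from collections import defaultdict


def recovery_rate_by_outcome(errors):
    """Return {task_passed: recovery rate in percent}.

    `errors` is an iterable of (task_passed, recovered) boolean pairs,
    one pair per classified error (a hypothetical input shape).
    """
    totals = defaultdict(int)
    recovered = defaultdict(int)
    for task_passed, was_recovered in errors:
        totals[task_passed] += 1
        if was_recovered:
            recovered[task_passed] += 1
    return {
        outcome: 100.0 * recovered[outcome] / totals[outcome]
        for outcome in totals
    }


# Toy data, NOT the benchmark's: 20 errors in passed tasks, 4 in failed tasks.
toy_errors = (
    [(True, True)] * 19 + [(True, False)]
    + [(False, True)] * 3 + [(False, False)]
)
rates = recovery_rate_by_outcome(toy_errors)
print(rates[True], rates[False])  # 95.0 75.0

# The gap reported in the text, in percentage points.
reported_gap = round(95.0 - 73.5, 1)
print(reported_gap)  # 21.5
```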

Insight #2: Four model profile archetypes emerge

When we plot each model’s error frequency against recovery rate, distinct patterns emerge across models.

Key observations:

Claude Opus 4.5 achieves the best profile: the fewest errors (2.09/task) with the highest recovery (87.0%)
Qwen 3 Coder encounters the most errors (3.04/task) but maintains strong recovery (83.5%), so persistence pays off
Nemotron 3 Nano has a concerning profile: 42.0% of its errors are fatal, the highest share among all models
Nemotron 3 Nano also struggles on both dimensions: a high error rate (2.97/task) with low recovery (70.2%)
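
One simple way to turn the two axes into four profiles is a quadrant split. The sketch below is an assumption-heavy illustration: the cutoffs (2.5 errors per task, 80% recovery) and the quadrant labels are hypothetical, chosen only so that the three models quoted above land in different quadrants. They are not the thresholds or archetype names used in the analysis.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, picked only to separate the models quoted in the text.
ERRORS_PER_TASK_CUTOFF = 2.5
RECOVERY_RATE_CUTOFF = 80.0


@dataclass(frozen=True)
class ModelProfile:
    name: str
    errors_per_task: float
    recovery_rate: float  # percent


def archetype(profile: ModelProfile) -> str:
    """Assign a profile to one of four quadrants (labels are illustrative)."""
    few_errors = profile.errors_per_task < ERRORS_PER_TASK_CUTOFF
    recovers_well = profile.recovery_rate >= RECOVERY_RATE_CUTOFF
    if few_errors and recovers_well:
        return "few errors, strong recovery"
    if recovers_well:
        return "many errors, strong recovery"
    if few_errors:
        return "few errors, weak recovery"
    return "many errors, weak recovery"


# Figures quoted in the observations above.
profiles = [
    ModelProfile("Claude Opus 4.5", 2.09, 87.0),
    ModelProfile("Qwen 3 Coder", 3.04, 83.5),
    ModelProfile("Nemotron 3 Nano", 2.97, 70.2),
]
for p in profiles:
    print(f"{p.name}: {archetype(p)}")
```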

Insight #3: The error landscape — common ≠ deadly

Not all error categories are created equal. The most frequent errors are often the most recoverable.

Key observations:

CLI & Invocation errors dominate (1,562 occurrences, 37% of all errors) but are highly recoverable (85% recovery rate)
Network errors are rare but deadly, with only a 35% recovery rate. When a model can’t reach an endpoint, it usually can’t fix that.

Insight #4: Unrecoverable errors end runs

Some error types are nearly impossible to recover from. Understanding these helps explain why certain tasks fail.

Examples of hard-to-recover errors:

dns_resolution_failure (16% recovery, n=38)
- Example: The hostname ‘waf’ cannot be resolved….
connection_refused_unreachable (39% recovery, n=31)
- Example: Failed to connect to waf port 80 after 0 ms: Couldn’t connect to server…
process_crash_segfault (41% recovery, n=39)
- Example: The process crashed with a SIGSEGV due to stack misalignment when calling system()….

Why these are hard:

DNS and connection failures indicate infrastructure/environment issues the model cannot fix
Process crashes terminate execution abruptly with no opportunity to recover
Missing output from pipelines leaves the model with nothing to work with
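
For a harness that wants to spot these situations early, the sketch below shows one possible detector. It is a rough illustration under stated assumptions: the stderr substrings are modeled on common curl messages (the analysis itself used LLM-based extraction, not string matching), and the "stop retrying" policy for environment-level errors is a hypothetical design choice. The error type names are the ones listed above.

```python
import signal

# Assumed substrings, modeled on common curl output; real traces would need
# a broader, tool-specific pattern set.
STDERR_PATTERNS = {
    "Could not resolve host": "dns_resolution_failure",
    "Failed to connect to": "connection_refused_unreachable",
}

# Error types that point at the environment rather than the agent's command.
ENVIRONMENT_LEVEL = {"dns_resolution_failure", "connection_refused_unreachable"}


def classify_failure(returncode, stderr):
    """Map a finished command to one of the error types above, or None."""
    # On POSIX, Python's subprocess reports death-by-signal as -signum.
    if returncode == -signal.SIGSEGV:
        return "process_crash_segfault"
    for needle, error_type in STDERR_PATTERNS.items():
        if needle in stderr:
            return error_type
    return None


def should_stop_retrying(error_type):
    """Hypothetical policy: do not retry errors the agent cannot fix itself."""
    return error_type in ENVIRONMENT_LEVEL
```

Under this policy, a segfault is still worth another attempt (the agent may be able to change the code that crashed), while a DNS failure is surfaced immediately instead of burning steps on retries.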

Insight #5: Recoverable errors have known fixes

On the other end, some errors have well-known solutions that models consistently apply.

Examples of successful recovery patterns:

externally_managed_environment (93% recovery)
- Typical fix: Used the --break-system-packages flag to install pipenv.
unknown_option_subcommand (95% recovery)
- Typical fix: Rewrote the CLI to properly handle subcommands in Typer.
permission_denied_execute (96% recovery)
- Typical fix: The agent added execute permissions using ‘chmod +x heapedit’.
dependency_resolution_unsatisfied (100% recovery)
- Typical fix: The agent updated the Redis role to install redis-server directly instead of the redis metapackage.
service_manager_unavailable (100% recovery)
- Typical fix: Modified the approach to start Redis directly using redis-server command instead of systemd.

Why these are easy:

Clear error messages that indicate the exact problem
Standard solutions (install package, use different flag, switch approach)
The error doesn’t corrupt state, so the model can simply retry with a fix
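
Because these fixes are so regular, they can be written down as a small lookup from error type to retry plan. The sketch below covers two of the patterns listed above; the `plan_recovery` function and its return shape are hypothetical, and a real agent would choose fixes from context rather than from a fixed table.

```python
def plan_recovery(error_type, argv):
    """Return the commands to run as a retry, or [] if no known fix applies.

    `argv` is the failed command as a list of strings. Nothing is executed
    here; the caller decides whether to run the plan.
    """
    if error_type == "externally_managed_environment":
        # pip refuses to modify a system-managed Python unless told otherwise.
        if "install" in argv and "--break-system-packages" not in argv:
            return [argv + ["--break-system-packages"]]
        return []
    if error_type == "permission_denied_execute":
        # Make the target executable, then rerun the original command.
        return [["chmod", "+x", argv[0]], argv]
    return []


print(plan_recovery("externally_managed_environment", ["pip", "install", "pipenv"]))
print(plan_recovery("permission_denied_execute", ["./heapedit"]))
```

Returning a plan instead of executing it keeps the function side-effect free, which matters because `--break-system-packages` overrides a safety guard and is usually only appropriate in a disposable sandbox.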

Insight #6: Many models struggle on the hardest tasks

Some tasks generate errors for every model that attempts them.

heap-ctf leads with 217 total errors across 8 models. Tasks like this often involve:

Complex multi-step procedures (exploitation, reverse engineering)
Unfamiliar or specialized tooling
Environmental setup that differs from training data

Key Takeaways and Conclusion

These insights matter from multiple perspectives. For model builders, there is a great deal of opportunity in hill-climbing on error recovery as a core skill, teaching models additional ways to adapt to adverse conditions. For benchmark task creators, it is critical to develop complex environments that generate realistic errors, simulating as thoroughly as possible the systems and infrastructure with which agents interact.

The scale and sophistication of what agents can do will be driven by how we challenge them. Snorkel Research is ready to partner with you, with the datasets and realistic environments that push the frontier of AI capabilities outward. Come talk to us!