SWE-Bench-CLI+
Stress-test & train frontier coding agents the way real software engineers work
SWE-Bench CLI+ is training and evaluation data for frontier coding agents, built from real software engineering tasks in production repositories. Each task drops an agent into a live codebase with full terminal access, so it can read failures, iterate, and verify its own work. Source PRs are merged shortly before task authoring, which makes them less likely to appear in model training data, so your signal survives the next model release instead of going stale.
Built by Snorkel's AI Data Research Lab, the SWE-Bench CLI+ data series spans thousands of Harbor-format tasks across the full range of engineering work, from quick fixes to multi-file refactors, to provide stronger evaluation and training signal than patch-only benchmarks.
Coding tasks that span the full range of real engineering work
Fix
Fixing known bugs and faults in the codebase
Feature
Introducing new internal or user-facing features
Refactor
Restructuring existing code for maintainability
Build
Changing build configurations
Performance
Improving performance, such as reducing memory consumption
Chore
Project-wide housekeeping: dependency bumps, version increments, and cleanup
SWE-Bench-CLI+ is intentionally calibrated to stress-test state-of-the-art coding agents
Built for frontier model evaluation and training:
- Empirical difficulty tiers measured against current frontier models, not author judgment
- Frontier-tier subset where leading models score as low as 21% Pass@1
- Multiple languages across the mix, not Python-only
If your agent succeeds here, it performs in production.
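For readers unfamiliar with the metric cited above, Pass@1 is commonly defined as the share of tasks an agent solves on a single attempt. The sketch below shows one way to estimate it from recorded attempts. The function name, the (task_id, passed) input shape, and the sample data are illustrative assumptions, not part of the SWE-Bench CLI+ tooling.

```python
from collections import defaultdict


def pass_at_1(results):
    """Estimate Pass@1 from (task_id, passed) pairs, one pair per attempt.

    Hypothetical helper: when a task has several attempts, its score is the
    fraction of attempts that passed; the result is the mean over tasks.
    """
    attempts_by_task = defaultdict(list)
    for task_id, passed in results:
        attempts_by_task[task_id].append(bool(passed))

    if not attempts_by_task:
        raise ValueError("no results to score")

    # Per-task pass rate, then an unweighted average across tasks.
    per_task_rates = [
        sum(attempts) / len(attempts) for attempts in attempts_by_task.values()
    ]
    return sum(per_task_rates) / len(per_task_rates)


# Made-up sample data: two tasks, two attempts each.
sample = [
    ("task-a", True),
    ("task-a", False),
    ("task-b", False),
    ("task-b", False),
]
score = pass_at_1(sample)
print(f"Pass@1: {score:.2f}")
```

Averaging per task first keeps a task with many recorded attempts from outweighing one with few, which is one reasonable choice among several.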
Why the Snorkel Data Series
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
Expert review
Expert contributors independently review solution correctness, prompt clarity, test reliability, and tagging.
Quality control checks
Snorkel's proprietary quality control models validate every component of a task.
Difficulty validation
Task difficulty is measured empirically against frontier models and agents.
Distribution guardrails
Diversity across language, task type, and difficulty is controlled through task metadata.