Stress-test & train frontier coding agents the way real software engineers work

SWE-Bench CLI+ is training and evaluation data for frontier coding agents, built from real software engineering tasks in production repositories. Each task drops an agent into a live codebase with full terminal access to read failures, iterate, and verify its own work. Source PRs are merged shortly before task authoring, which makes them less likely to appear in model training data, so your signal survives the next model release instead of going stale.

Built by Snorkel's AI Data Research Lab, the SWE-Bench CLI+ data series spans thousands of Harbor-format tasks across the full range of engineering work, from quick fixes to multi-file refactors. It is designed to provide stronger evaluation and training signal than patch-only benchmarks.

Coding tasks that span the full range of real engineering work

Fix
Fixing known bugs and faults in the codebase

Feature
Introducing new internal or user-facing features

Refactor
Restructuring existing code for maintainability

Build
Changing build configurations

Performance
Improving performance, such as reducing memory consumption

Chore
Project-wide housekeeping: dependency bumps, version increments, and cleanup

SWE-Bench CLI+ is intentionally calibrated to stress-test state-of-the-art coding agents

Built for frontier model evaluation and training:

Empirical difficulty tiers measured against current frontier models, not author judgment
Frontier-tier subset where leading models score as low as 21% Pass@1
Multiple languages across the mix, not Python-only
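
For readers who want to reproduce a Pass@1 figure on their own runs, here is a minimal sketch. This page does not say how Pass@1 is computed for SWE-Bench CLI+, so the sketch assumes the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of attempts that pass. The function names and task IDs are hypothetical and not part of any Snorkel tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of attempts sampled for the task
    c: number of those attempts that passed the task's tests
    k: attempt budget being scored (k=1 gives Pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-task Pass@1 over a set of tasks.

    `results` maps a (hypothetical) task ID to one pass/fail flag per attempt.
    """
    scores = [pass_at_k(len(runs), sum(runs), 1) for runs in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical run log: two attempts per task.
    runs = {
        "task-001": [True, False],
        "task-002": [False, False],
    }
    print(f"Pass@1 = {benchmark_pass_at_1(runs):.2f}")
```

Averaging per task, rather than pooling all attempts, keeps a task with many attempts from outweighing one with few.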

If your agent succeeds here, it is better prepared for production work.

Why the Snorkel Data Series

High-volume quarterly drops

Multi-layer quality pipeline

Unified execution environment

Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

Expert review

Expert contributors independently review solution correctness, prompt clarity, test reliability, and tagging.

Quality control checks

Snorkel's proprietary quality control models validate every component of a task.

Difficulty validation

Task difficulty is measured empirically against frontier models and agents.

Distribution guardrails

Diversity across language, task type, and difficulty is controlled through task metadata.
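
As an illustration of how metadata-driven guardrails can work in principle, the sketch below flags any metadata value whose share of a task set exceeds a cap. It is not Snorkel's pipeline: the field names (language, task_type, difficulty), the tier labels, and the caps are all assumptions made for the example.

```python
from collections import Counter


def share_by(tasks: list[dict], field: str) -> dict[str, float]:
    """Fraction of tasks carrying each value of one metadata field."""
    counts = Counter(task[field] for task in tasks)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def guardrail_violations(
    tasks: list[dict], max_share: dict[str, float]
) -> list[tuple[str, str, float]]:
    """List (field, value, share) for every value above its field's cap.

    `max_share` maps a metadata field to the largest share any single
    value of that field may take (hypothetical thresholds).
    """
    violations = []
    for field, cap in max_share.items():
        for value, share in share_by(tasks, field).items():
            if share > cap:
                violations.append((field, value, share))
    return violations


if __name__ == "__main__":
    # Hypothetical task metadata; the schema is assumed, not documented.
    tasks = [
        {"language": "python", "task_type": "Fix", "difficulty": "frontier"},
        {"language": "python", "task_type": "Feature", "difficulty": "hard"},
        {"language": "python", "task_type": "Refactor", "difficulty": "hard"},
        {"language": "go", "task_type": "Build", "difficulty": "medium"},
    ]
    # Flag any language above 50% of tasks or any task type above 40%.
    for field, value, share in guardrail_violations(
        tasks, {"language": 0.5, "task_type": 0.4}
    ):
        print(f"{field}={value} takes {share:.0%} of tasks")
```

A check like this can run before each data drop, so a skew toward one language or task type is caught while it can still be corrected.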

Use the Snorkel Data Series to train coding agents that can navigate, patch, and verify real codebases

Talk to a researcher