SNORKEL DATA SERIES //
Agentic Coding //

SWE-Bench-CLI+

Stress-test & train frontier coding agents the way real software engineers work

SWE-Bench CLI+ is training and evaluation data for frontier coding agents, built from real software engineering tasks in production repositories. Each task drops an agent into a live codebase with full terminal access, so it can read failures, iterate, and verify its own work. Source PRs are merged shortly before authoring, which lowers the chance that a task already appears in a model's training data, so your signal survives the next model release instead of going stale.

Built by Snorkel's AI Data Research Lab, the SWE-Bench CLI+ data series spans thousands of Harbor-format tasks across the full range of engineering work, from quick fixes to multi-file refactors. It is designed to provide stronger evaluation and training signal than patch-only benchmarks.


Coding tasks that span the full range of real engineering work

  • Fix
    Fixing known bugs and faults in the codebase

  • Feature
    Introducing new internal or user-facing features

  • Refactor
    Restructuring existing code for maintainability

  • Build
    Changing build configurations

  • Performance
    Improving performance, such as reducing memory consumption

  • Chore
    Project-wide housekeeping: dependency bumps, version increments, and cleanup

SWE-Bench-CLI+ is intentionally calibrated to stress-test state-of-the-art coding agents

Built for frontier model evaluation and training:

  • Empirical difficulty tiers measured against current frontier models, not author judgment
  • Frontier-tier subset where leading models score as low as 21% Pass@1
  • Tasks in multiple programming languages, not Python only
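
To make the idea of empirical tiers concrete, here is a minimal sketch of how per-task Pass@1 could be estimated from repeated agent attempts and then bucketed into tiers. It is an illustration only: the function names, the tier labels, and the thresholds are assumptions, not the method or cut-offs Snorkel uses.

```python
from collections import defaultdict


def pass_at_1(attempts):
    """Estimate per-task Pass@1 as the fraction of attempts that passed.

    `attempts` is an iterable of (task_id, passed) pairs, with one or more
    attempts per task. With n attempts and c passes, the Pass@1 estimate
    for a task is c / n.
    """
    per_task = defaultdict(list)
    for task_id, passed in attempts:
        per_task[task_id].append(bool(passed))
    return {task: sum(runs) / len(runs) for task, runs in per_task.items()}


def difficulty_tier(rate, easy_at=0.8, frontier_below=0.3):
    """Map a pass rate to a tier. Labels and thresholds are hypothetical."""
    if rate >= easy_at:
        return "easy"
    if rate >= frontier_below:
        return "medium"
    return "frontier"


# Hypothetical results: two attempts per task by one reference model.
attempts = [
    ("task-a", True), ("task-a", True),
    ("task-b", False), ("task-b", True),
    ("task-c", False), ("task-c", False),
]
rates = pass_at_1(attempts)
tiers = {task: difficulty_tier(rate) for task, rate in rates.items()}
```

The point of the sketch is that the tier comes from measured pass rates rather than from an author's guess about how hard a task looks.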

If your agent succeeds here, it performs in production.

Why the Snorkel Data Series

High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

01

Expert review

Expert contributors independently review solution correctness, prompt clarity, test reliability, and tagging.

02

Quality control checks

Snorkel's proprietary quality control models validate every component of a task.

03

Difficulty validation

Task difficulty is measured empirically against frontier models and agents.

04

Distribution guardrails

Diversity across language, task type, and difficulty is controlled through task metadata.
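
One way such a guardrail could work is a share check over task metadata, sketched below. The metadata field names (`language`, `task_type`) and the 50% cap are hypothetical and chosen only for illustration; the actual metadata schema and limits are not described here.

```python
from collections import Counter


def over_represented(tasks, field, max_share):
    """Return the values of `field` whose share of tasks exceeds `max_share`.

    `tasks` is a list of metadata dicts. The result maps each offending
    value to its observed share of the batch.
    """
    counts = Counter(task[field] for task in tasks)
    total = sum(counts.values())
    return {
        value: count / total
        for value, count in counts.items()
        if count / total > max_share
    }


# Hypothetical batch of task metadata.
batch = [
    {"language": "python", "task_type": "fix"},
    {"language": "python", "task_type": "feature"},
    {"language": "python", "task_type": "fix"},
    {"language": "go", "task_type": "refactor"},
]
language_flags = over_represented(batch, "language", max_share=0.5)
type_flags = over_represented(batch, "task_type", max_share=0.5)
```

In this batch Python makes up three of four tasks, so it is flagged, while no task type exceeds the cap.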


Use the Snorkel Data Series to train coding agents that can navigate, patch, and verify real codebases