SNORKEL DATA SERIES //
Agentic Coding //
Terminal-Bench+
Stress-test & train frontier coding agents in real CLI environments
The Agentic Coding Data Series captures real-world software engineering — from multi-step reasoning and iterative debugging to environment manipulation and tool use inside realistic command-line environments.
Developed by Snorkel’s AI Data Research Lab, Terminal-Bench+ provides thousands of high-signal, T-Bench 2.0-style tasks — built to challenge and improve frontier models.
REQUEST DATA SAMPLES //
Coding task categories include:
- System & environment setup
- Build / compilation / dependency management
- Data processing / ETL / scripting
- Interactive simulations & games
- Software engineering workflows
- ML training & inference
- Debugging & repair tasks
- Security & cryptography
- Scientific computing
Terminal-Bench+ is intentionally calibrated to stress-test state-of-the-art models
Built for frontier model evaluation:
- Tiered difficulty from Core to Frontier
- <40% accuracy on Frontier tasks across leading models
- Designed for RL training, benchmarking, and deployment validation
If your agent succeeds here, it performs in production.
Why the Snorkel Data Series
High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
01
Human review
SMEs verify clarity, correctness, and full solvability.
02
LLM-assisted validation
Automated checks flag instruction-test mismatches and missing constraints.
03
Deterministic testing
Code-based unit tests validate compliance, syntax, formatting, and outcomes.
04
Guardrails
Additional checks catch cheating paths, non-determinism, and reward hacking.
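To make the "deterministic testing" step concrete, here is a minimal sketch of what a code-based check for a CLI task can look like. This is an illustrative example only, not Snorkel's actual harness; the task (`check_csv_sort_task`) and the file name `sorted.csv` are hypothetical.

```python
import subprocess

# Hypothetical deterministic check for a CLI task: the agent was asked to
# produce a sorted CSV in its working directory. The function name, task,
# and file name are illustrative, not Snorkel's actual test harness.
def check_csv_sort_task(workdir: str) -> bool:
    """Verify the agent produced a well-formed, sorted output file."""
    # Outcome check: the expected artifact exists and is non-empty.
    result = subprocess.run(
        ["cat", f"{workdir}/sorted.csv"], capture_output=True, text=True
    )
    if result.returncode != 0 or not result.stdout.strip():
        return False
    rows = result.stdout.strip().splitlines()
    # Formatting check: every row has the same column count as the header.
    width = rows[0].count(",")
    if any(r.count(",") != width for r in rows):
        return False
    # Compliance check: data rows are sorted by the first column.
    keys = [r.split(",")[0] for r in rows[1:]]
    return keys == sorted(keys)
```

Because the check runs real commands against the task's file system and asserts exact properties of the output, it is fully reproducible: the same agent transcript always scores the same way.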
Accelerate agent performance with verifiable, multi-step CLI environments from the Snorkel Data Series
Talk to a researcher