SNORKEL DATA SERIES //
Agentic Coding //
Terminal-Bench+
Stress-test & train frontier coding agents in real CLI environments
The Agentic Coding Data Series captures real-world software engineering — from multi-step reasoning and iterative debugging to environment manipulation and tool use inside realistic command-line environments.
Developed by Snorkel’s AI Data Research Lab, Terminal-Bench+ provides thousands of high-signal, T-Bench 2.0-style tasks — built to challenge and improve frontier models.
REQUEST DATA SAMPLES //
Coding task categories include:
- System & environment setup
- Build / compilation / dependency management
- Data processing / ETL / scripting
- Interactive simulations & games
- Software engineering workflows
- ML training & inference
- Debugging & repair tasks
- Security & cryptography
- Scientific computing
Terminal-Bench+ is intentionally calibrated to stress-test state-of-the-art models
Built for frontier model evaluation:
- Tiered difficulty from Core to Frontier
- <40% accuracy on Frontier tasks across leading models
- Designed for RL training, benchmarking, and deployment validation
If your agent succeeds here, it performs in production.
Why the Snorkel Data Series
High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
01
Human review
SMEs verify clarity, correctness, and full solvability.
02
LLM-assisted validation
Automated checks flag instruction-test mismatches and missing constraints.
03
Deterministic testing
Code-based unit tests validate compliance, syntax, formatting, and outcomes.
04
Guardrails
Additional checks catch cheating paths, non-determinism, and reward hacking.
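To make the "deterministic testing" step concrete, here is a minimal sketch of what a code-based check for a CLI task can look like. This is an illustrative example only, not Snorkel's actual harness; the task (`check_csv_sort_task`) and the file name `sorted.csv` are hypothetical.

```python
import subprocess

# Hypothetical deterministic check for a CLI task: the agent was asked to
# produce a sorted CSV in its working directory. The function name, task,
# and file name are illustrative, not Snorkel's actual test harness.
def check_csv_sort_task(workdir: str) -> bool:
    """Verify the agent produced a well-formed, sorted output file."""
    # Outcome check: the expected artifact exists and is non-empty.
    result = subprocess.run(
        ["cat", f"{workdir}/sorted.csv"], capture_output=True, text=True
    )
    if result.returncode != 0 or not result.stdout.strip():
        return False
    rows = result.stdout.strip().splitlines()
    # Formatting check: every row has the same column count as the header.
    width = rows[0].count(",")
    if any(r.count(",") != width for r in rows):
        return False
    # Compliance check: data rows are sorted by the first column.
    keys = [r.split(",")[0] for r in rows[1:]]
    return keys == sorted(keys)
```

Because the check runs real commands against the task's file system and asserts exact properties of the output, it is fully reproducible: the same agent transcript always scores the same way.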
Accelerate agent performance with verifiable, multi-step CLI environments from the Snorkel Data Series
Talk to a researcher