SNORKEL DATA SERIES //
Agentic Coding //
Terminal-Bench+

Stress-test & train frontier coding agents in real CLI environments

The Agentic Coding Data Series captures real-world software engineering — from multi-step reasoning and iterative debugging to environment manipulation and tool use inside realistic command-line environments.

Developed by Snorkel’s AI Data Research Lab, Terminal-Bench+ provides thousands of high-signal, T-Bench 2.0-style tasks — built to challenge and improve frontier models.

REQUEST DATA SAMPLES //

Coding task categories include:

  • System & environment setup
  • Build / compilation / dependency management
  • Data processing / ETL / scripting
  • Interactive simulations & games
  • Software engineering workflows
  • ML training & inference
  • Debugging & repair tasks
  • Security & cryptography
  • Scientific computing

Terminal-Bench+ is intentionally calibrated to stress-test state-of-the-art models

Built for frontier model evaluation:

  • Tiered difficulty from Core to Frontier
  • <40% accuracy on Frontier tasks across leading models
  • Designed for RL training, benchmarking, and deployment validation

If your agent succeeds here, it performs in production.

Why the Snorkel Data Series

  • High-volume quarterly drops
  • Multi-layer quality pipeline
  • Unified execution environment
  • Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

01  Human review
SMEs verify clarity, correctness, and full solvability.

02  LLM-assisted validation
Automated checks flag instruction-test mismatches and missing constraints.

03  Deterministic testing
Code-based unit tests validate compliance, syntax, formatting, and outcomes.

04  Guardrails
Additional checks catch cheating paths, non-determinism, and reward hacking.
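As a rough illustration of step 03, a deterministic check for a CLI task runs the candidate solution against a fixed input and compares the output byte-for-byte against a known-good result. The sketch below is hypothetical (the sort task, file names, and validator are illustrative, not Snorkel's actual pipeline):

```python
import os
import subprocess
import tempfile


def validate_sort_task(solution_cmd: str) -> bool:
    """Deterministically validate a hypothetical CLI task:
    given a fixed input file, the command must produce a
    sorted.txt whose contents exactly match the expected output."""
    with tempfile.TemporaryDirectory() as wd:
        # Fixed input: same bytes every run, so the check is deterministic.
        with open(os.path.join(wd, "input.txt"), "w") as f:
            f.write("3\n1\n2\n")

        # Run the candidate solution in an isolated working directory.
        result = subprocess.run(
            solution_cmd,
            cwd=wd,
            shell=True,
            capture_output=True,
            text=True,
            timeout=60,
        )
        if result.returncode != 0:
            return False

        # Outcome check: the expected artifact must exist and match exactly.
        out_path = os.path.join(wd, "sorted.txt")
        if not os.path.exists(out_path):
            return False
        with open(out_path) as f:
            return f.read() == "1\n2\n3\n"
```

Because the input and expected output are pinned, the same solution always scores the same way, which is what makes these tests usable as reward signals for RL training.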

Accelerate agent performance using verifiable, multi-step CLI environments with the Snorkel Data Series

Talk to a researcher