Agentic AI
Challenge & train frontier agents in high-fidelity enterprise simulations
Snorkel's Enterprise Environments are a collection of simulated real-world companies, built by domain experts to train AI agents on enterprise workflows. Each environment includes thousands of tau-bench-style tasks, complete with databases, company policy documents, domain tools, and simulated users that reflect the complexity agents must operate in.
On the Insurance Underwriting environment, Snorkel has helped models improve from 10.9% to 42.0% pass@1.
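For readers unfamiliar with the metric, pass@1 is commonly computed as the share of tasks an agent solves on a single attempt, averaged over independent trials when more than one is run. The sketch below shows that common definition only. The `pass_at_1` function, the task IDs, and the trial outcomes are illustrative assumptions, not Snorkel's scoring code or data.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average, over tasks, of the fraction of independent trials that succeeded.

    `results` maps a task ID to the outcomes of its trials (hypothetical format).
    """
    per_task = [sum(trials) / len(trials) for trials in results.values() if trials]
    if not per_task:
        raise ValueError("no trials to score")
    return mean(per_task)


# Made-up outcomes for three tasks, four trials each.
example = {
    "task-001": [True, False, False, False],
    "task-002": [True, True, True, True],
    "task-003": [False, False, False, False],
}
score = pass_at_1(example)
print(f"pass@1 = {score:.1%}")
```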
Developed by Snorkel’s AI Data Research Lab, each environment is authored and validated by teams of domain experts and calibrated against frontier models.
- Underwriting (Property & Casualty Insurance Company)
- Finance (Investment Banking / Equity Research)
Enterprise Environments are intentionally calibrated to stress-test state-of-the-art agents
Built for frontier model evaluation and training:
- Empirical difficulty tiers measured against current frontier models
- A Frontier tier where today's leading models score below 40%
- Tasks that require agents to plan, use domain tools, and safely update system state, not just retrieve answers
If your agent succeeds here, it performs in production.
Why the Snorkel Data Series
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
Expert review
Subject-matter experts review every task for oracle correctness, prompt clarity, environment correctness, and tagging.
Programmatic checks
Automated validation ensures every task manifest is complete and not a trivial variation of an existing task.
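A minimal sketch of what checks of this kind could look like, assuming a manifest is a dictionary and that "trivial variation" is approximated by string similarity between prompts. The field names in `REQUIRED_FIELDS`, the 0.9 threshold, and both helper functions are hypothetical and are not taken from Snorkel's pipeline.

```python
from difflib import SequenceMatcher

# Hypothetical manifest schema, for illustration only.
REQUIRED_FIELDS = {"task_id", "prompt", "tools", "expected_state"}


def missing_fields(manifest: dict) -> set[str]:
    """Return required fields that are absent or empty in the manifest."""
    return {field for field in REQUIRED_FIELDS if not manifest.get(field)}


def is_near_duplicate(prompt: str, existing_prompts: list[str], threshold: float = 0.9) -> bool:
    """Flag a prompt whose similarity ratio to any existing prompt meets the threshold."""
    return any(
        SequenceMatcher(None, prompt.lower(), other.lower()).ratio() >= threshold
        for other in existing_prompts
    )


existing = ["Approve the renewal for policy 1001."]
candidate = {
    "task_id": "uw-0002",
    "prompt": "Approve the renewal for policy 1002.",
    "tools": ["lookup_policy"],
    # "expected_state" is deliberately left out to trigger the completeness check.
}
print(missing_fields(candidate))
print(is_near_duplicate(candidate["prompt"], existing))
```

A production check would likely compare tool calls and expected end state as well as prompt text, since two tasks can read alike but exercise different behavior.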
Distribution guardrails
New submissions are accepted only if they maintain dataset-level balance across tool domains, difficulty levels, and categories.
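One simple way to express such a guardrail is a cap on the share any single label may hold after a new task is added. This is a sketch under that assumption. The `keeps_balance` function, the 50% cap, and the difficulty labels are hypothetical, and the actual acceptance rule may differ.

```python
from collections import Counter


def keeps_balance(existing_labels: list[str], new_label: str, max_share: float = 0.5) -> bool:
    """Accept a new task only if its label stays at or under max_share of the dataset."""
    counts = Counter(existing_labels)
    counts[new_label] += 1
    total = sum(counts.values())
    return counts[new_label] / total <= max_share


# Made-up difficulty labels for a four-task dataset.
difficulty = ["easy", "easy", "medium", "hard"]
print(keeps_balance(difficulty, "easy"))
print(keeps_balance(difficulty, "hard"))
```

The same check can be run once per dimension (tool domain, difficulty level, category), with a submission accepted only if all of them pass.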
Judge stability
LLM-as-judge scoring uses fixed parameters and is benchmarked against human annotators to control variance.
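Agreement between a judge and human annotators is often summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses that statistic as an illustrative choice. The source does not say which agreement measure is used, and the labels are made up.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on eight task transcripts.
judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
human_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```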