Challenge & train frontier agents in high-fidelity enterprise simulations

Snorkel's Enterprise Environments are a collection of simulated real-world companies, built by domain experts to train AI agents on enterprise workflows. Each environment includes thousands of tau-bench-style tasks, complete with databases, company policy documents, domain tools, and simulated users that reflect the complexity agents must operate in.

Snorkel has helped models improve from 10.9% to 42.0% pass@1 on the Insurance Underwriting environment.
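
For readers new to the metric, pass@1 is the share of tasks an agent solves in a single attempt. The sketch below shows one common way to estimate it from repeated attempts. The input layout (a mapping from task ID to pass/fail outcomes) is an assumption made for illustration, not Snorkel's evaluation harness.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Estimate pass@1 as the mean per-task success rate.

    `results` maps a task ID to the pass/fail outcome of each attempt
    (hypothetical layout, chosen for this example only).
    """
    if not results:
        raise ValueError("no tasks to score")
    per_task = []
    for task_id, attempts in results.items():
        if not attempts:
            raise ValueError(f"task {task_id!r} has no attempts")
        # The fraction of passing attempts estimates the probability
        # that one attempt on this task succeeds.
        per_task.append(sum(attempts) / len(attempts))
    return mean(per_task)
```

With a single attempt per task, this reduces to the plain fraction of tasks passed.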

Developed by Snorkel’s AI Data Research Lab, each environment is authored and validated by teams of domain experts and calibrated against frontier models.

Available now

Underwriting (Property & Casualty Insurance Company)

Finance (Investment Banking / Equity Research)

Enterprise Environments are intentionally calibrated to stress-test state-of-the-art agents

Built for frontier model evaluation and training:

Empirical difficulty tiers measured against current frontier models
A Frontier tier where today's leading models score below 40%
Tasks that require agents to plan, use domain tools, and safely update system state, not just retrieve answers

If your agent succeeds here, it performs in production.

Why the Snorkel Data Series

High-volume quarterly drops

Multi-layer quality pipeline

Unified execution environment

Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

Expert review

Subject-matter experts review every task for oracle correctness, prompt clarity, environment correctness, and tagging.

Programmatic checks

Automated validation ensures every task manifest is complete and not a trivial variation of an existing task.
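
As an illustration of what such checks could look like, here is a minimal sketch under stated assumptions. The required field names, the 0.9 similarity threshold, and the use of `difflib` string similarity as a stand-in for "trivial variation" are all hypothetical; the actual pipeline is not described on this page.

```python
from difflib import SequenceMatcher

# Hypothetical manifest schema, assumed for this example only.
REQUIRED_FIELDS = ("task_id", "prompt", "tools", "expected_state")


def missing_fields(manifest: dict) -> list[str]:
    """Return required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not manifest.get(name)]


def is_near_duplicate(prompt: str, existing_prompts: list[str],
                      threshold: float = 0.9) -> bool:
    """Flag prompts whose text is almost identical to an existing one."""
    return any(
        SequenceMatcher(None, prompt.lower(), other.lower()).ratio() >= threshold
        for other in existing_prompts
    )


def validate(manifest: dict, existing_prompts: list[str]) -> list[str]:
    """Collect every problem found; an empty list means the task passes."""
    problems = [f"missing field: {name}" for name in missing_fields(manifest)]
    prompt = manifest.get("prompt")
    if prompt and is_near_duplicate(prompt, existing_prompts):
        problems.append("near-duplicate of an existing task")
    return problems
```

Returning a list of problems rather than a single boolean lets a reviewer see every failure at once.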

Distribution guardrails

New submissions are accepted only if they maintain dataset-level balance across tool domains, difficulty levels, and categories.
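
One way to picture a balance check is shown below. It is a sketch along a single dimension, assuming a simple rule that no label may exceed a maximum share of the dataset. The rule and the 0.5 default are illustrative assumptions, not Snorkel's acceptance criteria.

```python
from collections import Counter


def keeps_balance(existing_labels: list[str], candidate_label: str,
                  max_share: float = 0.5) -> bool:
    """Accept a candidate only if its label stays within `max_share`
    of the dataset after the candidate is added (hypothetical rule)."""
    counts = Counter(existing_labels)
    counts[candidate_label] += 1
    total = sum(counts.values())
    return counts[candidate_label] / total <= max_share
```

A fuller version would apply the same test to each dimension named above (tool domain, difficulty level, and category) and accept a task only when all of them pass.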

Judge stability

LLM-as-judge scoring uses fixed parameters and is benchmarked against human annotators to control variance.
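
Agreement between a judge and human annotators is often summarized with Cohen's kappa, which corrects raw agreement for chance. The sketch below computes it for two label lists. Using kappa here is an assumption made for illustration; the page does not say which agreement statistic Snorkel uses.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two sets of labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labeled independently at their
    # own marginal rates.
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value near 1 indicates the judge tracks human labels closely, while a value near 0 indicates agreement no better than chance.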

Train enterprise agents that can plan, use tools, and safely update system state with the Snorkel Data Series
