Snorkeling in RL environments

A deep dive into how we design and build agentic environments for training and evaluating LLMs at Snorkel AI
Modern large language models (LLMs) don’t just learn from static datasets; they learn by interacting with environments that provide tools, state, and structured feedback. In this post, we unpack what makes a high-quality reinforcement learning (RL) environment for LLMs, highlight recent benchmark advances (Meta ARE, OpenAI GDPval), and show how we build realistic, enterprise-grade environments at Snorkel AI.
What is an “RL environment” for LLMs?
An RL environment is a sandbox that defines the space an agent works in:
- Context: The available resources and constraints around the agent.
- Actions / tools: APIs, commands, or functions the agent can invoke.
- Observations / state: What the agent sees after each action.
- Evaluation signals: Verifiers, rubrics, or outcome checks that tell us how well it did.

Combined with the tasks that compose the evaluations or benchmarks, well-formed environments encode rules, goals, constraints, inputs/outputs, and provide structured signals like verifiers or rubrics; they’re designed to match real tasks, tools, and success criteria.
The hard part is relevance and reliability at scale—goals are ambiguous, APIs drift, human validation is expensive, and teams often reinvent the structure per use case. We address this with expert-designed tasks, automated quality control, and fine-grained reward design.
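To make these pieces concrete, here is a minimal sketch of what that interface can look like. The names and fields are illustrative (not Snorkel’s actual API): tools form the action space, observations come back after each step, and a verifier emits a structured evaluation signal at the end of an episode.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Observation:
    """What the agent sees after each action: tool output plus a state summary."""
    tool_output: str
    state_summary: dict[str, Any] = field(default_factory=dict)

@dataclass
class EvalSignal:
    """Structured feedback: a verifier verdict plus optional rubric scores."""
    passed: bool
    rubric_scores: dict[str, float] = field(default_factory=dict)

class AgentEnvironment:
    """Illustrative RL environment: context, tools, state, and evaluation signals."""

    def __init__(self, context: dict[str, Any], tools: dict[str, Callable[..., str]],
                 verifier: Callable[[dict[str, Any]], EvalSignal]):
        self.context = context          # resources and constraints around the agent
        self.tools = tools              # actions the agent can invoke by name
        self.verifier = verifier        # outcome check run at the end of an episode
        self.state: dict[str, Any] = {}

    def reset(self) -> Observation:
        self.state = {"history": []}
        return Observation(tool_output="", state_summary={"context": self.context})

    def step(self, tool_name: str, **kwargs: Any) -> Observation:
        output = self.tools[tool_name](**kwargs)           # invoke the chosen tool
        self.state["history"].append((tool_name, kwargs))  # dynamics: track the trace
        return Observation(tool_output=output,
                           state_summary={"steps": len(self.state["history"])})

    def evaluate(self) -> EvalSignal:
        return self.verifier(self.state)                   # structured reward signal
```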
Training vs. evaluation: two modes, one environment
We run the same environments in two modes:
- Simulated Agentic Evaluation. Stress-test agents against expert-defined tasks via APIs or hosted environments; use automatic verifiers or expert graders for traces. Think “flight simulator” before real users.
- RL Gyms. Deploy the same tasks and reward functions inside your training loop (in-VPC if needed), with standardized verifiers and reward design.
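Reusing the sketch above, here is a hypothetical illustration of “two modes, one environment”: the same tasks and verifier produce an evaluation report in one mode and a scalar reward inside a training loop in the other (function names and weights are made up for the example).

```python
# Same environment and verifier, consumed two ways (illustrative only).

def run_evaluation(env, agent, tasks):
    """Simulated agentic evaluation: collect traces and verifier verdicts for reporting."""
    results = []
    for task in tasks:
        obs = env.reset()
        trace = agent.solve(task, env, obs)   # agent drives env.step(...) internally
        results.append({"task": task, "trace": trace, "signal": env.evaluate()})
    return results

def reward_fn(env) -> float:
    """RL gym: collapse the same structured signal into a scalar reward for training."""
    signal = env.evaluate()
    rubric_bonus = sum(signal.rubric_scores.values())
    return float(signal.passed) + 0.1 * rubric_bonus
```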
What “good” looks like
We believe strong RL environments share these properties:
- Faithful tool surface. The agent’s actions map to the tools in the production use case. For web tasks, this means full websites with validators; for terminals, real CLIs and unit tests. WebArena, for example, hosts functional, self-contained sites across e-commerce, forums, coding, and CMS, plus validators to check task completion.
- Deterministic, verifiable rewards. Explicit tests that programmatically assert success (unit tests, state checks) beat vague “looks good” judgments; see the sketch after this list.
- Process + outcome signals. Capture the action/trace and the final output; grade both with rubrics/LLM-as-judge or automated verifiers.
- Distribution control. You need knobs to vary difficulty and diversity without breaking reproducibility. Our SnorkelSpatial benchmark exemplifies this in a simplified 2D environment, with variables for the number of particles, the number of movements, and the size of the game board.
- Dynamics & rules. The logic of how the environment changes with each action. This could be as simple as moving to a next question or as complex as updating a game state or database. Environments often maintain an internal state (e.g., the state of a virtual world or a user session).
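As a hypothetical illustration of the “deterministic, verifiable rewards” and “process + outcome signals” items above: an outcome verifier can assert on final state programmatically, while a separate check scores the tool trace, and the two combine into one reward. The state keys, tool names, and weights below are invented for the example.

```python
def outcome_verifier(state: dict) -> bool:
    """Outcome check: assert the final record matches the expected deliverable."""
    record = state.get("db", {}).get("ticket_123", {})
    return record.get("status") == "resolved" and record.get("refund_issued") is True

def process_score(history: list[tuple[str, dict]]) -> float:
    """Process check: reward safe, efficient tool sequences (no destructive calls, few steps)."""
    tool_names = [name for name, _ in history]
    if "delete_account" in tool_names:          # unsafe action observed in the trace
        return 0.0
    return max(0.0, 1.0 - 0.05 * max(0, len(tool_names) - 5))  # small penalty past 5 steps

def reward(state: dict) -> float:
    """Combine outcome (what) and process (how) into one structured reward."""
    return 0.7 * float(outcome_verifier(state)) + 0.3 * process_score(state.get("history", []))
```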
Examples of RL environments
Terminal / OS
Evaluate end-to-end system reasoning with real shells, package managers, filesystems, compilers, and tests.
- Terminal-Bench is the most active public effort to do this at scale; it stresses long-horizon CLI tasks and publishes harnesses and Docker images. See our blog post about Snorkel’s contributions to Terminal-Bench.
- AgentBench’s OS track complements this with targeted Bash tasks and reproducible Ubuntu containers.
Browser / web
Full-stack web interaction with long-horizon navigation and grounded goals.
- WebArena ships reproducible websites and programmatic validators (GPT-4 agents achieve ~14% success vs. ~78% human).
- WebShop focuses on shopping flows with 1.18M real products and >12K human goals.
- WALT (Web Agents that Learn Tools) reverse-engineers websites to abstract the human-focused UX into tool calls.
General assistant / multi-tool
Tasks that require combining search, browsing, tool orchestration, and multi-turn dialog.
- GAIA is an evaluation of whether “your assistant actually helps”—humans ~92%, GPT-4+plugins ~15%.
- MINT stresses multi-turn tool use and language feedback with a Python tool-calling harness.
Tool mastery & safety
Focus on API/tool use, failure robustness, and red-teaming.
- ToolLLM introduces ToolBench, a large-scale API tool-use dataset and train/eval framework covering 16K+ real APIs.
- ToolEmu emulates tools and measures safety risks across 36 high-stakes tools and 144 cases; because it surfaces a significant fraction of real-world-valid failures, it is useful for red-teaming agents before production.
Synthetic data via verifiers
Generated reasoning tasks paired with verifiers to scale across domains.
- Loong (CAMEL-AI) synthesizes chain-of-thought trajectories with plug-in verifiers.
Dynamic / asynchronous environments
Environments that evolve independently of the agent’s actions, introducing temporal events, noise, and state drift.
- Meta ARE is a platform for scalable agent environments with asynchronous dynamics and evolving world state.
- Gaia2, built on ARE, tests agent adaptability, ambiguity handling, collaboration, and temporal constraints in dynamic settings.
Economically valued / deliverable-centric evaluations
Benchmarks built around real-world knowledge work, artifacts, and domain context.
- GDPval, introduced by OpenAI, assesses performance on 1,320 tasks across 44 professions, grading AI outputs against expert deliverables. While GDPval today is largely one-shot (non-interactive), it shifts the evaluation emphasis toward economic value, artifact correctness, and domain realism.
How we build environments at Snorkel
Our environments are expert-designed with comprehensive reward feedback. We compose scenarios, tools, rules, and feedback channels, then validate with domain experts and automated quality control (QC) to ensure realism and repeatability. Environments ship with gold traces, rubrics, and verifiers so you can use them for both evaluation and RL.
Step 1: Collecting requirements with domain experts
Every environment begins with a product requirements document (PRD) drafted by a domain expert. For example, an insurance underwriter, financial analyst, or CRM support specialist writes out the actual workflow: what tools are used, what data sources are accessed, what steps are typically required, and how success is judged.
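One way (purely illustrative, not our actual PRD format) to capture such a workflow in a structured form that later steps can build tools and verifiers against:

```python
from dataclasses import dataclass

@dataclass
class EnvironmentPRD:
    """Hypothetical structured requirements doc written with a domain expert."""
    domain: str                      # e.g., "insurance underwriting"
    tools: list[str]                 # systems the expert actually uses
    data_sources: list[str]          # databases, documents, guidelines
    typical_steps: list[str]         # the workflow in the expert's own words
    success_criteria: list[str]      # how the expert judges a good outcome

underwriting_prd = EnvironmentPRD(
    domain="insurance underwriting",
    tools=["sql_query", "guideline_lookup", "appetite_matrix"],
    data_sources=["policy_db", "underwriting_guidelines.pdf"],
    typical_steps=["pull applicant history", "check guidelines", "score against appetite"],
    success_criteria=["correct risk tier", "cited guideline sections"],
)
```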
Step 2: Tooling and data design by technical experts
Next, coding and data engineers translate the domain PRD into a set of executable tools, APIs, and data stubs. If the workflow involves SQL queries, underwriting guidelines, or CRM records, we build representative databases, services, and schemas. These tools become the agent’s “action space,” while synthetic or anonymized datasets simulate the live systems. Keys to success:
- Embed deliverables & context. Tasks should carry file contexts, history, and domain artifacts, not bare prompts (e.g., Terminal-Bench and GDPval).
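A hedged sketch of what one such tool stub might look like: a CRM lookup backed by a representative in-memory table rather than the live system. The table schema and record values are hypothetical.

```python
import sqlite3

def build_crm_stub() -> sqlite3.Connection:
    """Create an in-memory CRM table seeded with synthetic records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, plan TEXT, open_tickets INTEGER)"
    )
    conn.execute("INSERT INTO customers VALUES ('c-001', 'Acme Corp', 'enterprise', 2)")
    conn.commit()
    return conn

def lookup_customer(conn: sqlite3.Connection, customer_id: str) -> dict:
    """Tool exposed to the agent: read-only lookup against the stubbed CRM."""
    row = conn.execute(
        "SELECT id, name, plan, open_tickets FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return {"error": f"no customer with id {customer_id}"}
    return {"id": row[0], "name": row[1], "plan": row[2], "open_tickets": row[3]}
```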
Step 3: Reward and rubric development
In parallel, grading criteria are formalized through process-based checks (Did the agent use the right tools in the right order?) and outcome-based checks (Was the final answer correct, safe, and aligned?). Rewards are implemented as deterministic verifiers, unit tests, or fine-grained rubrics. This ensures every environment run yields structured feedback signals: not just binary pass/fail, but rich traces on reasoning, tone, and alignment. Keys to success:
- Instrument for both “how” and “what.” Score the trace (Was the tool sequence safe/efficient?) and the outcome (Is the answer correct?), not just one or the other.
- Prefer verifiable endpoints. Unit tests, state asserts, DB diffs, or form validators are less subjective and easier to scale than open-ended prose grading. This is how Terminal-Bench and WebArena keep evaluation honest.
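A hypothetical sketch of how such a grading result could be structured so a single run yields more than a binary verdict: a deterministic state assert for the outcome, plus rubric dimensions filled in by an expert or LLM judge (field names and the expected values are illustrative).

```python
from dataclasses import dataclass, field

@dataclass
class GradingResult:
    """Structured feedback for one environment run, not just pass/fail."""
    outcome_passed: bool                                     # deterministic verifier verdict
    process_notes: list[str] = field(default_factory=list)
    rubric: dict[str, float] = field(default_factory=dict)   # e.g., reasoning, tone, alignment

def grade_run(final_db_state: dict, trace: list[str],
              judge_scores: dict[str, float]) -> GradingResult:
    # Outcome: a state assert against the expected record (a DB diff would work similarly).
    passed = final_db_state.get("quote", {}).get("risk_tier") == "B"
    # Process: cheap deterministic checks on the trace; richer checks can come from a judge.
    notes = []
    if "guideline_lookup" not in trace:
        notes.append("agent never consulted the underwriting guidelines")
    return GradingResult(outcome_passed=passed, process_notes=notes, rubric=judge_scores)
```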
Step 4: Scenario generation and templatization
We build templated scenarios with controllable variation knobs (difficulty, tool availability, data parameters). This lets us scale environments to thousands of distinct but structurally consistent tasks, reducing brittleness while ensuring diversity. Automation handles sampling and QC, while experts review difficult edge cases. Keys to success:
- Vary distribution, not definitions. Keep task schemas stable while sampling tools, constraints, and difficulty to avoid overfitting.
- Plan for drift. APIs and web UIs change; use self-hosted tools with controlled data when possible, and leverage health checks.
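An illustrative sketch of “vary distribution, not definitions”: the task schema stays fixed while a seeded sampler varies the knobs, so thousands of distinct tasks remain reproducible. The knob names and ranges are made up for the example.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioTemplate:
    """Stable task schema; only the knob values change between sampled scenarios."""
    task_type: str
    difficulty: str
    num_records: int
    tools_enabled: tuple[str, ...]

def sample_scenarios(n: int, seed: int) -> list[ScenarioTemplate]:
    rng = random.Random(seed)  # fixed seed -> identical scenario set on every run
    difficulties = ["easy", "medium", "hard"]
    all_tools = ["sql_query", "guideline_lookup", "appetite_matrix", "email_draft"]
    scenarios = []
    for _ in range(n):
        k = rng.randint(2, len(all_tools))
        scenarios.append(ScenarioTemplate(
            task_type="underwriting_review",
            difficulty=rng.choice(difficulties),
            num_records=rng.randint(10, 500),
            tools_enabled=tuple(rng.sample(all_tools, k)),
        ))
    return scenarios

assert sample_scenarios(100, seed=7) == sample_scenarios(100, seed=7)  # reproducibility check
```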
Step 5: Review and validation
Environments and their associated traces undergo dual review:
- Domain experts validate realism and coverage (Does this reflect how the task is done in the real world?).
- Technical experts validate reward quality and catch environment hacking or brittle shortcuts.
Snorkel’s automated QC platform supplements this by checking for reproducibility, distribution, and coverage.
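As a hedged sketch of the kind of automated check this can involve (not Snorkel’s actual QC code), reusing the ScenarioTemplate sampler above: summarize the difficulty distribution and tool coverage, and flag gaps before the environment ships.

```python
from collections import Counter

def qc_report(scenarios) -> dict:
    """Simple QC summary: difficulty distribution and tool coverage across scenarios."""
    difficulty_counts = Counter(s.difficulty for s in scenarios)
    tool_coverage = Counter(tool for s in scenarios for tool in s.tools_enabled)
    missing = [d for d in ("easy", "medium", "hard") if difficulty_counts[d] == 0]
    return {
        "difficulty_counts": dict(difficulty_counts),
        "tool_coverage": dict(tool_coverage),
        "coverage_gaps": missing,
    }
```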
Step 6: Gold traces
For each environment, annotators or subject matter experts generate gold traces. These traces serve as ground truth for evaluation and as high-quality training data for RL or supervised fine-tuning (SFT). For example, in insurance underwriting, gold traces capture 3–7 steps across SQL, guidelines, and appetite matrices.
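A hypothetical example of how a gold trace might be represented, following the underwriting example: an ordered list of tool calls with arguments plus the expected final deliverable (all field names and values are invented for illustration).

```python
# Illustrative gold trace for an underwriting task (fields and values are made up).
gold_trace = {
    "task_id": "underwriting-0042",
    "steps": [
        {"tool": "sql_query", "args": {"query": "SELECT * FROM applicants WHERE id = 'a-17'"}},
        {"tool": "guideline_lookup", "args": {"section": "commercial-property"}},
        {"tool": "appetite_matrix", "args": {"industry": "construction", "revenue_band": "10M-50M"}},
    ],
    "expected_outcome": {"risk_tier": "B", "cited_sections": ["commercial-property"]},
}
```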
Summary and next steps
Environments provide the critical testing and training ground for agentic systems that tackle tasks that continue to grow in complexity. The process outlined above is how the Snorkel team builds RL environments and partners with domain experts to develop our agentic benchmarks. Examples of this work are our SnorkelUnderwrite and Finance Reasoning public benchmarks, and our contributions to Terminal-Bench 2.0.
Looking ahead, agent capabilities are advancing quickly, and we’re excited to push the dynamism of environments further. Our recent paper on automating benchmark design demonstrates how we enable RL environments to continue to raise the bar as agents improve. If you’re building an agentic system and have questions about developing environments, come talk to us.
References & further reading
- The Terminal-Bench Team. (2025). Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. https://github.com/laude-institute/terminal-bench
- Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv preprint arXiv:2404.07972. https://arxiv.org/abs/2404.07972
- Liu, X., et al. (2024). AgentBench: Evaluating LLMs as Agents. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2308.03688
- Andrews, P., et al. (2025). ARE: Scaling Up Agent Environments and Evaluations. arXiv preprint arXiv:2509.17158. https://arxiv.org/abs/2509.17158
- Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854. https://arxiv.org/abs/2307.13854
- Yao, S., et al. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv preprint arXiv:2207.01206. https://arxiv.org/abs/2207.01206
- Koh, J. Y., et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv preprint arXiv:2401.13649. https://arxiv.org/abs/2401.13649
- Mialon, G., et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983. https://arxiv.org/abs/2311.12983
- Wang, X., et al. (2023). MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. arXiv preprint arXiv:2309.10691. https://arxiv.org/abs/2309.10691
- Qin, Y., et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789. https://arxiv.org/abs/2307.16789
- Ruan, Y., et al. (2024). Identifying the Risks of LM Agents with an LM-Emulated Sandbox. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2309.15817
- Liu, M., et al. (2024). APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets. arXiv preprint arXiv:2406.18518. https://arxiv.org/abs/2406.18518
- Huang, X., et al. (2025). Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers. arXiv preprint arXiv:2509.03059. https://arxiv.org/abs/2509.03059
- Cursor Team. (2025). Improving Cursor Tab with online RL. Cursor Research Blog. https://cursor.com/blog/tab-rl
- Fang, R., Cai, S., Li, B., Wu, J., Li, G., Yin, W., Wang, X., Wang, X., Su, L., Zhang, Z., Wu, S., Tao, Z., Jiang, Y., Xie, P., Huang, F., & Zhou, J. (2025). Towards General Agentic Intelligence via Environment Scaling. arXiv preprint arXiv:2509.13311. https://arxiv.org/abs/2509.13311
- OpenAI. (2025). Introducing GDPval: Measuring Model Performance on Economically Valuable Tasks. OpenAI Blog. https://openai.com/index/gdpval/
- Chakraborty, S., et al. (2025). Process Reward Models for LLM Agents: Practical Framework and Directions. arXiv preprint arXiv:2502.10325. https://arxiv.org/abs/2502.10325
Armin Parchami is the Director of Research Engineering at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.