A deep dive into how we design and build agentic environments for training and evaluating LLMs at Snorkel AI

Modern large language models (LLMs) don’t just learn from static datasets; they also learn by interacting with environments that provide tools, state, and structured feedback. In this post, we unpack what makes a high-quality reinforcement learning (RL) environment for LLMs, highlight recent benchmark advances (Meta ARE, OpenAI GDPval), and show how we build realistic, enterprise-grade environments at Snorkel AI.

What is an “RL environment” for LLMs?

An RL environment is a sandbox that defines the space in which an agent works. It has four components:

  1. Context: The available resources and constraints around the agent.
  2. Actions / tools: APIs, commands, or functions the agent can invoke.
  3. Observations / state: What the agent sees after each action.
  4. Evaluation signals: Verifiers, rubrics, or outcome checks that tell us how well it did.

Combined with the tasks that make up an evaluation or benchmark, a well-formed environment encodes rules, goals, constraints, and inputs/outputs; provides structured signals such as verifiers or rubrics; and is designed to match real tasks, tools, and success criteria.
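
To make these four components concrete, here is a minimal sketch of what such an environment interface can look like in code. The class and method names are illustrative assumptions, not an existing Snorkel API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Observation:
    """What the agent sees after each action: the tool output plus a snapshot of state."""
    tool_output: str
    state_snapshot: Dict[str, Any] = field(default_factory=dict)

@dataclass
class AgentEnvironment:
    """Illustrative RL environment with the four components: context, actions/tools,
    observations/state, and evaluation signals."""
    context: Dict[str, Any]                      # resources and constraints around the agent
    tools: Dict[str, Callable[..., str]]         # the callable action space (APIs, commands, functions)
    verifier: Callable[[Dict[str, Any]], float]  # outcome check returning a reward in [0, 1]
    state: Dict[str, Any] = field(default_factory=dict)

    def step(self, tool_name: str, **kwargs: Any) -> Observation:
        """Invoke a tool, update internal state, and return what the agent observes."""
        output = self.tools[tool_name](**kwargs)
        self.state[f"last_{tool_name}"] = output
        return Observation(tool_output=output, state_snapshot=dict(self.state))

    def evaluate(self) -> float:
        """Score the episode with the environment's verifier."""
        return self.verifier(self.state)
```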

The hard part is relevance and reliability at scale—goals are ambiguous, APIs drift, human validation is expensive, and teams often reinvent the structure per use case. We address this with expert-designed tasks, automated quality control, and fine-grained reward design.

Training vs. evaluation: two modes, one environment

We run the same environments in two modes:

  • Simulated Agentic Evaluation. Stress-test agents against expert-defined tasks via APIs or hosted environments; use automatic verifiers or expert graders for traces. Think “flight simulator” before real users. 
  • RL Gyms. Deploy the same tasks and reward functions inside your training loop (in-VPC if needed), with standardized verifiers and reward design.
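
As an illustration of running one environment in two modes, the same task suite and reward design can be wired into either an evaluation harness or a training loop. The configuration below is a hypothetical sketch; the field names and values (e.g., crm_support_v2) are assumptions, not Snorkel’s actual interface.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class EnvironmentRunConfig:
    """One environment definition, two run modes (field names are hypothetical)."""
    task_suite: str                     # an expert-defined task collection
    mode: Literal["eval", "rl_gym"]     # simulated agentic evaluation vs. in-training-loop gym
    verifier: str = "deterministic"     # the same verifiers and reward design in both modes
    grader: Optional[str] = None        # optional expert or LLM grader for traces in eval mode

# "Flight simulator" run: stress-test the agent before real users.
eval_run = EnvironmentRunConfig(task_suite="crm_support_v2", mode="eval", grader="expert")

# Identical tasks and rewards reused inside the RL training loop.
train_run = EnvironmentRunConfig(task_suite="crm_support_v2", mode="rl_gym")
```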

What “good” looks like

We believe strong RL environments share these properties:

  1. Faithful tool surface. The agent’s actions map to the tools in the production use case. For web tasks, this means full websites with validators; for terminals, real CLIs and unit tests. WebArena, for example, hosts functional, self-contained sites across e-commerce, forums, coding, and CMS, plus validators to check task completion. 
  2. Deterministic, verifiable rewards. Explicit tests that programmatically assert success (unit tests, state checks) beat vague “looks good” judgments; see the verifier sketch after this list.
  3. Process + outcome signals. Capture the action/trace and the final output; grade both with rubrics/LLM-as-judge or automated verifiers. 
  4. Distribution control. You need knobs to vary difficulty and diversity without breaking reproducibility. Our SnorkelSpatial benchmark exemplifies this in a simplified 2D environment, with variables for the number of particles, the number of movements, and the size of the game board.
  5. Dynamics & rules. The logic of how the environment changes with each action. This could be as simple as moving to the next question or as complex as updating a game state or database. Environments often maintain an internal state (e.g., the state of a virtual world or a user session).
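
To illustrate property 2, a deterministic verifier can be nothing more than a handful of programmatic assertions over the final environment state. The refund scenario and field names below are hypothetical.

```python
def verify_refund_task(final_state: dict) -> float:
    """Deterministic outcome verifier: assert concrete facts about the final state
    rather than asking a judge whether the transcript 'looks good'."""
    checks = [
        final_state.get("ticket_status") == "resolved",      # ticket was closed out
        final_state.get("refund_amount") == 41.99,           # correct amount issued
        final_state.get("refund_currency") == "USD",         # correct currency
        final_state.get("unauthorized_tool_calls", 0) == 0,  # no out-of-policy actions
    ]
    # Partial credit: fraction of checks passed; 1.0 only if every assertion holds.
    return sum(checks) / len(checks)
```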

Examples of RL environments

Terminal / OS

Evaluate end-to-end system reasoning with real shells, package managers, filesystems, compilers, and tests.

  • Terminal-Bench is the most active public effort to do this at scale; it stresses long-horizon CLI tasks and publishes harnesses and Docker images. See our blog post about Snorkel’s contributions to Terminal-Bench.
  • AgentBench’s OS track complements this with targeted Bash tasks and reproducible Ubuntu containers. 

Browser / web

Full-stack web interaction with long-horizon navigation and grounded goals.

  • WebArena ships reproducible websites and programmatic validators (GPT-4 agents achieve ~14% success vs. ~78% human).
  • WebShop focuses on shopping flows with 1.18M real products and >12K human goals.
  • WALT (Web Agents that Learn Tools) reverse-engineers websites to abstract the human-focused UX into tool calls.

General assistant / multi-tool

Tasks that require combining search, browsing, tool orchestration, and multi-turn dialog.

  • GAIA is an evaluation of whether “your assistant actually helps”—humans ~92%, GPT-4+plugins ~15%. 
  • MINT stresses multi-turn tool use and language feedback with a Python tool-calling harness.

Tool mastery & safety

Focus on API/tool use, failure robustness, and red-teaming.

  • ToolLLM introduces ToolBench, a large-scale API tool-use dataset and train/eval framework spanning 16K+ real APIs.
  • ToolEmu emulates tools and measures safety risks across 36 high-stakes tools and 144 cases; a significant fraction of the failures it surfaces hold up under real-world validation, which makes it useful for red-teaming agents before production.

Synthetic data via verifiers

Generated reasoning tasks paired with verifiers to scale across domains.

  • Loong (CAMEL-AI) synthesizes chain-of-thought trajectories with plug-in verifiers.

Dynamic / asynchronous environments

Environments that evolve independently of the agent’s actions, introducing temporal events, noise, and state drift.

  • Meta ARE is a platform for scalable agent environments with asynchronous dynamics and evolving world state.
  • Gaia2, built on ARE, tests agent adaptability, ambiguity handling, collaboration, and temporal constraints in dynamic settings.

Economically valued / deliverable-centric evaluations

Benchmarks built around real-world knowledge work, artifacts, and domain context.

  • GDPval, introduced by OpenAI, assesses performance on 1,320 tasks across 44 professions, grading AI outputs against expert deliverables. While GDPval today is largely one-shot (non-interactive), it shifts the evaluation emphasis toward economic value, artifact correctness, and domain realism.

How we build environments at Snorkel

Our environments are expert-designed with comprehensive reward feedback. We compose scenarios, tools, rules, and feedback channels, then validate with domain experts and automated quality control (QC) to ensure realism and repeatability. Environments ship with gold traces, rubrics, and verifiers so you can use them for both evaluation and RL. 

Step 1: Collecting requirements with domain experts

Every environment begins with a product requirements document (PRD) drafted by a domain expert. For example, an insurance underwriter, financial analyst, or CRM support specialist writes out the actual workflow: what tools are used, what data sources are accessed, what steps are typically required, and how success is judged.
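
One lightweight way to capture such a PRD in machine-readable form is a structured spec like the sketch below. The workflow, tools, and criteria shown are hypothetical examples for an underwriting use case, not an actual customer PRD.

```python
underwriting_prd = {
    "persona": "commercial insurance underwriter",
    "workflow": [
        "pull the applicant record from the policy database",
        "check underwriting guidelines for the line of business",
        "consult the appetite matrix for the risk class",
        "draft an accept/decline recommendation with rationale",
    ],
    "tools": ["sql_query", "guideline_search", "appetite_matrix_lookup"],
    "data_sources": ["policy_db", "underwriting_guidelines", "appetite_matrix"],
    "success_criteria": [
        "recommendation matches the gold decision",
        "rationale cites the relevant guideline sections",
    ],
}
```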

Step 2: Tooling and data design by technical experts

Next, coding and data engineers translate the domain PRD into a set of executable tools, APIs, and data stubs. If the workflow involves SQL queries, underwriting guidelines, or CRM records, we build representative databases, services, and schemas. These tools become the agent’s “action space,” while synthetic or anonymized datasets simulate the live systems. Keys to success:

  • Embed deliverables & context. Tasks should carry file contexts, history, and domain artifacts, not bare prompts (e.g., Terminal-Bench and GDPval).
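
For instance, a CRM support workflow might expose an action space along the lines of the stubs below, backed by a synthetic database. The tool names, schema, and synthetic_crm.db file are illustrative assumptions, not a real Snorkel or customer API.

```python
import sqlite3
from typing import Dict, List, Tuple

def run_sql(query: str, params: Tuple = (), db_path: str = "synthetic_crm.db") -> List[Tuple]:
    """Executable stand-in for the production database, backed by synthetic records."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query, params).fetchall()

def lookup_customer(customer_id: str) -> Dict:
    """Stub CRM lookup the agent can call as a tool."""
    rows = run_sql("SELECT id, tier, open_tickets FROM customers WHERE id = ?", (customer_id,))
    if not rows:
        return {"error": f"no customer with id {customer_id}"}
    cid, tier, open_tickets = rows[0]
    return {"id": cid, "tier": tier, "open_tickets": open_tickets}

# The agent's action space is simply the mapping from tool names to callables.
ACTION_SPACE = {"run_sql": run_sql, "lookup_customer": lookup_customer}
```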

Step 3: Reward and rubric development

In parallel, grading criteria are formalized through process-based checks (Did the agent use the right tools in the right order?) and outcome-based checks (Was the final answer correct, safe, and aligned?). Rewards are implemented as deterministic verifiers, unit tests, or fine-grained rubrics. This ensures every environment run yields structured feedback signals: not just binary pass/fail, but rich traces on reasoning, tone, and alignment. Keys to success:

  • Instrument for both “how” and “what.” Score the trace (Was the tool sequence safe/efficient?) and the outcome (Is the answer correct?), not just one or the other; see the sketch after this list.
  • Prefer verifiable endpoints. Unit tests, state asserts, DB diffs, or form validators are less subjective and easier to scale than open-ended prose grading. This is how Terminal-Bench and WebArena keep evaluation honest. 
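
A combined process-and-outcome reward can look like the sketch below, which scores the tool-call trace against an expected sequence and the final answer against a verifiable endpoint. The weights, field names, and expected tool order are illustrative assumptions.

```python
from typing import Dict, List

def score_run(trace: List[Dict], final_state: Dict, expected_total: float) -> Dict:
    """Grade both the 'how' (the trace) and the 'what' (the outcome), not just one."""
    # Process signal: the agent called tools in a safe, expected order.
    called = [step["tool"] for step in trace]
    expected_prefix = ["lookup_customer", "run_sql"]
    process_ok = called[: len(expected_prefix)] == expected_prefix

    # Outcome signal: a verifiable endpoint (state assert / DB-diff style check).
    outcome_ok = abs(final_state.get("quoted_total", 0.0) - expected_total) < 1e-6

    # Fine-grained reward rather than a single pass/fail bit.
    return {
        "process_score": 1.0 if process_ok else 0.0,
        "outcome_score": 1.0 if outcome_ok else 0.0,
        "reward": 0.3 * process_ok + 0.7 * outcome_ok,
    }
```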

Step 4: Scenario generation and templatization

We build templated scenarios with controllable variation knobs (difficulty, tool availability, data parameters). This lets us scale environments to thousands of distinct but structurally consistent tasks, reducing brittleness while ensuring diversity. Automation handles sampling and QC, while experts review difficult edge cases. Keys to success:

  • Vary distribution, not definitions. Keep task schemas stable while sampling tools, constraints, and difficulty to avoid overfitting. 
  • Plan for drift. APIs and web UIs change; use self-hosted tools with controlled data when possible, and leverage health checks.
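
In practice, templatization can be as simple as a seeded sampler over a fixed task schema. The sketch below shows hypothetical knobs for difficulty, tool availability, and data parameters; the template and values are assumptions, not one of our production generators.

```python
import random
from typing import Dict, List

SCENARIO_TEMPLATE = {
    "task": "Reconcile {n_invoices} invoices against the ledger and flag discrepancies.",
    "tools": ["run_sql", "lookup_customer", "export_report"],
}

def sample_scenarios(n: int, seed: int = 0) -> List[Dict]:
    """Vary distribution, not definitions: the schema stays fixed while the knobs vary."""
    rng = random.Random(seed)  # seeded so the same seed reproduces the same task set
    scenarios = []
    for i in range(n):
        difficulty = rng.choice(["easy", "medium", "hard"])
        n_invoices = {"easy": 5, "medium": 25, "hard": 100}[difficulty]
        # Occasionally withhold a tool to test robustness without changing the task schema.
        tools = [t for t in SCENARIO_TEMPLATE["tools"] if rng.random() > 0.2]
        scenarios.append({
            "id": f"recon-{i:04d}",
            "difficulty": difficulty,
            "prompt": SCENARIO_TEMPLATE["task"].format(n_invoices=n_invoices),
            "tools": tools,
        })
    return scenarios
```

Because the sampler is seeded, the same seed reproduces the same task set, so variation never comes at the cost of reproducibility.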

Step 5: Review and validation

Environments and their associated traces undergo dual review:

  • Domain experts validate realism and coverage (Does this reflect how the task is done in the real world?).
  • Technical experts validate reward quality and catch environment hacking or brittle shortcuts.

Snorkel’s automated QC platform supplements this by checking for reproducibility, distribution, and coverage.
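
Two of the simplest automated checks of this kind, sketched below, are a seeded replay test for reproducibility and a coverage test over difficulty buckets; the function signatures assume a seeded scenario generator like the Step 4 sketch.

```python
from typing import Callable, Dict, List

def check_reproducibility(generate: Callable[[int], List[Dict]], seed: int = 7) -> bool:
    """Replay generation with the same seed and require identical task definitions."""
    return generate(seed) == generate(seed)

def check_difficulty_coverage(scenarios: List[Dict]) -> bool:
    """Require every difficulty bucket to appear so the sampled distribution isn't degenerate."""
    return {s["difficulty"] for s in scenarios} >= {"easy", "medium", "hard"}

# Example usage with a seeded generator such as the Step 4 sampler:
# check_reproducibility(lambda s: sample_scenarios(200, seed=s))
```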

Step 6: Gold traces

For each environment, annotators or subject matter experts generate gold traces. These traces serve as ground truth for evaluation and as high-quality training data for RL or supervised fine-tuning (SFT). For example, in insurance underwriting, gold traces capture 3–7 steps across SQL, guidelines, and appetite matrices.
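
Gold traces are stored as structured records that pair the step-by-step tool calls with the expected final deliverable. The schema below is a hypothetical illustration of the underwriting example, not our production trace format.

```python
gold_trace = {
    "task_id": "underwrite-0042",
    "steps": [
        {"tool": "run_sql", "args": {"query": "SELECT * FROM applicants WHERE id = 'A-913'"}},
        {"tool": "guideline_search", "args": {"query": "commercial property, frame construction"}},
        {"tool": "appetite_matrix_lookup", "args": {"line": "commercial property", "risk_class": "frame"}},
    ],
    "final_answer": {
        "decision": "decline",
        "rationale": "Outside appetite for frame construction above the stated limit, per the guidelines.",
    },
    "uses": ["evaluation ground truth", "SFT / RL training data"],
}
```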

Summary and next steps

Environments provide the critical testing and training ground for agentic systems as the tasks they tackle continue to grow in complexity. The process outlined above is how the Snorkel team builds RL environments and partners with domain experts to develop our agentic benchmarks. Examples of this work include our SnorkelUnderwrite and Finance Reasoning public benchmarks and our contributions to Terminal-Bench 2.0.

Looking ahead, agent capabilities are advancing quickly, and we’re excited to push the dynamism of environments further. Our recent paper on automating benchmark design demonstrates how we enable RL environments to keep raising the bar as agents improve. If you’re building an agentic system and have questions about developing environments, come talk to us.

References & further reading