Data-Centric Development of an Enterprise AI Agent with Snorkel
We just announced some exciting new products that round out our unified AI Data Development Platform, supporting specialized, expert data development and labeling to evaluate and tune agentic AI systems.
In this blog, we’ll briefly see how we can use these two new products, Snorkel Evaluate and Expert Data-as-a-Service, to evaluate and develop a specialized agentic AI system for an enterprise use case:
- First, we’ll build a benchmark dataset for our unique setting, defining what tasks the agent is expected to perform and how.
- Then, we’ll develop specialized evaluators to accurately grade our agent’s performance against custom metrics and expert judgment, defining how we expect the agent to perform in a highly specialized and aligned way.
- And finally, we’ll identify specific, fine-grained error modes, pinpointing where the agent has issues, and correct them, including through fine-tuning and reinforcement learning.
For this example, we’ll use an enterprise agent development process inspired by one of our Fortune 500 customers: building a specialized AI insurance underwriting agent. In this use case, we want our AI agent to answer questions about policy deductibles and coverages with the accuracy and trustworthiness of an expert underwriter. As we’ll see in the demo, even a mock-up of a real-world agentic scenario like this is challenging: out-of-the-box approaches struggle with baseline task completion, introduce hallucinations, and fail to provide trustworthy evaluations of system performance. This is where Snorkel’s AI Data Development Platform comes in.
Benchmark dataset curation
First: we can’t optimize what we can’t measure, so we start with evaluation. For that, we need a benchmark dataset of the kinds of prompts, correct responses, and actions that we want our AI agent to handle.
Here, we used Snorkel’s Expert Data-as-a-Service to build a representative expert dataset for our agentic insurance task that was:
- Extremely high quality, developed with qualified insurance underwriters sourced by Snorkel, accelerated and augmented by our programmatic quality control technology.
- Built to reflect real-world complexity, with 3-7 steps of reasoning or tool use on average and as many as 20 conversational turns between the AI agent and a user.
- Developed to be distributionally diverse, covering multiple specialized sub-task types.
- And built to be challenging, with a top frontier LLM performance of 71%, and optimal, efficient routes taken only 35% of the time.
We’ve open-sourced this dataset here, along with more details of its contents and construction.
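To make the shape of this data concrete, here is a minimal, hypothetical sketch of what a single record in an agentic benchmark like this might look like. The field names and values are illustrative assumptions, not the released dataset’s exact schema.

```python
# Hypothetical sketch of one benchmark record; fields are illustrative only
# and do not reflect the released dataset's exact schema.
example = {
    "task_type": "deductible_lookup",   # one of several specialized sub-task types
    "conversation": [                   # multi-turn exchange, up to ~20 turns
        {"role": "user", "content": "What is the wind deductible on policy HO-1234?"},
        {"role": "assistant", "tool_call": {"name": "lookup_policy",
                                            "args": {"policy_id": "HO-1234"}}},
        {"role": "tool", "name": "lookup_policy", "content": "{ ...policy document... }"},
        {"role": "assistant", "content": "The wind/hail deductible on HO-1234 is 2% of Coverage A."},
    ],
    "reference_answer": "2% of Coverage A (the dwelling limit)",
    "reference_tool_path": ["lookup_policy"],   # the optimal, efficient route
}
```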
Evaluator development
Next, we need to develop evaluators that automatically label or grade our agentic system’s performance against the underwriting agent tasks defined by our benchmark.
It’s obviously critical that these evaluators be accurate and aligned with our subject matter experts (SMEs), use case-specific objectives, and enterprise standards.
However, like many practitioners in real-world enterprise settings, we found that off-the-shelf LLM-as-a-judge approaches, i.e., using an LLM (GPT-4.1) with a basic prompt, failed to be sufficiently trustworthy, agreeing with experts only about 70-75% of the time on the tasks we developed. And falling back to manual review by experts would have been far too slow, requiring hundreds of hours of review per development cycle.
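For concreteness, here is a minimal sketch of the kind of basic LLM-as-a-judge baseline we mean, along with how agreement against expert labels can be measured. It assumes the OpenAI Python client; the prompt and helper functions are illustrative, not the exact setup we used.

```python
# Minimal sketch of a "basic prompt" LLM-as-a-judge and an agreement metric.
# Assumes the OpenAI Python client; prompt and helpers are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an insurance underwriting assistant.
Question: {question}
Agent response: {response}
Reference answer: {reference}
Reply with a single word: CORRECT or INCORRECT."""

def llm_judge(question: str, response: str, reference: str) -> str:
    # Basic single-prompt judge: the kind of baseline that only agreed with
    # experts ~70-75% of the time on our tasks.
    out = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, response=response, reference=reference),
        }],
    )
    text = out.choices[0].message.content.strip().upper()
    return "INCORRECT" if "INCORRECT" in text else "CORRECT"

def agreement(judge_labels: list[str], expert_labels: list[str]) -> float:
    # Fraction of examples where the judge's label matches the expert's.
    return sum(j == e for j, e in zip(judge_labels, expert_labels)) / len(expert_labels)
```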
We used advanced workflows in Snorkel Evaluate to do better, including:
- Prompt development and tuning—to better align our LLM-as-a-judge;
- Programmatic weak supervision—to leverage multiple prompts and labeling functions to develop even more powerful fine-tuned, distilled evaluators;
- And SME annotation workflows—to validate and align with our SMEs.
The result: we developed custom evaluators at 88-90% accuracy—enough to trust for our evaluations. We obtained similar results for custom evaluators on two other agent-based datasets—one in the financial domain that we will be releasing this summer (Fin-QA), the other an existing industry standard in the retail space (Tau-Bench).
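As a rough illustration of the programmatic weak supervision idea behind these workflows, the sketch below combines several noisy signals (simple heuristics plus an LLM-as-a-judge vote) using the open-source snorkel library’s LabelModel to produce denoised labels that could then train a distilled evaluator. The labeling functions and data fields are hypothetical, and this is not the internals of Snorkel Evaluate.

```python
# Weak supervision sketch with the open-source snorkel library; labeling
# function logic and data fields are hypothetical examples.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

CORRECT, INCORRECT, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_matches_reference(x):
    # Heuristic: the agent's response states the reference deductible value.
    return CORRECT if x.reference_value in x.response else ABSTAIN

@labeling_function()
def lf_hallucinated_policy_id(x):
    # Heuristic: the response cites a policy ID that was never retrieved.
    return INCORRECT if x.cited_policy_id not in x.retrieved_policy_ids else ABSTAIN

@labeling_function()
def lf_llm_judge(x):
    # Wraps an LLM-as-a-judge vote (e.g., the baseline sketched earlier).
    return CORRECT if x.judge_vote == "CORRECT" else INCORRECT

# Toy traces with the fields the labeling functions expect.
df = pd.DataFrame([
    {"response": "The wind deductible is 2% of Coverage A.", "reference_value": "2%",
     "cited_policy_id": "HO-1234", "retrieved_policy_ids": ["HO-1234"], "judge_vote": "CORRECT"},
    {"response": "Policy HO-9999 has a $500 deductible.", "reference_value": "2%",
     "cited_policy_id": "HO-9999", "retrieved_policy_ids": ["HO-1234"], "judge_vote": "INCORRECT"},
])

L = PandasLFApplier([lf_matches_reference, lf_hallucinated_policy_id, lf_llm_judge]).apply(df)

# Combine the noisy votes into one denoised label per example, which can then
# be used to fine-tune / distill a cheaper, faster evaluator.
label_model = LabelModel(cardinality=2)
label_model.fit(L, n_epochs=500, seed=123)
denoised_labels = label_model.predict(L)
```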
Error analysis and tuning
Finally, we use Snorkel Evaluate to programmatically tag critical subsets of the benchmark dataset (an operation we call slicing) in order to reveal fine-grained error modes that are actionable to correct. With this, we were quickly able to improve one of the leading LLMs on our benchmark, Claude 3.7 Sonnet, by 15 points, using prompting and agentic system calibration to move it from strong out-of-the-box performance toward production-deployable accuracy.
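As an illustration of what slicing looks like in practice, the sketch below tags benchmark results with a few hypothetical slice functions and reports accuracy per slice rather than as one aggregate number. The slice definitions and column names are assumptions, not the actual slices we used.

```python
# Illustrative "slicing" sketch: each slice is a predicate over an example's
# metadata; low-scoring slices point to concrete, actionable error modes.
import pandas as pd

def slice_multi_policy(row) -> bool:
    # Questions that reference more than one policy.
    return len(row["policy_ids"]) > 1

def slice_long_conversation(row) -> bool:
    # Long conversations, where agents tend to lose track of earlier context.
    return row["num_turns"] >= 10

def slice_requires_calculation(row) -> bool:
    # Deductible questions needing arithmetic (e.g., a percentage of a limit).
    return bool(row["requires_calculation"])

SLICES = {
    "multi_policy": slice_multi_policy,
    "long_conversation": slice_long_conversation,
    "requires_calculation": slice_requires_calculation,
}

def per_slice_accuracy(results: pd.DataFrame) -> pd.Series:
    # `results` has one row per benchmark example and a binary `correct` column.
    return pd.Series({
        name: results[results.apply(fn, axis=1)]["correct"].mean()
        for name, fn in SLICES.items()
    })
```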
We can also use the evaluation approach developed in Snorkel Evaluate above to automatically improve the agentic system via reinforcement learning (RL), which we view as the future of how AI agents are developed. To do this, we use tuning workflows in Snorkel Develop to distill our evaluators into a process reward model (PRM) that drives RL gains. This includes sampling with the PRM and using the PRM signal directly to steer models toward both accuracy and efficiency, rewarding correct answers that are generated via efficient tool use.
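To make the PRM idea concrete, here is a hypothetical sketch of two ways a step-level reward signal can be used: ranking sampled trajectories (best-of-n) and shaping an RL reward that favors correct answers reached with fewer tool calls. The types and scoring function are illustrative stand-ins, not the Snorkel Develop implementation.

```python
# Hypothetical PRM usage sketch: best-of-n trajectory selection and a shaped
# RL reward. `prm_score` is a stand-in for a distilled process reward model.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list[str]      # intermediate reasoning / tool-call steps
    num_tool_calls: int
    is_correct: bool      # as judged by the outcome-level evaluator

def prm_score(step: str) -> float:
    """Stand-in for the distilled PRM: returns a per-step quality score in [0, 1]."""
    raise NotImplementedError

def best_of_n(candidates: list[Trajectory]) -> Trajectory:
    # Rank sampled trajectories by their mean per-step PRM score and keep the best.
    return max(candidates, key=lambda t: sum(map(prm_score, t.steps)) / max(len(t.steps), 1))

def rl_reward(t: Trajectory, efficiency_weight: float = 0.1) -> float:
    # Reward correct outcomes, add dense step-level shaping from the PRM, and
    # apply a small penalty per tool call to favor efficient routes.
    outcome = 1.0 if t.is_correct else 0.0
    shaping = sum(map(prm_score, t.steps)) / max(len(t.steps), 1)
    return outcome + 0.5 * shaping - efficiency_weight * t.num_tool_calls
```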
If you want to dig deeper, we’ve open-sourced an initial sample of the benchmark dataset we built here. We’ll be releasing more here, including the full benchmark dataset, PRMs for reinforcement learning along with results, and more, as part of a detailed walkthrough and technical report at our June 26th event. Come join us if you’re interested in learning more!
This was a very brief overview, but we’ve actually gone through an end-to-end walkthrough of evaluation-driven development for agentic AI in a specialized enterprise setting.
As we saw:
- Off-the-shelf models, powerful as they are, are not enough for either powering or evaluating specialized, mission-critical AI agents in real enterprise settings.
- The key missing steps are around specialized data development and labeling.
- We can close these gaps and accelerate enterprise AI development with Snorkel’s AI Data Development Platform.
If you’re building similar enterprise agentic AI systems, and parts of this demo resonated—let’s talk. And, mark your calendar for June 26th to join our event on developing specialized enterprise AI agents.
Finally: we’ll be regularly open-sourcing more benchmark datasets, model artifacts, and demos like this in the weeks to come—stay tuned!
Alex Ratner is the co-founder and CEO at Snorkel AI, and an affiliate assistant professor of computer science at the University of Washington. Prior to Snorkel AI and UW, he completed his Ph.D. in computer science advised by Christopher Ré at Stanford, where he started and led the Snorkel open source project. His research focused on data-centric AI, applying data management and statistical learning techniques to AI data development and curation.