Introduction

This post describes our specialized benchmark dataset developed through our Data-as-a-Service (DaaS) expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers a number of model-specific, actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider in particular. Let’s get into the gory details!

To see how top AI models performed on these tasks, read our companion post: Evaluating AI Agents for Insurance Underwriting.

Motivation

The past 9-12 months have witnessed an explosion in agents and models that are capable of interacting with larger—often specialized—ecosystems via tool use. The value proposition in enterprise scenarios is strong—but AI agents in these settings are often inaccurate and inefficient. These agents tackle business problems with the awkwardness of a fresh college graduate who is naive to the business world, with generated answers that sound informed but fall apart as soon as you dig beneath the surface.

The cause? Research and development in the AI field have largely focused on easily verifiable settings like coding and math, and on simple, generic use cases where off-the-shelf checks suffice and data is plentiful. However, in specialized settings, off-the-shelf approaches fail and expert data is critical. To illustrate this idea, Snorkel Research is developing a series of benchmarks that include challenging features, such as:

  • Reasoning over specific and proprietary knowledge that models have never seen before
  • Tool use in complex, noisy, and specialized ecosystems
  • Multi-turn interaction with users under uncertainty

The Dataset

Commercial property and casualty insurance is a compelling example of all three challenges. Underwriters routinely make highly nuanced decisions about risk that require access to deep information about businesses and adherence to byzantine business rules and government regulations. We designed an underwriting dataset that offers all of the properties above and can be used to evaluate agents powered by state-of-the-art models. To do so, we leveraged our Data-as-a-Service expert network, working with experienced CPCUs. 

Our benchmark starts by simulating a company, All National Insurance, that sells directly to customers—thousands of fictional companies we created. The data capture interactions in which a junior underwriter has occasionally incomplete information about an applicant and tasks they need help with from our AI copilot. The AI copilot, in turn, has access to resources that include databases and underwriting guidelines in long-form documents. The tasks include determining what other types of insurance, if any, the underwriter should offer the applicant based on its characteristics, whether the applicant qualifies as a “small business”, and so on. We gave underwriters limited information about the applicants, challenging the AI assistant to ask the right questions to solve the task. Specifically, underwriters had the information they might receive in the real world if an applicant were briefly filling out an online form or sending an email, e.g., company name, number of employees, annual revenue.
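For illustration, the partial applicant profile an underwriter starts from might look roughly like the sketch below. The field names and example values are hypothetical placeholders, not the benchmark's actual schema.

```python
# Rough sketch of the partial applicant information an underwriter starts with.
# Field names and example values are illustrative; the benchmark's schema differs.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApplicantProfile:
    company_name: str
    num_employees: Optional[int] = None      # often missing at first contact
    annual_revenue: Optional[float] = None   # often missing at first contact
    business_description: Optional[str] = None


# The AI copilot must notice which fields are missing and ask the underwriter for them.
applicant = ApplicantProfile(company_name="Acme Landscaping LLC", num_employees=42)
```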

We developed the agentic system in LangGraph with Model Context Protocol (MCP). We leveraged the flexibility of those frameworks to work with a wide variety of AI models, both open and closed source. We wrapped each AI model we benchmarked as a ReAct agent. A high-level block diagram of the system is shown below:
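Alongside that diagram, here is a minimal sketch of the agent wiring, assuming the langgraph, langchain-core, and langchain-openai packages. The tool bodies and model choice are placeholders; in our actual system, the equivalent tools are served to the agent over MCP rather than defined locally.

```python
# Minimal sketch: wrapping a benchmarked chat model as a ReAct agent in LangGraph.
# The tools below are local stand-ins for the MCP-served database and guideline tools.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def query_database(sql: str) -> str:
    """Run a read-only SQL query against the simulated underwriting database."""
    return "<query results>"  # placeholder: would execute against the database


@tool
def read_guidelines(section: str) -> str:
    """Return a section of All National Insurance's underwriting guidelines."""
    return "<guideline text>"  # placeholder: would retrieve the long-form documents


# Any LangChain-compatible chat model (open or closed source) can be swapped in here.
agent = create_react_agent(
    model=ChatOpenAI(model="o4-mini"),
    tools=[query_database, read_guidelines],
)

result = agent.invoke(
    {"messages": [{"role": "user",
                   "content": "Does this applicant qualify as a small business?"}]}
)
print(result["messages"][-1].content)
```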

Expert Data & Verification

Our expert network of CPCUs was vital for developing the data. Our goal is to challenge models with realistic, enterprise-relevant settings. We worked with CPCUs to ensure our scenarios are realistic, meaningful, and constitute a useful benchmark in the insurance vertical. Experts iterated on individual samples of data as well as the overall guidelines and data tables. They also helped develop business rules, realistic company profiles, and appropriate responses. For example, we repeatedly incorporated feedback from CPCUs on the realism of those business profiles (e.g., whether a given company would ever conceivably apply for small business insurance).

We note that expert data and verification are key, as without these, the resulting benchmark is highly unrealistic and unlikely to match real-life use cases. Indeed, a major part of the difficulty in our benchmark comes from how it corresponds to meaningful real-world settings—rather than toy scenarios that are so often used to benchmark agentic systems. 

Evaluating Agent Capabilities

Our distilled representation included a database with several tables and free-text guidelines. The overall challenge for the AI copilot involved:

  • Asking the right questions of the underwriter
  • Determining which tools and resources were relevant to a given task
  • Using those tools and resources to solve the problem

Importantly, our fictional system included useful metadata about resources, so the AI copilots theoretically had everything they needed to solve these problems. However, the correct sequence of tool use behavior was sometimes very complex. 

Example: to determine whether an applicant even qualifies as a small business, AI copilots had to:

  1. Find the proper NAICS classification code from the 2012 version of the schema (the schema is revised about every five years)
  2. Use this code to query a table from the US Small Business Administration on:
    1. Which feature of the business to use for qualification (number of employees vs annual revenue) 
    2. The value of that threshold


The AI copilots had direct access only to 2022 NAICS codes, with mappings to the 2012 version available via two other tables. So, they had to execute a chain of SQL queries to determine the correct criteria and thresholds, interacting with underwriters to obtain the information they needed in the process.
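To make that chain concrete, here is a sketch of the kind of query sequence an agent must compose. The table and column names (naics_2022, naics_crosswalk, sba_size_standards) are illustrative placeholders rather than the benchmark's actual schema, and the applicant values stand in for information the agent gathers from the underwriter.

```python
# Illustrative sketch of the small-business qualification lookup chain.
# Table and column names are hypothetical; the benchmark's actual schema differs.
import sqlite3

conn = sqlite3.connect("underwriting.db")  # hypothetical database file
cur = conn.cursor()

# Step 1: find the applicant's 2022 NAICS code from its business description.
cur.execute("SELECT code FROM naics_2022 WHERE title LIKE ?", ("%landscaping%",))
code_2022 = cur.fetchone()[0]

# Step 2: map the 2022 code back to the 2012 revision via a crosswalk table.
cur.execute("SELECT code_2012 FROM naics_crosswalk WHERE code_2022 = ?", (code_2022,))
code_2012 = cur.fetchone()[0]

# Step 3: look up the SBA size standard keyed on the 2012 code: which feature
# applies (employees vs. annual revenue) and the qualifying threshold.
cur.execute(
    "SELECT measure, threshold FROM sba_size_standards WHERE naics_2012 = ?",
    (code_2012,),
)
measure, threshold = cur.fetchone()

# Step 4: compare against what the underwriter has told us about the applicant.
applicant = {"employees": 42, "annual_revenue": 3_500_000}  # gathered in conversation
value = applicant["employees"] if measure == "employees" else applicant["annual_revenue"]
print(f"Qualifies as a small business: {value <= threshold} ({measure} <= {threshold})")
```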

The dataset contained many other tasks related to appetite, limits, and deductibles (and we are only scratching the surface here in our distilled representation!).

AI performance: evaluation and insights

Good benchmark data isn’t just about contests amongst frontier models—it is also actionable. To that end, we evaluate models over a number of criteria and slices of data that represent the perspectives of practitioners and business stakeholders. We provide granular insights into model performance and failure modes. Below, we highlight some basic findings as of the date of this post.

Evaluation criteria

Using Snorkel’s evaluation suite, we developed several scalable measures, including task solution correctness, task solution conciseness, tool use correctness, and tool use efficiency.
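As a rough illustration of the two tool-use measures, here is a simplified sketch that scores a single conversation trace. The trace format and the notion of a "necessary" call are assumptions for illustration, not the internals of Snorkel's evaluation suite.

```python
# Simplified sketch of per-conversation tool-use scoring.
# The trace format and judgments below are illustrative assumptions, not the
# actual implementation of Snorkel's evaluation suite.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    ok: bool          # did the call execute without an error?
    necessary: bool   # was the call needed to solve the task (e.g., per an expert rubric)?


def tool_use_correctness(calls: list[ToolCall]) -> float:
    """Fraction of tool calls that executed without error."""
    return sum(c.ok for c in calls) / len(calls) if calls else 1.0


def tool_use_efficiency(calls: list[ToolCall]) -> float:
    """Fraction of tool calls that were actually needed for the task."""
    return sum(c.necessary for c in calls) / len(calls) if calls else 1.0


trace = [
    ToolCall("query_database", ok=False, necessary=True),   # wrong column name
    ToolCall("get_table_schema", ok=True, necessary=True),  # recovery step
    ToolCall("query_database", ok=True, necessary=True),    # corrected query
]
print(tool_use_correctness(trace), tool_use_efficiency(trace))
```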

Results

Our task solution correctness criterion is the most important. Our leaderboard highlights a wide range of accuracies across frontier models, from the single digits up to ~80%. Importantly, we see a tradeoff between test-time compute and accuracy, with the highest performing model showing the largest consumption of output tokens by far:

Task accuracy by two different components of test-time compute. Each data point is a single model's average across conversations.

How much of this is driven by models taking more turns to find the information they need to answer questions, versus consuming more tokens? We found significant correlations between accuracy and the number of turns the AI copilots had with the underwriter (probing for information) as well as the number of tools used—with some notable exceptions. For example, one model took an average of 7 turns with the underwriter, only to achieve a task accuracy score of about 55%. This model struggled to ask the right questions of the underwriter, even when it could use the tools correctly.
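This kind of correlation analysis is straightforward to run over per-conversation traces; a minimal sketch follows, where the results file and its column names are hypothetical placeholders rather than an artifact we ship with the benchmark.

```python
# Sketch: correlating per-model accuracy with interaction statistics.
# The results file and column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("per_conversation_results.csv")  # hypothetical export of traces

# Aggregate per model: mean accuracy, mean underwriter turns, mean tool calls.
per_model = df.groupby("model").agg(
    accuracy=("correct", "mean"),
    turns=("underwriter_turns", "mean"),
    tool_calls=("num_tool_calls", "mean"),
)

for col in ("turns", "tool_calls"):
    r, p = pearsonr(per_model["accuracy"], per_model[col])
    print(f"accuracy vs {col}: r={r:.2f}, p={p:.3f}")
```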

We also see interesting patterns across tasks:

Task                      Accuracy
Deductibles               0.784
Business Classification   0.772
Policy Limits             0.762
Appetite Check            0.615
Product Recommendations   0.377

Accuracy by task, averaged across models. Accuracy here is computed after removing conversations with basic errors such as recursion errors in LangGraph.

Unsurprisingly, business classification (using 2022 NAICS codes) was one of the easiest tasks across models: it is a basic task required for most of the others. If the agent gets that one wrong, it likely fails at many others. Policy limits and deductibles were also easy because the underwriting guidelines contained defaults that applied a large percentage of the time. So, if the AI copilot could read the guidelines, it had a good shot here.

The most challenging tasks forced models to use at least 3 or 4 tools and compose them correctly (feeding the results of one tool into the next, and so on), probing the underwriter for information along the way. We see this in appetite checks and product recommendations, in which some undetermined number of additional products need to be suggested (more on that below). But we also see this in important subsets of the policy limits and deductible tasks where more nuanced underwriting logic is required.

Error modes

Beyond basic stats, we leveraged our evaluation criteria to uncover interesting behaviors that suggest a need to develop models along many different axes. 

Two examples: 

  • Tool use errors: Across models, including top performers, agents made at least one tool call error in 36% of the conversations despite having access to the metadata required to use tools properly. Surprisingly, we found no correlation with overall performance. In fact, even the three most accurate models made tool call errors in 30-50% of the conversations, often going back after basic tool call errors to retrieve metadata and redo the tool call correctly. We could have engineered prompts to direct models towards this behavior (and our experiments show that this works), but we aimed to simulate a realistic enterprise setting. Real, deployed copilot systems include dozens of possible tools and an unwieldy, combinatorially large number of tool combinations, making it impractical to engineer system prompts for every edge case. In addition, there is no reason why a frontier model like o4-mini, Gemini 2.5 Pro or Claude 4 Sonnet should require that kind of hand-holding. If they can answer questions about quantum field theory, why can’t they figure out what columns are in a table before writing a SQL query?

Example trace with o4-mini showing a tool use error followed by a correction after looking up the table schema. 

  • Hallucinations based on pretrained domain knowledge: Some of the highest performing models have clearly been trained on insurance data, but almost to a fault. They occasionally hallucinated guidelines that might appear online but were not actually contained in the provided documentation. For example, in the task involving insurance product recommendations, the top performing models from OpenAI hallucinated several products not in the guidelines 15-45% of the time, depending on the model. These hallucinations led not only to misleading answers, but also to misleading questions to the underwriter—probing for information that was ultimately irrelevant to the task.

Example trace ending showing hallucination from the top performing model overall. None of the listed insurance products were in the provided guidelines even though they are typical in other commercial insurance contexts.
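A simple post-hoc check for this failure mode is to compare each recommended product against the set of products that actually appear in the provided guidelines. A minimal sketch follows, where the guideline product list and the example recommendation are hypothetical placeholders.

```python
# Sketch: flagging product recommendations that are not grounded in the
# provided guidelines. Product names below are illustrative placeholders.
GUIDELINE_PRODUCTS = {
    "general liability",
    "commercial property",
    "workers' compensation",
}  # would be parsed from the provided underwriting guidelines


def ungrounded_products(recommended: list[str]) -> list[str]:
    """Return recommended products that never appear in the guidelines."""
    return [p for p in recommended if p.lower() not in GUIDELINE_PRODUCTS]


# An agent response recommending a product the guidelines never mention:
print(ungrounded_products(["Commercial Property", "Cyber Liability"]))
# -> ['Cyber Liability']
```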

These error modes illustrate a really basic point: contrary to one recent claim about the increasing irrelevance of proprietary knowledge, we continually see its relevance in real-world customer problems, and our dataset captures that. Like our customers, All National Insurance has its “secret sauce” of underwriting that you won’t find online. Models that hallucinate generic knowledge within a vertical work against that, inserting subtle but potentially catastrophic factual inaccuracies.

Conclusion

Our dataset and findings across models illustrate the key point: even the most performant frontier models struggle in surprising ways on tasks they have never seen before. From one vantage point, the tasks are indeed complex, requiring multiple tools in the correct order and nuanced, proprietary reasoning that can only be found in the associated documents. But from another, more basic vantage point, these tasks are still nowhere near the complexity we see in other challenging academic benchmarks. They theoretically require no more than 4 or 5 steps to solve, given the right information from the user.

There is something interesting at play here related to user interaction and tool use. Just as a theoretical physicist may not be your best bet to build a bridge, a frontier reasoning model taken off the shelf is not your best bet to solve your business problem. Getting there requires careful evaluation and development with benchmark data that capture the skills relevant to your domain of expertise. We’ve shown here how Snorkel’s Expert Data service can be leveraged to that end.