Evaluating AI Agents for Insurance Underwriting
In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals and domains.
Motivation
Our agent-based insurance benchmark was motivated by observations we have made working with customers and in the field of AI more generally: The past 9-12 months have witnessed an explosion in agents and models capable of interacting with larger ecosystems via tool use. The value proposition in enterprise settings is strong, promising that models, even teams of models, can solve more complex tasks with more autonomy.
However, AI agents in enterprise settings are often inaccurate and inefficient—with test-time compute that is larger than necessary. These agents tackle business problems with the awkwardness of a fresh college graduate naive to the business world, with generated answers that sound informed by textbooks but fall apart as soon as you dig beneath the surface.
This happens because they are not tuned to the critical details of the enterprise problem. Research and development in the AI field have largely focused on easily verifiable settings like coding and math, and simple, generic use cases where off-the-shelf checks suffice. This does not easily translate to enterprise settings.
The Snorkel Research team has been addressing this gap based on observations from internal experiments and real customer use cases. Effective agent-based solutions must account for:
- The IT ecosystem they live in
- How end users will interact with them and consume their outputs
This turns out to be more challenging for models than one would expect based on reasoning benchmarks. To enable actionable insights, we have been developing a series of benchmarks that include:
- Reasoning over proprietary knowledge that models have never seen before
- Tool use in complex, noisy IT ecosystems
- Multi-turn interaction with users under uncertainty
Our goal is similar to open source initiatives such as Tau-Bench, but with intensive engagement of our expert network to ensure realistic, high-quality samples.
The Dataset
Commercial property and casualty insurance underwriters routinely make highly nuanced decisions about risk that require access to deep information about businesses and adherence to byzantine business rules and government regulations. Their role offers a compelling example of the challenges AI agents will face in enterprise settings.
To this end, we designed an underwriting dataset that offers all of the properties above and can be used to evaluate agents powered by state-of-the-art models, leveraging our Data-as-a-Service Expert Network of experienced CPCUs.
Our goal was to create a distilled representation of the initial stages of underwriting for small businesses in North America: initial applications for insurance at a fictional company, All National Insurance. Importantly, All National Insurance prefers to sell directly to customers, so no insurance agents or brokers are involved, putting the onus on the system to gather basic information. The dataset captures interactions in which a junior underwriter has occasionally incomplete information about an applicant and needs help from an AI copilot to complete their tasks. The AI copilot, in turn, has access to resources that include databases and underwriting guidelines in long-form documents.
The AI copilot tech
We developed the agentic system in LangGraph with the Model Context Protocol (MCP). We leveraged the flexibility of those frameworks to work with a wide variety of AI models, both open and closed source. We wrapped each AI model we benchmarked as a ReAct agent.
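As a rough sketch of that setup (assuming the langgraph and langchain-mcp-adapters packages; the MCP server definitions, file names, and model identifier below are illustrative placeholders rather than our exact configuration):

```python
# Minimal sketch: wrap a chat model as a ReAct agent over MCP-served tools.
# Assumes the langgraph and langchain-mcp-adapters packages; the server
# definitions and model name are illustrative, not our actual configuration.
import asyncio

from langchain.chat_models import init_chat_model
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent


async def main() -> None:
    # Hypothetical MCP servers exposing the benchmark database and guidelines.
    client = MultiServerMCPClient(
        {
            "underwriting_db": {"transport": "stdio", "command": "python", "args": ["db_server.py"]},
            "guidelines": {"transport": "stdio", "command": "python", "args": ["docs_server.py"]},
        }
    )
    tools = await client.get_tools()

    # Any LangChain-compatible model, open or closed source, can be swapped in here.
    model = init_chat_model("gpt-4o", model_provider="openai")
    agent = create_react_agent(model, tools)

    result = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "Is this applicant in appetite for a property policy?"}]}
    )
    print(result["messages"][-1].content)


if __name__ == "__main__":
    asyncio.run(main())
```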
Future work on these datasets will explore alternative frameworks and methodologies, but for benchmarking purposes we wanted to begin where we see many practitioners start when they prototype.
The tasks
We developed six basic types of tasks:
- Whether the type of insurance, or “line of business,” being applied for was “in appetite,” which is an initial screening to see if the company attributes are acceptable to All National Insurance.
- What, if any, other types of insurance the underwriter should offer the applicant based on their characteristics.
- Whether the applicant qualifies as a “small business.”
- The appropriate classification of the applicant based on their operations using the North American Industry Classification System (NAICS).
- What types of limits the underwriter should offer the applicant if they are in appetite.
- What types of deductibles the underwriter should offer the applicant if they are in appetite.
The companies
We synthesized thousands of fictional companies to represent the broad spectrum of small businesses in North America. A key aspect was feedback from CPCUs on the realism of those business profiles (e.g., whether a given company would ever conceivably apply for small business insurance). Specifically, we sampled NAICS codes and business statistics and worked with a frontier model to generate structured profiles that we could then use in each sampled task.
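Conceptually, the generation loop looked like the following sketch. The helper names, NAICS entries, and prompt are hypothetical and abbreviated; in practice, the business statistics were conditioned on the sampled industry.

```python
# Conceptual sketch of the company-synthesis loop. sample_business_stats and
# call_frontier_model are hypothetical helpers, not part of our released tooling.
import json
import random

# Tiny illustrative slice of the 2022 NAICS schema.
NAICS_2022 = {
    "722511": "Full-Service Restaurants",
    "238220": "Plumbing, Heating, and Air-Conditioning Contractors",
}


def sample_business_stats(naics_code: str) -> dict:
    # In practice these distributions were conditioned on the NAICS code;
    # the uniform ranges here are purely illustrative.
    return {
        "num_employees": random.randint(1, 100),
        "annual_revenue": random.randint(50_000, 10_000_000),
    }


def synthesize_company(call_frontier_model) -> dict:
    code = random.choice(list(NAICS_2022))
    stats = sample_business_stats(code)
    prompt = (
        "Generate a realistic small-business profile as JSON with fields "
        "name, description, and location, given: "
        f"NAICS {code} ({NAICS_2022[code]}), {stats['num_employees']} employees, "
        f"${stats['annual_revenue']:,} annual revenue."
    )
    profile = json.loads(call_frontier_model(prompt))  # expects a JSON string back
    profile.update(stats, naics_2022=code)
    return profile  # profiles like this were then reviewed by CPCUs for realism
```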
The underwriters
We gave underwriters limited information about the applicants, challenging the AI assistant to ask the right questions to solve the task. Specifically, underwriters had only the information they might receive in the real world if an applicant were briefly filling out an online form or sending an email (an illustrative record is sketched after the list):
- Company name
- Verbose description of the applicant’s operations
- Number of employees
- Annual revenue
- Total annual payroll
- Number of automobiles operated and owned (if auto insurance was relevant)
- Basic description of the property (if property insurance was relevant)
- Location
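Here is a minimal sketch of such an applicant record; the dataclass and field names are illustrative rather than the dataset’s actual schema.

```python
# Illustrative applicant record as seen by the junior underwriter. Field names
# are hypothetical; optional fields are populated only when the corresponding
# line of business is relevant.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApplicantProfile:
    company_name: str
    operations_description: str            # verbose free-text description
    num_employees: int
    annual_revenue: float
    annual_payroll: float
    location: str
    autos_owned_and_operated: Optional[int] = None   # only if auto insurance is relevant
    property_description: Optional[str] = None       # only if property insurance is relevant


example = ApplicantProfile(
    company_name="Maple Street Bakery LLC",
    operations_description="Retail bakery selling bread and pastries, with light catering.",
    num_employees=6,
    annual_revenue=420_000.0,
    annual_payroll=180_000.0,
    location="Columbus, OH",
)
```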
The IT ecosystem and reasoning challenges
Our distilled representation included a database with several tables and free-text guidelines. The overall challenge for the AI copilot was to reason about:
- Asking the right questions of the underwriter
- Determining which tools and resources were relevant to a given task
- Using those tools and resources to solve the problem
Importantly, our fictional system included useful metadata about resources, so the AI copilots theoretically had everything they needed to solve these problems. However, the correct sequence of tool use behavior was sometimes very complex.
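For a concrete picture, the resource surface can be imagined as tools like the ones below. The names, docstrings, and storage choices (SQLite, markdown files) are assumptions made for this sketch; in the benchmark, equivalent tools were exposed to the agent over MCP.

```python
# Illustrative tool surface for the copilot. Names and docstrings are
# hypothetical; in our benchmark the equivalent tools were served over MCP.
import sqlite3

from langchain_core.tools import tool

DB_PATH = "underwriting.db"  # hypothetical SQLite database holding the benchmark tables


@tool
def list_tables() -> list[str]:
    """Return the names of the available database tables."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [r[0] for r in rows]


@tool
def describe_table(table_name: str) -> str:
    """Return column names and types for a table (the metadata agents should check first)."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return "\n".join(f"{r[1]} ({r[2]})" for r in rows)


@tool
def run_sql(query: str) -> list[tuple]:
    """Execute a read-only SQL query against the underwriting database."""
    with sqlite3.connect(DB_PATH) as conn:
        return conn.execute(query).fetchall()


@tool
def read_underwriting_guidelines(section: str) -> str:
    """Return the requested section of All National Insurance's free-text guidelines."""
    with open(f"guidelines/{section}.md") as f:
        return f.read()
```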
Example 1: To determine whether an applicant even qualifies as a small business, AI copilots had to:
- Find the proper NAICS classification code from the 2012 version of the schema (the schema is revised roughly every five years)
- Use this code to query a table from the US Small Business Administration on:
- Which feature of the business to use for qualification (number of employees vs annual revenue)
- The threshold value for that feature
The AI copilots had direct access only to information about 2022 NAICS codes, with mappings to the 2012 version available via two other tables. So, they had to issue a chain of SQL queries to determine the correct criteria and thresholds, interacting with underwriters to obtain the information they needed along the way.
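Sketched in code, with table and column names that are hypothetical stand-ins for the benchmark’s actual schema, the chain looks roughly like this:

```python
# Hypothetical query chain for the small-business qualification task.
# Table and column names are invented stand-ins for the benchmark's schema,
# and run_sql is assumed to be the copilot's SQL tool.


def qualifies_as_small_business(run_sql, naics_2022_code: str, applicant: dict) -> bool:
    # 1. Walk the crosswalk tables from the 2022 code back to its 2012 equivalent.
    naics_2017 = run_sql(
        f"SELECT naics_2017 FROM crosswalk_2022_2017 WHERE naics_2022 = '{naics_2022_code}'"
    )[0][0]
    naics_2012 = run_sql(
        f"SELECT naics_2012 FROM crosswalk_2017_2012 WHERE naics_2017 = '{naics_2017}'"
    )[0][0]

    # 2. Look up the SBA size standard for the 2012 code: which measure applies
    #    (employee count vs. annual revenue) and the threshold value.
    measure, threshold = run_sql(
        f"SELECT size_measure, size_threshold FROM sba_size_standards WHERE naics_2012 = '{naics_2012}'"
    )[0]

    # 3. Compare the applicant's value, asking the underwriter for it if it is missing.
    value = applicant["num_employees"] if measure == "employees" else applicant["annual_revenue"]
    return value <= threshold
```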
Example 2: To determine whether an applicant was in appetite for property insurance, AI copilots had to take the steps below (sketched in code after the list):
- Read the free-text underwriting guidelines.
- Figure out whether the applicant was in a special cohort of class codes (related to real estate).
- If the applicant was, gather information about the property.
- Classify the property construction, using the guidelines to make a final decision.
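A compact sketch of that flow, with an invented class-code cohort and construction categories, and with the helpers passed in as callables:

```python
# Illustrative flow for the property-appetite check. The class-code cohort,
# construction categories, and helper callables are invented for this sketch.
REAL_ESTATE_COHORT = {"531110", "531120", "531311"}       # hypothetical class codes
ACCEPTABLE_CONSTRUCTION = {"masonry", "fire_resistive"}   # hypothetical guideline rule


def property_in_appetite(read_guidelines, ask_underwriter, classify_construction,
                         standard_appetite_check, class_code: str) -> bool:
    guidelines = read_guidelines("property_appetite")      # 1. read the free-text rules
    if class_code not in REAL_ESTATE_COHORT:               # 2. special-cohort check
        return standard_appetite_check(guidelines, class_code)
    details = ask_underwriter(                             # 3. gather property information
        "Please describe the building's construction, age, and square footage."
    )
    construction = classify_construction(details, guidelines)  # 4. classify, then decide
    return construction in ACCEPTABLE_CONSTRUCTION
```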
These are just two examples. The dataset contained many others related to appetite, limits, and deductibles (and we are only scratching the surface here in our distilled representation!).
How we involved our expert network
Our network of CPCUs was vital for developing the data. One of our key focus areas at Snorkel is challenging models with realistic, enterprise-relevant “gotchas.” So, we worked hand in hand with CPCUs to ensure our distilled scenarios resembled the real world closely enough to constitute a useful benchmark in the insurance vertical. To that end, our CPCU network worked with us over several iterations on both individual samples of data as well as the overall guidelines and data tables. They also helped develop business rules, realistic company profiles, and appropriate underwriting responses.
AI performance: evaluation and insights
Good benchmark data isn’t just about contests amongst frontier models. A useful dataset is actionable.
To that end, we are evaluating models over a number of criteria and slices of data that represent the perspectives of practitioners and business stakeholders. The efficacy of a complex AI system is not simply about academic correctness. It involves measures of efficiency (cost), the ability to interact with users and forage for information appropriately, the ability to make decisions under uncertainty, and the ability to solve for incompletely defined business objectives.
As part of this series, we will provide granular insights into model performance and failure modes. Here we will highlight some basics as of the date of this post.
Evaluation criteria
Using Snorkel’s evaluation suite, we have developed several scalable measures (a simplified code sketch follows the list):
- Task solution correctness: Here, we have simple reference answers for each sample that we generated with a combination of expert input and programmatic techniques. We used a frontier model to fact-check copilot answers against these references, without dinging AI copilots for irrelevant information.
- Task solution conciseness: This criterion complements the first, measuring how much of the AI copilot’s information and insight was actually relevant to the underwriting task.
- Tool use correctness: This measure focuses on basic tool use. For example, were SQL queries properly written and executed?
- Tool use efficiency: Here, we focused on the efficiency of the overall plan for tool use and task solution. We observed a wide range of approaches from models, many of which struggle to plan effectively ahead of time. Instead, these models go through a trial-and-error process of using tools incorrectly and revising the way they use those tools based on the results.
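To give a flavor of how such measures can be computed at scale, here is a simplified sketch. It is not the API of Snorkel’s evaluation suite; the judge callable, prompt, and trace format are assumptions.

```python
# Simplified sketch of two of the measures. This is not the API of Snorkel's
# evaluation suite; the judge callable and the trace format are assumptions.


def task_solution_correctness(judge_llm, reference_answer: str, copilot_answer: str) -> bool:
    """LLM-as-judge fact check: is the copilot's answer consistent with the reference?
    Extra but irrelevant information is deliberately not penalized here."""
    verdict = judge_llm(
        "Compare the candidate answer to the reference answer. Return YES if every "
        "factual claim needed to solve the task matches the reference, ignoring any "
        "additional irrelevant content; otherwise return NO.\n"
        f"Reference: {reference_answer}\nCandidate: {copilot_answer}"
    )
    return verdict.strip().upper().startswith("YES")


def tool_use_correctness(trace: list[dict]) -> float:
    """Fraction of tool calls in a conversation trace that executed without error."""
    calls = [step for step in trace if step.get("type") == "tool_call"]
    if not calls:
        return 1.0
    return sum(1 for call in calls if not call.get("error")) / len(calls)
```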
Results
Our task solution correctness criterion is the most important. Our leaderboard highlights a wide range of accuracies across frontier models, from the single digits up to ~80%.
Importantly, we see a tradeoff between test-time compute and accuracy, with the highest performing model showing by far the largest consumption of output tokens.
How much of this is driven by models taking more turns to find the information they need to answer questions, vs. consuming more tokens? We dove deep into test-time compute and found significant correlations between accuracy and the number of turns the AI copilots had with the underwriter (probing for information), as well as the number of tools used. There were some notable exceptions, however. For example, one model took an average of 7 turns with the underwriter, only to achieve a task accuracy score of about 55%. Inspection of the traces indicated that this model simply struggled to ask the right questions, even when it could use the tools correctly. These findings led us to look more closely at efficiency, which we will briefly expand on below.
We also see interesting patterns across tasks:
Unsurprisingly, business classification (using 2022 NAICS codes) was one of the easiest tasks across models. This is expected: it is a basic task required for most of the others, and if an agent gets it wrong, it likely fails at many others as well. Policy limits and deductibles were also relatively easy because the underwriting guidelines contained defaults applicable to a large percentage of cases. So, if the AI copilot could read the guidelines, it had a good shot here.
The most challenging tasks forced models to use at least 3 or 4 tools and compose them correctly (feeding the results of one tool into the next, etc.), probing the underwriter for information along the way. We see this in appetite checks and product recommendations, in which some undetermined number of additional products need to be suggested (more on that below). But we also see this in important subsets of the policy limits and deductible tasks, which require more nuanced underwriting logic. For example, even though models overall performed best on tasks involving deductibles, they were 25 points less accurate on average on this task for auto policies because of important exceptions in the underwriting guidelines. These exceptions required proper business classification and information from the underwriter about the number of vehicles owned and operated.
Error modes
Beyond these basic stats, we leveraged our evaluation criteria to uncover interesting behaviors that suggest a need to develop models along many different axes.
Two examples:
- Tool use errors: Across models, including top performers, agents made at least one tool call error in 36% of the conversations despite access to the metadata required to use tools properly. Surprisingly, we found no correlation with overall performance. In fact, even the three most accurate models made tool call errors in 30-50% of the conversations, often going back after basic tool call errors to retrieve metadata and redo the tool call correctly. We could have engineered prompts to direct models towards this behavior (and our experiments show that this works), but we aimed to simulate a realistic enterprise setting. Real, deployed copilot systems include dozens of possible tools and a combinatorially large, unwieldy number of possible tool sequences, making it impractical to simply engineer system prompts for every edge case. In addition, there is no reason why a frontier model like o4-mini, Gemini 2.5 Pro or Claude 4 Sonnet should require that kind of hand-holding. If they can answer questions about quantum field theory, why can’t they figure out what columns are in a table before writing an SQL query?
- Hallucinations based on pretrained domain knowledge: We observed a distinct error mode related to hallucinations based on pretrained domain knowledge. Some of the highest performing models have clearly been trained on insurance data, but almost to a fault. They occasionally hallucinated guidelines that might appear online but were not actually contained in the provided documentation. For example, in the task involving insurance product recommendations, the top performing models from OpenAI hallucinated several products not in the guidelines 15-45% of the time. These hallucinations not only led to misleading answers, but also to misleading questions to the underwriter—probing for information that was ultimately irrelevant to the task.
These error modes illustrate a really basic point—contrary to one recent claim about the increasing irrelevance of proprietary knowledge, we continually see this relevance in real-world customer problems, and this benchmark dataset captures that. Like our customers, All National Insurance has its “secret sauce” of underwriting that you won’t find online. Models that hallucinate generic knowledge within a vertical work against that, inserting subtle but potentially catastrophic factual inaccuracies.
Conclusion
This dataset and the initial findings across models illustrate the key point: even the most performant frontier models struggle in surprising ways on tasks they have never seen before.
From one vantage point, the tasks are indeed complex, requiring multiple tools in the correct order and nuanced, proprietary reasoning that can only be found in the associated documents. But from another, more basic vantage point, these tasks are still nowhere near the complexity we see in other challenging academic benchmarks. They theoretically require no more than 4 or 5 steps to solve with the right information from the user.
There is something interesting at play here related to user interaction and tool use. Just as a theoretical physicist may not be your best bet to build a bridge, a frontier reasoning model taken off the shelf is not your best bet to solve your business problem. Solving it requires careful evaluation and development with benchmark data that capture the skills relevant to your domain of expertise. We’ve shown here how Snorkel’s Expert Data service can be leveraged to that end.