SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.

Overview

To date, most agentic and reasoning benchmarks have centered on tasks in STEM domains. In the business world, however, many problems demand domain-specific reasoning and behavior in messier ecosystems, where agents must combine logic over metadata with the right tools. To address this gap, we developed a set of tasks that challenges AI agents in real-world ways, using a complex enterprise domain: commercial insurance underwriting.

The SnorkelUnderwrite benchmark is multi-turn: AI agents must interact effectively with underwriters to help them solve their tasks, not only by reasoning over tools but also by asking the underwriters informative questions.

We built the system in LangGraph with Model Context Protocol (MCP) and ReAct agents. We engaged our network of Chartered Property Casualty Underwriters (CPCUs) to create crucial components of the system, including a diverse sample dataset covering 6 distinct types of tasks, all related to applications for insurance by small businesses. Many of the tasks include subtasks involving more nuanced, complex underwriting logic. In each conversation, the underwriter has one of these specific tasks to solve, and the tasks typically require 3-7 steps of reasoning and tool use over a total of 10-20 conversational turns.
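For illustration, the sketch below shows how such a ReAct agent over underwriting tools can be assembled with LangGraph's prebuilt agent. It is a minimal sketch only: the tool names, schemas, and return values are hypothetical stand-ins, and the benchmark's actual tools are served over MCP rather than stubbed in-process.

```python
# Minimal sketch of the agent setup, assuming a recent LangGraph release that
# accepts a provider-prefixed model string. Tool names, schemas, and return
# values are hypothetical; the benchmark serves its real tools over MCP.
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def lookup_business_profile(business_id: str) -> dict:
    """Return basic metadata for an applicant business (hypothetical tool)."""
    return {"business_id": business_id, "naics": "722511", "state": "TX",
            "annual_revenue": 1_200_000}


@tool
def check_appetite(naics: str, state: str) -> str:
    """Check whether a NAICS code / state combination is in appetite (hypothetical tool)."""
    return "in_appetite" if naics.startswith("72") else "out_of_appetite"


# A ReAct agent that interleaves reasoning, tool calls, and clarifying
# questions to the underwriter across conversational turns.
agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[lookup_business_profile, check_appetite],
)

result = agent.invoke(
    {"messages": [("user", "Is applicant B-1042, a Texas restaurant, in appetite?")]}
)
print(result["messages"][-1].content)
```

In the benchmark itself, the underwriter's messages arrive over many turns, so the agent's message list grows across the conversation rather than containing a single question.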

Learn more about the development process in our blog post: Building the Benchmark: Inside Our Agentic Insurance Underwriting Dataset.

Model Comparison


Data Sample

Visualization of the SnorkelUnderwrite dataset showing the distribution of tasks and model performance across different commercial underwriting scenarios. A sample of the dataset can be found here.

Graph reasoning example

Evaluation Methodology

We evaluate AI agent performance along several key axes, including correct tool use, response conciseness, and overall accuracy in solving the task. Here we show overall accuracy, computed with an LLM-as-a-Judge (GPT-4.1) that compares the agent's "final answer" against a reference answer generated programmatically from the task data.
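As an illustrative sketch only (the production judge is aligned in Snorkel Evaluate and may use a richer rubric), a pass/fail comparison of a final answer against its reference could look like the following; the prompt wording and verdict parsing are assumptions.

```python
# Illustrative LLM-as-a-Judge call: compare the agent's final answer with the
# programmatically generated reference answer and return a pass/fail verdict.
# The prompt text and CORRECT/INCORRECT parsing are assumptions for this sketch.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a commercial underwriting assistant.

Reference answer (generated programmatically from the task data):
{reference}

Agent's final answer:
{answer}

Does the agent's final answer reach the same underwriting conclusion as the
reference? Reply with exactly CORRECT or INCORRECT."""


def judge_final_answer(answer: str, reference: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(reference=reference, answer=answer)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```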

After alignment in Snorkel Evaluate, the LLM-as-a-Judge has an agreement rate of 94.5% with human annotations on a balanced, random sample of 200 conversations.
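The agreement figure is simply the fraction of sampled conversations on which the judge's verdict matches the human annotation; a minimal sketch, with hypothetical field names:

```python
# Agreement between judge verdicts and human annotations on a labeled sample.
# The record fields "judge_correct" and "human_correct" are hypothetical.
def agreement_rate(records: list[dict]) -> float:
    matches = sum(r["judge_correct"] == r["human_correct"] for r in records)
    return matches / len(records)

# For example, 189 matching verdicts on a 200-conversation sample gives 0.945.
```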

Note: We report accuracy on all traces, including those in which the AI agent failed to provide a final task solution, whether due to recursion errors in LangGraph or errant agent behavior that forced the conversation to end prematurely. These failures are far more frequent among open-source models (<1% of traces for closed-source models versus 10-30%+ for open-source models).
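Concretely, this reporting convention treats any trace that never produced a final answer as incorrect; a minimal sketch, assuming a per-trace record with hypothetical fields:

```python
# Overall accuracy over all traces: traces without a final answer (recursion
# errors, premature termination) count as incorrect rather than being dropped.
def overall_accuracy(traces: list[dict]) -> float:
    correct = sum(t.get("final_answer") is not None and t["judge_correct"]
                  for t in traces)
    return correct / len(traces)
```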

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality, expert data.