SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.

Overview

To date, most agentic and reasoning benchmarks have centered on tasks in STEM domains. In the business world, however, many problems demand domain-specific reasoning and behavior in messier ecosystems, where agents must combine logic over metadata with the right tools. To address this gap, we developed a set of tasks that challenges AI agents in real-world ways, using a complex enterprise domain: commercial insurance underwriting.

The SnorkelUnderwrite benchmark is multi-turn: AI agents must interact effectively with underwriters to help them solve their tasks, not only by reasoning over tools but also by asking the underwriters informative questions.

We built the system in LangGraph with Model Context Protocol (MCP) and ReAct agents. We engaged our network of Chartered Property Casualty Underwriters (CPCUs) to create crucial components of the system, including a diverse sample dataset covering 6 distinct types of tasks, all related to applications for insurance by small businesses. Many of the tasks include subtasks involving more nuanced, complex underwriting logic. In each conversation, the underwriter has one of these specific tasks to solve, and the tasks typically require 3-7 steps of reasoning and tool use over a total of 10-20 conversational turns.
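For illustration, the sketch below shows how such a ReAct agent over underwriting tools can be assembled with LangGraph's prebuilt agent. It is a minimal sketch only: the tool names, schemas, and return values are hypothetical stand-ins, and the benchmark's actual tools are served over MCP rather than stubbed in-process.

```python
# Minimal sketch of the agent setup, assuming a recent LangGraph release that
# accepts a provider-prefixed model string. Tool names, schemas, and return
# values are hypothetical; the benchmark serves its real tools over MCP.
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def lookup_business_profile(business_id: str) -> dict:
    """Return basic metadata for an applicant business (hypothetical tool)."""
    return {"business_id": business_id, "naics": "722511", "state": "TX",
            "annual_revenue": 1_200_000}


@tool
def check_appetite(naics: str, state: str) -> str:
    """Check whether a NAICS code / state combination is in appetite (hypothetical tool)."""
    return "in_appetite" if naics.startswith("72") else "out_of_appetite"


# A ReAct agent that interleaves reasoning, tool calls, and clarifying
# questions to the underwriter across conversational turns.
agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[lookup_business_profile, check_appetite],
)

result = agent.invoke(
    {"messages": [("user", "Is applicant B-1042, a Texas restaurant, in appetite?")]}
)
print(result["messages"][-1].content)
```

In the benchmark itself, the underwriter's messages arrive over many turns, so the agent's message list grows across the conversation rather than containing a single question.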

Learn more about the development process in our blog post: Building the Benchmark: Inside Our Agentic Insurance Underwriting Dataset.

Model Comparison


Data Sample

Visualization of the SnorkelUnderwrite dataset showing the distribution of tasks and model performance across different commercial underwriting scenarios. A sample of the dataset can be found here.

Graph reasoning example

Evaluation Methodology

We evaluate AI agent performance along several key axes, including correct tool use, response conciseness, and overall accuracy in solving the task. Here we show overall accuracy, computed with an LLM-as-a-Judge (GPT-4.1) that compares the agent's "final answer" against a reference answer generated programmatically from the task data.
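As an illustrative sketch only (the production judge is aligned in Snorkel Evaluate and may use a richer rubric), a pass/fail comparison of a final answer against its reference could look like the following; the prompt wording and verdict parsing are assumptions.

```python
# Illustrative LLM-as-a-Judge call: compare the agent's final answer with the
# programmatically generated reference answer and return a pass/fail verdict.
# The prompt text and CORRECT/INCORRECT parsing are assumptions for this sketch.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a commercial underwriting assistant.

Reference answer (generated programmatically from the task data):
{reference}

Agent's final answer:
{answer}

Does the agent's final answer reach the same underwriting conclusion as the
reference? Reply with exactly CORRECT or INCORRECT."""


def judge_final_answer(answer: str, reference: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(reference=reference, answer=answer)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```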

After alignment in Snorkel Evaluate, the LLM-as-a-Judge has an agreement rate of 94.5% with human annotations on a balanced, random sample of 200 conversations.
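The agreement figure is simply the fraction of sampled conversations on which the judge's verdict matches the human annotation; a minimal sketch, with hypothetical field names:

```python
# Agreement between judge verdicts and human annotations on a labeled sample.
# The record fields "judge_correct" and "human_correct" are hypothetical.
def agreement_rate(records: list[dict]) -> float:
    matches = sum(r["judge_correct"] == r["human_correct"] for r in records)
    return matches / len(records)

# For example, 189 matching verdicts on a 200-conversation sample gives 0.945.
```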

Note: We report accuracy on all traces, including those in which the AI agent failed to provide a final task solution, whether due to recursion errors in LangGraph or errant agent behavior that forced the conversation to end prematurely. These failures are far more frequent among open-source models (<1% of traces for closed-source models versus 10-30%+ for open-source models).
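Concretely, this reporting convention treats any trace that never produced a final answer as incorrect; a minimal sketch, assuming a per-trace record with hypothetical fields:

```python
# Overall accuracy over all traces: traces without a final answer (recursion
# errors, premature termination) count as incorrect rather than being dropped.
def overall_accuracy(traces: list[dict]) -> float:
    correct = sum(t.get("final_answer") is not None and t["judge_correct"]
                  for t in traces)
    return correct / len(traces)
```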

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality, expert data.