Back to benchmarks
Released April 04, 2026
Archived

SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
Overview

To date, most agentic and reasoning benchmarks have centered around tasks in the STEM domains. However, in the business world, many problems require domain-specific reasoning and behavior in messier ecosystems that require logic with metadata and tool combinations. To address this gap, we have developed a set of tasks that challenges AI agents in real-world ways, using a complex enterprise domain: commercial insurance underwriting.

The SnorkelUnderwrite benchmark is multi-turn, requiring AI agents to effectively interact with underwriters to help them solve their tasks by not only reasoning over tools, but by also asking the underwriters informative questions.

Leaderboard

Rank Model Score
1 GPT-5.4
91%
2 Claude Opus 4.1
86.3%
3 gpt-5
83.33%
4 Grok 4
83.3%
5 Grok 4 Fast Reasoning
81.33%
6 Grok 3
78%
7 o4 mini
78%
8 Claude Opus 4
77%
9 o3
77%
10 Claude Sonnet 3.7
74.6%
11 Claude Sonnet 4
72.3%
12 gpt-5-mini
71.67%
13 Kimi-K2-Thinking
71.3%
14 GPT-4.1
70.6%
15 Gemini 2.5 Flash
61%
16 Nova Premier
57%
17 Gemini 2.5 Pro
56.3%
18 Nova Pro
52.3%
19 gpt-5-nano
47%
20 Llama 3.3 70B
46.3%
21 Llama 4 Maverick
46.3%
22 Llama 4 Scout
44.3%
23 o3-mini
44.3%
24 Nova Lite
40%
25 Mistral Large
38.3%
26 Codestral
34%
27 Nova Micro
31%
28 gpt-oss-120b
30%
29 Magistral Medium
29.3%
30 Command R+
25.7%
31 Qwen 3 235B
21.3%
32 Llama 3.1 405B
20%
33 Command R
15.3%

Data sample

The trace below shows a sample conversation from the benchmark:
User
Could you guide me toward the correct six-digit NAICS identifier for this company?
Agent
I'll help you find the correct NAICS (North American Industry Classification System) code for the company. To do this effectively, I need some information about the company's primary business activities. What is the company's main business or industry?
User
Managing and holding securities of startups.
Agent
I'll help you find the correct NAICS code for a company that manages and holds securities of startups. Let me look at the available tables to find relevant information about NAICS codes.
TOOL CALL
list_tables({})
Tool :
["appetite_guide", "lobs", "naics", "naics_2012_2017", "naics_2022_2017", "naics_2digit", "small_business_definition", "states", "supply_chain_emissions_statistics"]
Agent
Now let me search for NAICS codes related to managing and holding securities or startups.
TOOL CALL

read_query({
  "query": "SELECT Code, Title, Description
    FROM naics
    WHERE Title LIKE '%securities%'
    OR Title LIKE '%holding%'
    OR Title LIKE '%investment%'
    OR Description LIKE '%holding%'
    OR Description LIKE '%startup%'
    ORDER BY Code"
})

Agent

FINAL ANSWER:
523910. The reason is that the company is primarily engaged in managing and holding securities of startups, which aligns with the NAICS code 523910 for “Miscellaneous Intermediation.”…

Dataset

The evaluation dataset is publicly available on Hugging Face. It covers 6 distinct task types related to
small business insurance underwriting, with multi-turn conversational traces grounded in realistic
underwriter workflows.

Methodology

metric
Overall accuracy via LLM-as-a-Judge (GPT 4.1), comparing the agent’s final answer against a programmatically generated reference.
judge agreement
94.5% agreement on a balanced random sample of 200 conversations, validated against human annotations in Snorkel Evaluate.
scope
All traces are scored, including those where agents failed to produce a final answer due to recursion errors or premature termination.
failure rate
Agent failures observed at <1% for closed-source models and 10–30%+ for open-source models.

Behind the benchmark

We built the system in LangGraph with Model Context Protocol (MCP) and ReAct Agents. We engaged with our network of Chartered Property Casualty Underwriters (CPCUs) to create crucial components of the system, with a diverse sample dataset covering 6 distinct types of tasks, all related to applications for insurance by small businesses. Many of the tasks include subtasks involving more nuanced, complex underwriting logic. In each conversation, the underwriter has one of these specific tasks to solve. The tasks require an average of 3–7 steps of reasoning and tool use, with a total of 10–20 conversational turns.

From the blog

Image for Building the benchmark: inside our agentic insurance underwriting dataset

Building the benchmark: inside our agentic insurance underwriting dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting....

Resources

Get notified when we launch a new benchmark

Share this benchmark

For models that need to be right. Not just good enough.