Archived

SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.

Overview

To date, most agentic and reasoning benchmarks have centered on tasks in STEM domains. In the business world, however, many problems call for domain-specific reasoning and behavior in messier ecosystems, where an agent must combine logic over metadata with multiple tools. To address this gap, we have developed a set of tasks that challenges AI agents in realistic ways, using a complex enterprise domain: commercial insurance underwriting.

The SnorkelUnderwrite benchmark is multi-turn: AI agents must interact effectively with underwriters to help them solve their tasks, not only by reasoning over tools but also by asking the underwriters informative questions.

Leaderboard

Rank	Model	Score
1	GPT-5.4	91%
2	Claude Opus 4.1	86.3%
3	gpt-5	83.33%
4	Grok 4	83.3%
5	Grok 4 Fast Reasoning	81.33%
6	Grok 3	78%
7	o4 mini	78%
8	Claude Opus 4	77%
9	o3	77%
10	Claude Sonnet 3.7	74.6%
11	Claude Sonnet 4	72.3%
12	gpt-5-mini	71.67%
13	Kimi-K2-Thinking	71.3%
14	GPT-4.1	70.6%
15	Gemini 2.5 Flash	61%
16	Nova Premier	57%
17	Gemini 2.5 Pro	56.3%
18	Nova Pro	52.3%
19	gpt-5-nano	47%
20	Llama 3.3 70B	46.3%
21	Llama 4 Maverick	46.3%
22	Llama 4 Scout	44.3%
23	o3-mini	44.3%
24	Nova Lite	40%
25	Mistral Large	38.3%
26	Codestral	34%
27	Nova Micro	31%
28	gpt-oss-120b	30%
29	Magistral Medium	29.3%
30	Command R+	25.7%
31	Qwen 3 235B	21.3%
32	Llama 3.1 405B	20%
33	Command R	15.3%

Data sample

The trace below shows a sample conversation from the benchmark:

User

Could you guide me toward the correct six-digit NAICS identifier for this company?

Agent

I'll help you find the correct NAICS (North American Industry Classification System) code for the company. To do this effectively, I need some information about the company's primary business activities. What is the company's main business or industry?

User

Managing and holding securities of startups.

Agent

I'll help you find the correct NAICS code for a company that manages and holds securities of startups. Let me look at the available tables to find relevant information about NAICS codes.

TOOL CALL

list_tables({})

Tool

["appetite_guide", "lobs", "naics", "naics_2012_2017", "naics_2022_2017", "naics_2digit", "small_business_definition", "states", "supply_chain_emissions_statistics"]

Agent

Now let me search for NAICS codes related to managing and holding securities or startups.

TOOL CALL

read_query({
  "query": "SELECT Code, Title, Description
    FROM naics
    WHERE Title LIKE '%securities%'
    OR Title LIKE '%holding%'
    OR Title LIKE '%investment%'
    OR Description LIKE '%holding%'
    OR Description LIKE '%startup%'
    ORDER BY Code"
})

Agent

FINAL ANSWER:
523910. The reason is that the company is primarily engaged in managing and holding securities of startups, which aligns with the NAICS code 523910 for “Miscellaneous Intermediation.”…
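
The lookup step in this trace can be reproduced locally as a rough sketch. The example below assumes an in-memory SQLite database and a handful of toy rows with paraphrased descriptions; neither the rows nor the Python `read_query` helper come from the benchmark, they only mirror the names seen in the trace.

```python
import sqlite3

# Toy stand-in for the benchmark's `naics` table.
# Rows and descriptions are illustrative assumptions, not benchmark data.
ROWS = [
    ("523910", "Miscellaneous Intermediation",
     "Acting as principal in financial contracts; includes venture capital firms funding startups."),
    ("523920", "Portfolio Management",
     "Managing portfolio assets of others on a fee or commission basis."),
    ("551112", "Offices of Other Holding Companies",
     "Owning a controlling interest in other companies."),
    ("722511", "Full-Service Restaurants",
     "Providing food services to patrons who order while seated."),
]

# Same keyword filter as the tool call in the trace above.
QUERY = """
SELECT Code, Title, Description
FROM naics
WHERE Title LIKE '%securities%'
   OR Title LIKE '%holding%'
   OR Title LIKE '%investment%'
   OR Description LIKE '%holding%'
   OR Description LIKE '%startup%'
ORDER BY Code
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding the toy NAICS rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE naics (Code TEXT PRIMARY KEY, Title TEXT, Description TEXT)"
    )
    conn.executemany("INSERT INTO naics VALUES (?, ?, ?)", ROWS)
    return conn


def read_query(conn: sqlite3.Connection, query: str) -> list:
    """Hypothetical local counterpart of the trace's read_query tool."""
    return conn.execute(query).fetchall()


conn = build_db()
candidates = read_query(conn, QUERY)
codes = [code for code, _title, _description in candidates]
print(codes)
```

A keyword filter like this only narrows the candidates; on these toy rows it returns two codes, and the agent still has to reason about which one fits the underwriter's description.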

Dataset

The evaluation dataset is publicly available on Hugging Face. It covers 6 distinct task types related to small business insurance underwriting, with multi-turn conversational traces grounded in realistic underwriter workflows.


Methodology

Metric

Overall accuracy, scored by an LLM-as-a-judge (GPT-4.1) that compares the agent’s final answer against a programmatically generated reference.

Judge agreement

94.5% agreement on a balanced random sample of 200 conversations, validated against human annotations in Snorkel Evaluate.
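
As a quick check on the arithmetic, 94.5% agreement on 200 conversations corresponds to 189 matching verdicts and 11 disagreements. The sketch below assumes that "balanced" means an even split between correct and incorrect human labels and places the 11 disagreements arbitrarily; both are assumptions for illustration, and `agreement_rate` is a hypothetical helper.

```python
def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of items where the judge verdict equals the human annotation."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Assumed balanced sample: 100 answers humans marked correct, 100 marked incorrect.
human = [True] * 100 + [False] * 100

# Start from perfect agreement, then flip 11 judge verdicts (arbitrary positions).
judge = list(human)
for i in list(range(6)) + list(range(100, 105)):
    judge[i] = not judge[i]

rate = agreement_rate(judge, human)
print(f"{rate:.1%}")
```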

Scope

All traces are scored, including those where agents failed to produce a final answer due to recursion errors or premature termination.

Failure rate

Agent failures occurred at rates below 1% for closed-source models and 10–30% or more for open-source models.
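
One way to read the scope and failure-rate rules together is sketched below. It assumes that a trace without a final answer stays in the denominator and counts as incorrect; the `Trace` record and both helper functions are hypothetical names, not part of the benchmark's tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    trace_id: str
    final_answer: Optional[str]     # None if the agent crashed or stopped early
    judged_correct: Optional[bool]  # judge verdict; None when there is nothing to judge


def overall_accuracy(traces: List[Trace]) -> float:
    """Accuracy over ALL traces; failed traces count as incorrect (assumption)."""
    if not traces:
        raise ValueError("no traces to score")
    correct = sum(
        1 for t in traces if t.final_answer is not None and t.judged_correct
    )
    return correct / len(traces)


def failure_rate(traces: List[Trace]) -> float:
    """Share of traces in which the agent produced no final answer."""
    if not traces:
        raise ValueError("no traces to score")
    return sum(t.final_answer is None for t in traces) / len(traces)


# Made-up example: two correct answers, one wrong answer, one agent failure.
traces = [
    Trace("t1", "523910", True),
    Trace("t2", "551112", True),
    Trace("t3", "722511", False),
    Trace("t4", None, None),
]
accuracy = overall_accuracy(traces)
failures = failure_rate(traces)
print(accuracy, failures)
```

Under this reading, a model with a high failure rate is penalized twice over in practice: it loses the failed traces outright and has fewer attempts that can be judged correct.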

Behind the benchmark

We built the system in LangGraph with Model Context Protocol (MCP) and ReAct agents. We worked with our network of Chartered Property Casualty Underwriters (CPCUs) to create crucial components of the system, with a diverse sample dataset covering 6 distinct types of tasks, all related to insurance applications from small businesses. Many of the tasks include subtasks involving more nuanced, complex underwriting logic. In each conversation, the underwriter has one of these specific tasks to solve. The tasks typically require 3–7 steps of reasoning and tool use, over a total of 10–20 conversational turns.
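
To make the interaction pattern concrete, here is a library-free sketch of a ReAct-style loop in which the agent alternates between asking the underwriter a question, calling a tool, and giving a final answer. It does not use LangGraph or MCP; the scripted policy stands in for the model, and the tool stubs, role names, and step cap are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (role, content) pairs


# Hypothetical tool stubs; the names mirror the sample trace, the outputs are made up.
def list_tables(_args: dict) -> str:
    return '["naics", "lobs"]'


def read_query(_args: dict) -> str:
    return '[["523910", "Miscellaneous Intermediation"]]'


TOOLS: Dict[str, Callable[[dict], str]] = {
    "list_tables": list_tables,
    "read_query": read_query,
}


def scripted_policy(history: History) -> tuple:
    """Stand-in for the LLM: picks the next action from what has happened so far."""
    asked = any(role == "agent_question" for role, _ in history)
    tool_calls = sum(1 for role, _ in history if role == "tool")
    if not asked:
        return ("ask", "What is the company's main business?")
    if tool_calls == 0:
        return ("tool", ("list_tables", {}))
    if tool_calls == 1:
        return ("tool", ("read_query", {"query": "SELECT Code, Title FROM naics"}))
    return ("final", "523910")


def run_agent(policy, user_reply: Callable[[str], str], max_steps: int = 10) -> History:
    """Loop until the policy gives a final answer or the step cap is reached."""
    history: History = [("user", "Which NAICS code fits this company?")]
    for _ in range(max_steps):
        kind, payload = policy(history)
        if kind == "final":
            history.append(("agent_final", payload))
            return history
        if kind == "ask":
            history.append(("agent_question", payload))
            history.append(("user", user_reply(payload)))
        elif kind == "tool":
            name, args = payload
            history.append(("tool", TOOLS[name](args)))
    # No final answer within the cap: recorded as a failed trace.
    history.append(("error", "step limit reached"))
    return history


history = run_agent(scripted_policy, lambda _q: "Managing and holding securities of startups.")
roles = [role for role, _ in history]
print(roles)
```

The step cap is the part most relevant to scoring: a run that exhausts it ends without a final answer, which is the kind of agent failure the methodology still counts.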