Archived
SnorkelUnderwrite
An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
Overview
To date, most agentic and reasoning benchmarks have centered on tasks in STEM domains. In the business world, however, many problems call for domain-specific reasoning and behavior in messier ecosystems, where an agent must apply logic over metadata and combinations of tools. To address this gap, we developed a set of tasks that challenges AI agents in realistic ways within a complex enterprise domain: commercial insurance underwriting.
The SnorkelUnderwrite benchmark is multi-turn: AI agents must interact effectively with underwriters to help them solve their tasks, not only by reasoning over tools but also by asking the underwriters informative questions.
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 91% |
| 2 | Claude Opus 4.1 | 86.3% |
| 3 | gpt-5 | 83.33% |
| 4 | Grok 4 | 83.3% |
| 5 | Grok 4 Fast Reasoning | 81.33% |
| 6 | Grok 3 | 78% |
| 7 | o4 mini | 78% |
| 8 | Claude Opus 4 | 77% |
| 9 | o3 | 77% |
| 10 | Claude Sonnet 3.7 | 74.6% |
| 11 | Claude Sonnet 4 | 72.3% |
| 12 | gpt-5-mini | 71.67% |
| 13 | Kimi-K2-Thinking | 71.3% |
| 14 | GPT-4.1 | 70.6% |
| 15 | Gemini 2.5 Flash | 61% |
| 16 | Nova Premier | 57% |
| 17 | Gemini 2.5 Pro | 56.3% |
| 18 | Nova Pro | 52.3% |
| 19 | gpt-5-nano | 47% |
| 20 | Llama 3.3 70B | 46.3% |
| 21 | Llama 4 Maverick | 46.3% |
| 22 | Llama 4 Scout | 44.3% |
| 23 | o3-mini | 44.3% |
| 24 | Nova Lite | 40% |
| 25 | Mistral Large | 38.3% |
| 26 | Codestral | 34% |
| 27 | Nova Micro | 31% |
| 28 | gpt-oss-120b | 30% |
| 29 | Magistral Medium | 29.3% |
| 30 | Command R+ | 25.7% |
| 31 | Qwen 3 235B | 21.3% |
| 32 | Llama 3.1 405B | 20% |
| 33 | Command R | 15.3% |
Data sample
[Sample conversation trace from the benchmark; the interactive trace viewer is not reproduced in this text.]
Dataset
The evaluation dataset is publicly available on Hugging Face. It covers 6 distinct task types related to
small business insurance underwriting, with multi-turn conversational traces grounded in realistic
underwriter workflows.
Methodology
Metric
Overall accuracy via LLM-as-a-Judge (GPT-4.1), comparing the agent’s final answer against a programmatically generated reference.
Judge agreement
The judge reached 94.5% agreement with human annotations in Snorkel Evaluate on a balanced random sample of 200 conversations.
Scope
All traces are scored, including those where agents failed to produce a final answer due to recursion errors or premature termination.
Failure rate
Agent failures were observed at under 1% for closed-source models and at 10–30% or more for open-source models.
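The scoring rules above can be illustrated with a minimal sketch. It assumes that a trace with no final answer stays in the denominator and counts as incorrect, which is one reading of "all traces are scored". The `Trace` record, its field names, and the helper functions are hypothetical and do not reflect the benchmark's actual schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    # Hypothetical record layout, not the benchmark's real schema.
    final_answer: Optional[str]   # None if the agent crashed or stopped early
    judge_says_correct: bool      # verdict returned by the LLM judge


def overall_accuracy(traces: List[Trace]) -> float:
    """Share of ALL traces judged correct; failed traces count as incorrect."""
    if not traces:
        raise ValueError("no traces to score")
    correct = sum(
        1 for t in traces if t.final_answer is not None and t.judge_says_correct
    )
    return correct / len(traces)


def failure_rate(traces: List[Trace]) -> float:
    """Share of traces in which the agent produced no final answer."""
    if not traces:
        raise ValueError("no traces to score")
    return sum(1 for t in traces if t.final_answer is None) / len(traces)


def agreement_rate(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Share of conversations where the judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy data for illustration only; these are not benchmark results.
traces = [
    Trace("In appetite", True),
    Trace("Out of appetite", False),
    Trace(None, False),           # recursion error: still scored
    Trace("In appetite", True),
]
accuracy = overall_accuracy(traces)
failures = failure_rate(traces)
agreement = agreement_rate([True, False, True, True], [True, False, False, True])
```

Keeping failed traces in the denominator means a model with a high failure rate is penalized in its headline score rather than being evaluated only on the conversations it completed.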
Behind the benchmark
We built the system in LangGraph with Model Context Protocol (MCP) and ReAct agents. We worked with our network of Chartered Property Casualty Underwriters (CPCUs) to create crucial components of the system, including a diverse sample dataset covering 6 distinct types of tasks, all related to insurance applications from small businesses. Many of the tasks include subtasks involving more nuanced, complex underwriting logic. In each conversation, the underwriter has one of these specific tasks to solve. Tasks typically require 3–7 steps of reasoning and tool use, spread over a total of 10–20 conversational turns.
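As a rough illustration of the agent pattern described above, here is a plain-Python sketch of a ReAct-style loop in which each step either calls a tool, asks the underwriter a question, or returns a final answer. It does not use LangGraph or MCP and is not the benchmark's implementation. The action names, the `appetite_lookup` tool, the scripted policy standing in for a model call, and the step budget are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action kinds a single agent step can produce.
TOOL, ASK, FINAL = "tool", "ask", "final"

# (kind, tool name or text, tool argument)
Action = Tuple[str, str, str]


class RecursionLimitError(RuntimeError):
    """Raised when the step budget runs out before a final answer."""


def run_agent(
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    ask_underwriter: Callable[[str], str],
    max_steps: int = 10,
) -> str:
    """Alternate between acting and observing until a final answer is produced."""
    history: List[str] = []
    for _ in range(max_steps):
        kind, name, arg = policy(history)
        if kind == FINAL:
            return name
        if kind == TOOL:
            observation = tools[name](arg)
            history.append(f"tool:{name}({arg}) -> {observation}")
        elif kind == ASK:
            reply = ask_underwriter(name)
            history.append(f"ask:{name} -> {reply}")
        else:
            raise ValueError(f"unknown action kind: {kind}")
    raise RecursionLimitError(f"no final answer after {max_steps} steps")


def scripted_policy(history: List[str]) -> Action:
    # Stand-in for a model call: picks the next action by how many steps are done.
    script: List[Action] = [
        (ASK, "What type of business is applying?", ""),
        (TOOL, "appetite_lookup", "bakery"),
        (FINAL, "In appetite", ""),
    ]
    return script[len(history)]


# Hypothetical tool and underwriter reply, for illustration only.
tools = {"appetite_lookup": lambda business: f"{business}: in appetite"}
answer = run_agent(scripted_policy, tools, lambda question: "A bakery")
print(answer)
```

The explicit step budget mirrors the failure mode noted in the methodology: an agent that keeps calling tools or asking questions without converging ends in a recursion error instead of a final answer.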




