Archived

SnorkelFinance

A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.

Overview

Enterprises are increasingly integrating LLMs into their own ecosystems as agents to solve real-world business tasks. While the field moves rapidly toward LLM agent development and adoption, evaluating agents in dynamic interactions that involve complex tools and planning remains a challenge.

We present SnorkelFinance, a new benchmark for evaluating how well AI agents perform as financial analysts. LLMs act as ReAct agents in our financial-agent simulation engine (built in LangGraph with the Model Context Protocol) and are tasked with answering financial questions created from 10-K filings. Without access to the source documents, agents must plan and use the provided tools, including SQL calls and code executors, among others, to answer the financial questions accurately.

Leaderboard

Rank	Model	Score
1	gpt-5	81%
2	o3	81%
3	Gemini 3 Pro	80.34%
4	Claude Opus 4.1	80.3%
5	gpt-5-mini	79.3%
6	Claude Opus 4	78.3%
7	Claude Sonnet 3.7	77.9%
8	Claude Sonnet 4	76.6%
9	o4 mini	76.6%
10	Grok 4	74.04%
11	Grok 4 Fast Reasoning	73.45%
12	Kimi-K2-Thinking	71.7%
13	gpt-oss-120b	66.6%
14	Grok 3	65.86%
15	o3-mini	63.79%
16	GPT-4.1	62.7%
17	Nova Premier	62.06%
18	Gemini 2.5 Pro	60.6%
19	Gemini 2.5 Flash	53.1%
20	Qwen 3 235B	51.37%
21	gpt-5-nano	50%
22	NVIDIA Nemotron Super 49B v1.5	44%
23	Nova Pro	40.34%
24	Codestral	27.6%
25	Nova Lite	16.89%
26	Magistral Medium	16.2%
27	Nova Micro	14.48%
28	Mistral Large	13.4%

Sample task

The latest version contains 290 high-quality QA pairs, sampled over 5 industry verticals. Here is a sample of an agentic trace on a financial query.

User

For company Snorkel AI, what is the ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023?

Agent

To find the ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023, we first need to identify the table that contains this information. Let's start by calling get_table_names to list the tables for Snorkel AI.

TOOL CALL

get_table_names({
"company_name": "Snorkel AI"
})

Tool :

['us_gaap_ScheduleOfShareBasedPaymentAwardStockOptionsValuationAssumptionsTableTextBlock', 'cb_ScheduleVITableTextBlock', 'us_gaap_ScheduleOfIntangibleAssetsAndGoodwillTableTextBlock', '...']

Agent

We can see that us_gaap_ScheduleOfIntangible... is the table that likely contains information on Intangible Assets and Goodwill. Let's get the schema for this table to confirm and find the relevant information.

TOOL CALL

{
"function_call": {
"name": "get_table_schema",
"args": {
"company_name": "Snorkel AI",
"table_name": "us_gaap_ScheduleOfIntangible..."
}
}
}

Tool :

{"description": "in_millions_of_u.s._dollars_as_of_december_31", "column_names": ["category","2024","2023"],...}

Tool :

...

Agent

The Total Intangible Assets Net Excluding Goodwill for 2023 was $6,775 million. The Finite-Lived Intangible Assets Net for 2023 was $3,267 million. The ratio is [ 3,267 / 6,775 ]. Let's calculate this.

TOOL CALL

{
"function_call": {
"name": "calculator",
"args": {
"expression": "3267/6775"
}
}
}

Tool :

0.4822140221402214

Agent

FINAL ANSWER:

The ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023, when rounded to two decimal places, is approximately 0.48.
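
The final calculator step in the trace above can be reproduced with a small sketch. It assumes the tool-call shape shown in the trace (a `function_call` object with `name` and `args`); the `dispatch` helper, the `TOOLS` registry, and this particular `calculator` implementation are hypothetical stand-ins for illustration, not the benchmark's actual engine.

```python
import ast
import operator

# Arithmetic operators the hypothetical calculator tool accepts.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"calculator": calculator}


def dispatch(message: dict):
    """Run one tool call given in the trace's function_call shape."""
    call = message["function_call"]
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["args"])


ratio = dispatch(
    {"function_call": {"name": "calculator", "args": {"expression": "3267/6775"}}}
)
print(round(ratio, 2))
```

Parsing the expression with `ast` instead of passing it to `eval()` keeps the tool limited to arithmetic, which matters when the expression string is written by a model.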

Methodology

JUDGE

Claude Sonnet 3.7 evaluates full execution traces, scoring both final outputs and intermediate reasoning steps.

DIMENSIONS

SQL query correctness, financial calculation accuracy, tool selection appropriateness, and overall task completion.

VALIDATION

Automated scoring calibrated against expert human annotations on a representative sample, achieving high inter-annotator agreement.
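
The page does not say which agreement statistic was used for this calibration. As one common option, the sketch below computes Cohen's kappa between an automated judge's pass/fail labels and a human annotator's labels. The label lists are toy values made up for illustration, not benchmark data, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently at their own marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (1 = trace judged correct, 0 = judged incorrect).
judge_labels = [1, 1, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, so it is more informative than a plain match rate when most traces fall into one class.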

TASK COMPLEXITY

Ranges from basic financial data retrieval to multi-document analysis requiring multi-step reasoning chains.

Behind the benchmark

Our QA dataset is verified by Snorkel’s network of financial experts, who rate the realism and accuracy of the task data on a 5-point scale, and it is curated to retain high realism and accuracy.

While closed-source models achieve high and broadly similar performance on standard STEM benchmarks, results on SnorkelFinance show significant differences in agent performance across models, highlighting that more work is needed to make agents reliable in complex domains and tasks.

SnorkelFinance is, to our knowledge, the first benchmark to measure the performance of commonly used enterprise models on a financial agentic task. We observe several error modes: agents struggle to make complex SQL calls, hallucinate tool arguments, and fail to correct course when tool executions fail.
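
One way to surface hallucinated tool arguments is to check each call against the tool's expected argument names before executing it, then return the problems to the agent so it has a chance to correct course. The sketch below is illustrative only: the argument names are taken from the sample trace, while the `TOOL_ARGS` table and `validate_call` helper are hypothetical and not part of the benchmark's engine.

```python
# Expected argument names per tool, as they appear in the sample trace.
TOOL_ARGS = {
    "get_table_names": {"company_name"},
    "get_table_schema": {"company_name", "table_name"},
    "calculator": {"expression"},
}


def validate_call(name: str, args: dict) -> list:
    """Return a list of problems with a tool call; empty means it looks valid."""
    if name not in TOOL_ARGS:
        return [f"unknown tool: {name}"]
    expected = TOOL_ARGS[name]
    provided = set(args)
    problems = []
    for missing in sorted(expected - provided):
        problems.append(f"missing argument: {missing}")
    for extra in sorted(provided - expected):
        problems.append(f"unexpected argument: {extra}")
    return problems


# A call with a made-up argument name, as a model might hallucinate it.
issues = validate_call(
    "get_table_schema", {"company_name": "Snorkel AI", "table": "example_table"}
)
for issue in issues:
    print(issue)
```

Returning the problems as text, rather than raising, lets a harness feed them back to the model as a tool message.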
