Released April 04, 2026
Archived

SnorkelFinance

A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.
Overview

Enterprises are increasingly integrating LLMs into their own ecosystems as agents that solve real-world business tasks. While LLM agent development and adoption are moving rapidly, evaluating agents remains a challenge in dynamic interactions that involve complex tools and planning.

We present a new benchmark, SnorkelFinance, for evaluating the performance of AI agents acting as financial analysts. LLMs act as ReAct agents in our financial-agent simulation engine (built in LangGraph with Model Context Protocol) and are tasked with answering financial questions created from 10-K filings. Without access to the underlying filings, agents must plan and use the provided tools, including SQL calls, code executors, and others, to answer the questions accurately.
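The agent setup described above can be pictured as a plan, act, observe loop. The following is only a minimal sketch of a ReAct-style loop, not the benchmark's LangGraph engine: `run_agent`, `scripted_policy`, and the local `calculator` are hypothetical names introduced for illustration, and the scripted policy stands in for an LLM.

```python
import ast
import operator
from typing import Callable, Dict, List, Optional, Tuple

# Arithmetic operators the illustrative calculator tool accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Hypothetical calculator tool: evaluates + - * / on numeric literals only."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def scripted_policy(history: List[Tuple[str, str]]) -> tuple:
    """Stand-in for an LLM (assumption): call the calculator once, then answer."""
    observations = [text for role, text in history if role == "tool"]
    if not observations:
        return ("call", "calculator", {"expression": "3267/6775"})
    return ("final", observations[-1])


def run_agent(policy: Callable, tools: Dict[str, Callable[..., str]],
              question: str, max_steps: int = 8) -> Tuple[Optional[str], list]:
    """Alternate between policy decisions and tool observations until a final answer."""
    history: List[Tuple[str, str]] = [("user", question)]
    for _ in range(max_steps):
        action = policy(history)
        if action[0] == "final":
            return action[1], history
        _, name, args = action
        if name not in tools:
            observation = f"error: unknown tool {name!r}"
        else:
            try:
                observation = tools[name](**args)
            except Exception as exc:  # feed the failure back to the policy
                observation = f"error: {exc}"
        history.append(("tool_call", f"{name}({args})"))
        history.append(("tool", observation))
    return None, history  # step budget exhausted without a final answer


if __name__ == "__main__":
    answer, trace = run_agent(scripted_policy, {"calculator": calculator},
                              "What is 3267/6775?")
    print(answer)
```

Tool errors are returned to the policy as observations rather than raised, so an agent has the chance to change its plan on the next step.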

Leaderboard

Rank  Model                            Score
1     gpt-5                            81%
2     o3                               81%
3     Gemini 3 Pro                     80.34%
4     Claude Opus 4.1                  80.3%
5     gpt-5-mini                       79.3%
6     Claude Opus 4                    78.3%
7     Claude Sonnet 3.7                77.9%
8     Claude Sonnet 4                  76.6%
9     o4 mini                          76.6%
10    Grok 4                           74.04%
11    Grok 4 Fast Reasoning            73.45%
12    Kimi-K2-Thinking                 71.7%
13    gpt-oss-120b                     66.6%
14    Grok 3                           65.86%
15    o3-mini                          63.79%
16    GPT-4.1                          62.7%
17    Nova Premier                     62.06%
18    Gemini 2.5 Pro                   60.6%
19    Gemini 2.5 Flash                 53.1%
20    Qwen 3 235B                      51.37%
21    gpt-5-nano                       50%
22    NVIDIA Nemotron Super 49B v1.5   44%
23    Nova Pro                         40.34%
24    Codestral                        27.6%
25    Nova Lite                        16.89%
26    Magistral Medium                 16.2%
27    Nova Micro                       14.48%
28    Mistral Large                    13.4%

Sample task

The latest version contains 290 high-quality QA pairs, sampled over 5 industry verticals. Here is a sample of an agentic trace on a financial query.

User
For company Snorkel AI, what is the ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023?
Agent
To find the ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023, we first need to identify the table that contains this information. Let's start by calling get_table_names to list the tables for Snorkel AI.
TOOL CALL
get_table_names({
  "company_name": "Snorkel AI"
})
Tool :
['us_gaap_ScheduleOfShareBasedPaymentAwardStockOptionsValuationAssumptionsTableTextBlock', 'cb_ScheduleVITableTextBlock', 'us_gaap_ScheduleOfIntangibleAssetsAndGoodwillTableTextBlock', '...']
Agent
We can see that us_gaap_ScheduleOfIntangible... is the table that likely contains information on Intangible Assets and Goodwill. Let's get the schema for this table to confirm and find the relevant information.
TOOL CALL
get_table_schema({
  "company_name": "Snorkel AI",
  "table_name": "us_gaap_ScheduleOfIntangible..."
})
Tool :
{"description": "in_millions_of_u.s._dollars_as_of_december_31", "column_names": ["category","2024","2023"],...}
Tool :
...
Agent
The Total Intangible Assets Net Excluding Goodwill for 2023 was $6,775 million. The Finite-Lived Intangible Assets Net for 2023 was $3,267 million. The ratio is [ 3,267 / 6,775 ]. Let's calculate this.
TOOL CALL
calculator({
  "expression": "3267/6775"
})
Tool :
0.4822140221402214
Agent

FINAL ANSWER:

The ratio of Finite-Lived Intangible Assets Net to Total Intangible Assets Net Excluding Goodwill for 2023, when rounded to two decimal places, is approximately 0.48.
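The retrieval and arithmetic in this trace can be reproduced locally. The sketch below is an illustration under stated assumptions, not the benchmark's tooling: it uses an in-memory SQLite table with the column names shown in the schema ("category", "2024", "2023"), a hypothetical table name `intangible_assets` (the real name is truncated in the trace), and assumed category labels. Only the two 2023 values come from the trace; the 2024 values are not shown there and are left NULL.

```python
import sqlite3

# Category labels are assumptions modeled on the wording of the question.
FINITE = "Finite-Lived Intangible Assets Net"
TOTAL = "Total Intangible Assets Net Excluding Goodwill"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table holding the two 2023 figures from the trace."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE intangible_assets (category TEXT, "2024" REAL, "2023" REAL)'
    )
    conn.executemany(
        "INSERT INTO intangible_assets VALUES (?, ?, ?)",
        [
            (FINITE, None, 3267.0),  # in millions of U.S. dollars
            (TOTAL, None, 6775.0),   # 2024 values are not given in the trace
        ],
    )
    return conn


def intangibles_ratio(conn: sqlite3.Connection, year: str = "2023") -> float:
    """Return finite-lived intangibles divided by total intangibles for a year."""
    if year not in {"2023", "2024"}:
        raise ValueError(f"unsupported year: {year}")
    # The year is checked against a fixed set above, so quoting it as a
    # column identifier is safe; the category values are bound parameters.
    query = (
        f'SELECT category, "{year}" FROM intangible_assets '
        "WHERE category IN (?, ?)"
    )
    values = dict(conn.execute(query, (FINITE, TOTAL)).fetchall())
    return values[FINITE] / values[TOTAL]


if __name__ == "__main__":
    print(round(intangibles_ratio(build_demo_db()), 2))
```

Rounded to two decimal places, the result matches the 0.48 reported in the final answer.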

Methodology

JUDGE
Claude Sonnet 3.7 evaluates full execution traces, scoring both final outputs and intermediate reasoning steps.
DIMENSIONS
SQL query correctness, financial calculation accuracy, tool selection appropriateness, and overall task completion.
VALIDATION
Automated scoring calibrated against expert human annotations on a representative sample, achieving high inter-annotator agreement.
TASK COMPLEXITY
Ranges from basic financial data retrieval to multi-document analysis requiring multi-step reasoning chains.

Behind the benchmark

Snorkel’s network of financial experts verifies our QA dataset, rating the realism and accuracy of the task data on a 5-point scale, and the dataset is curated for high ratings on both.

While closed-source models perform at a high but similar level on standard STEM benchmarks, results on SnorkelFinance show significant differences in agent performance across models, highlighting that more work is needed to make agents reliable in complex domains and tasks.

SnorkelFinance is the first benchmark, to our knowledge, to measure the performance of commonly used enterprise models on a financial agentic task. We observe several error modes: agents struggle to make complex SQL calls, hallucinate tool arguments, and fail to correct course when tool executions fail.
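Two of these error modes, hallucinated tool arguments and unhandled tool failures, can be made visible to an agent by wrapping each tool call. The following is a hedged sketch rather than anything the benchmark is known to do: `check_arguments`, `safe_call`, and `get_table_schema_stub` are hypothetical helpers, the parameter names are taken from the sample trace, and the `ticker` argument in the test case below it is an invented example of a hallucinated parameter.

```python
from typing import Any, Callable, Dict, List, Set

# Parameter names taken from the get_table_schema call in the sample trace.
ALLOWED: Set[str] = {"company_name", "table_name"}
REQUIRED: Set[str] = {"company_name", "table_name"}


def check_arguments(args: Dict[str, Any], allowed: Set[str],
                    required: Set[str]) -> List[str]:
    """Return a list of problems with the argument names (empty if none)."""
    problems = []
    unknown = sorted(set(args) - allowed)
    missing = sorted(required - set(args))
    if unknown:
        problems.append(f"unknown arguments: {unknown}")
    if missing:
        problems.append(f"missing arguments: {missing}")
    return problems


def safe_call(tool: Callable[..., Any], args: Dict[str, Any],
              allowed: Set[str], required: Set[str]) -> Dict[str, Any]:
    """Validate argument names, run the tool, and report failures as data."""
    problems = check_arguments(args, allowed, required)
    if problems:
        return {"ok": False, "error": "; ".join(problems)}
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:
        # Returning the error lets the agent see it and try another plan.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def get_table_schema_stub(company_name: str, table_name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the real tool; knows a single demo table."""
    if table_name != "demo_table":
        raise KeyError(f"no such table: {table_name}")
    return {"column_names": ["category", "2024", "2023"]}


if __name__ == "__main__":
    hallucinated = {"company_name": "Snorkel AI", "ticker": "SNKL"}
    print(safe_call(get_table_schema_stub, hallucinated, ALLOWED, REQUIRED))
```

Both kinds of failure come back as an explicit error message instead of an exception, which is the signal an agent would need in order to revise its next call.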
