SnorkelFinance
Enterprises are increasingly integrating LLMs into their own ecosystems as agents to solve real-world business tasks. While the field moves rapidly toward LLM agent development and adoption, evaluating AI agents in dynamic interactions that involve complex tools and planning remains a challenge.
We present a new benchmark, SnorkelFinance, for evaluating how well AI agents perform as financial analysts. LLMs act as ReAct agents in our financial-agent simulation engine (built in LangGraph with the Model Context Protocol) and are tasked with answering financial questions created from 10-K filings. Without access to the source documents, agents must plan and use the provided tools, including SQL calls, code executors, and others, to answer the questions accurately.
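To make the agent setup more concrete, here is a minimal sketch of a tool-using loop in the ReAct style: pick an action, run a tool, read the observation, repeat. It is not the SnorkelFinance engine and does not use LangGraph or the Model Context Protocol. The `financials` table, the company `ExampleCo`, the `percent_change_tool` helper, and the `scripted_policy` function (a hard-coded stand-in for the LLM) are all hypothetical and exist only for illustration.

```python
import sqlite3


def make_demo_db():
    # Hypothetical table standing in for data extracted from 10-K filings.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE financials (company TEXT, year INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO financials VALUES (?, ?, ?)",
        [("ExampleCo", 2022, 200.0), ("ExampleCo", 2023, 250.0)],
    )
    return conn


def sql_tool(conn, query):
    """Run a query; return rows on success or an error string the agent can read."""
    try:
        return conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"


def percent_change_tool(old, new):
    """Percentage change from old to new, rounded to two decimals."""
    return round((new - old) / old * 100, 2)


def scripted_policy(history):
    """Stand-in for the LLM: choose the next action from the observations so far."""
    base = "SELECT year, {col} FROM financials WHERE company = 'ExampleCo' ORDER BY year"
    if not history:
        # The first attempt deliberately uses a column that does not exist.
        return ("sql", base.format(col="sales"))
    last_action, last_obs = history[-1]
    if isinstance(last_obs, str) and last_obs.startswith("ERROR"):
        # Correct course after a failed tool call.
        return ("sql", base.format(col="revenue"))
    if last_action[0] == "sql":
        (_, old), (_, new) = last_obs
        return ("percent_change", (old, new))
    return ("final", last_obs)


def run_agent(conn, policy, max_steps=5):
    """Alternate between choosing an action and observing the tool result."""
    history = []
    for _ in range(max_steps):
        action = policy(history)
        kind, arg = action
        if kind == "final":
            return arg, history
        if kind == "sql":
            obs = sql_tool(conn, arg)
        elif kind == "percent_change":
            obs = percent_change_tool(*arg)
        else:
            obs = f"ERROR: unknown tool {kind!r}"
        history.append((action, obs))
    return None, history  # step budget exhausted without a final answer


answer, trace = run_agent(make_demo_db(), scripted_policy)
```

In this toy run the first SQL call fails, the error text is fed back as an observation, and the policy retries with the correct column before computing the revenue change. A real agent would replace `scripted_policy` with a model call.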
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | gpt-5 | 81% |
| 2 | o3 | 81% |
| 3 | Gemini 3 Pro | 80.34% |
| 4 | Claude Opus 4.1 | 80.3% |
| 5 | gpt-5-mini | 79.3% |
| 6 | Claude Opus 4 | 78.3% |
| 7 | Claude Sonnet 3.7 | 77.9% |
| 8 | Claude Sonnet 4 | 76.6% |
| 9 | o4 mini | 76.6% |
| 10 | Grok 4 | 74.04% |
| 11 | Grok 4 Fast Reasoning | 73.45% |
| 12 | Kimi-K2-Thinking | 71.7% |
| 13 | gpt-oss-120b | 66.6% |
| 14 | Grok 3 | 65.86% |
| 15 | o3-mini | 63.79% |
| 16 | GPT-4.1 | 62.7% |
| 17 | Nova Premier | 62.06% |
| 18 | Gemini 2.5 Pro | 60.6% |
| 19 | Gemini 2.5 Flash | 53.1% |
| 20 | Qwen 3 235B | 51.37% |
| 21 | gpt-5-nano | 50% |
| 22 | NVIDIA Nemotron Super 49B v1.5 | 44% |
| 23 | Nova Pro | 40.34% |
| 24 | Codestral | 27.6% |
| 25 | Nova Lite | 16.89% |
| 26 | Magistral Medium | 16.2% |
| 27 | Nova Micro | 14.48% |
| 28 | Mistral Large | 13.4% |
Sample task
The latest version contains 290 high-quality QA pairs, sampled across 5 industry verticals. Here is a sample agentic trace on a financial query.
Methodology
Tasks range from basic financial data retrieval to multi-document analysis requiring multi-step reasoning chains.
Behind the benchmark
Snorkel’s network of financial experts carefully verifies our QA dataset, rating the realism and accuracy of the task data on a 5-point scale, and the dataset is curated for high realism and accuracy.
While closed-source models achieve high and similar performance on standard STEM benchmarks, results on SnorkelFinance show significant differences in agent performance across models, highlighting that more work is needed to make agents reliable in complex domains and tasks.
SnorkelFinance is the first benchmark, to our knowledge, to measure the performance of commonly used enterprise models on a financial agentic task. We observe several error modes: agents struggle to make complex SQL calls, hallucinate tool arguments, and fail to correct course when tool executions fail.

