SnorkelFinance

A benchmark of expert-verified QA pairs built from financial reports, designed to evaluate AI agents' tool-calling and reasoning capabilities.

Overview

Enterprises are increasingly integrating LLMs into their own ecosystems as agents to solve real-world business tasks. As the field moves rapidly toward LLM agent development and adoption, evaluating AI agents remains a challenge, particularly in dynamic interactions that involve complex tools and planning.

We present SnorkelFinance, a new benchmark for evaluating the performance of AI agents acting as financial analysts. LLMs act as ReAct agents in our financial-agent simulation engine (built in LangGraph with the Model Context Protocol) and are tasked with answering financial questions derived from 10-K filings. Without direct access to the filings, agents must plan and use the provided tools, including SQL queries and code executors, to answer the questions accurately.
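For illustration, the sketch below shows how such a ReAct loop with a SQL tool and a toy code executor might be wired using LangGraph's prebuilt helper. The database path, tool behavior, model id, and question are hypothetical, and the real simulation engine exposes its tools over the Model Context Protocol rather than as in-process functions.

    # Minimal sketch of a ReAct-style financial agent with a SQL tool and a toy code executor.
    # Hypothetical example: database path, question, and tool details are illustrative only.
    import sqlite3

    from langchain_anthropic import ChatAnthropic
    from langchain_core.tools import tool
    from langgraph.prebuilt import create_react_agent


    @tool
    def run_sql(query: str) -> str:
        """Run a read-only SQL query against the 10-K financials database."""
        with sqlite3.connect("filings_10k.db") as conn:  # hypothetical database file
            rows = conn.execute(query).fetchall()
        return str(rows)


    @tool
    def run_python(code: str) -> str:
        """Execute a short Python snippet for financial calculations."""
        scope: dict = {}
        exec(code, scope)  # toy executor; a real engine would sandbox this
        return str(scope.get("result", "no `result` variable set"))


    model = ChatAnthropic(model="claude-3-7-sonnet-20250219")  # assumed model id
    agent = create_react_agent(model, tools=[run_sql, run_python])

    result = agent.invoke(
        {"messages": [{"role": "user",
                       "content": "What was year-over-year revenue growth for FY2023?"}]}
    )
    print(result["messages"][-1].content)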

Our QA dataset is verified by Snorkel's network of financial experts, who rate each item's realism and accuracy on a 5-point scale; the dataset is then curated to retain only items rated highly on both.

While closed-source models score high, and similarly to one another, on standard STEM benchmarks, results on SnorkelFinance show significant differences in agent performance across models, highlighting that more work is needed to make agents reliable in complex domains and tasks.

To our knowledge, SnorkelFinance is the first benchmark to measure the performance of commonly used enterprise models on a financial agentic task. We observe several error modes: agents struggling to construct complex SQL queries, hallucinating tool arguments, and failing to correct course when tool executions fail.

Model Comparison

[Interactive model comparison chart]

Data Sample

The latest version contains 290 high-quality QA pairs sampled across 5 industry verticals. Below is a sample agentic trace on a financial query.

Graph reasoning example

Evaluation Methodology

We evaluate performance using Claude 3.7 Sonnet as our LLM-based evaluation system to assess execution traces generated by AI agents on the SnorkelFinance benchmark. The evaluation process examines both the final outputs and the intermediate reasoning steps, focusing on tool-usage accuracy, logical consistency, and task completion.
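As a rough sketch of what one judge call over a recorded trace might look like with the Anthropic Python SDK: the rubric wording and the JSON verdict schema below are illustrative assumptions, not the benchmark's actual evaluation prompt.

    # Sketch of scoring one execution trace with an LLM judge (Claude 3.7 Sonnet).
    # The rubric wording and JSON verdict schema are illustrative assumptions.
    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    JUDGE_PROMPT = """You are grading a financial-analyst agent.
    Question: {question}
    Reference answer: {reference}
    Agent trace (tool calls, tool results, final answer):
    {trace}

    Return JSON with keys sql_correct, calculation_correct, tools_appropriate,
    task_success (all booleans) and a short rationale."""


    def judge_trace(question: str, reference: str, trace: str) -> dict:
        response = client.messages.create(
            model="claude-3-7-sonnet-20250219",  # assumed model id for Claude 3.7 Sonnet
            max_tokens=512,
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(question=question,
                                                      reference=reference,
                                                      trace=trace)}],
        )
        # Assumes the judge returns bare JSON; production code would parse more defensively.
        return json.loads(response.content[0].text)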

The evaluation framework assesses agents across multiple dimensions: correctness of SQL queries, accuracy of financial calculations, appropriate tool selection, and overall task success. Each trace is scored based on whether the agent successfully navigates the multi-step reasoning process and arrives at accurate financial conclusions.
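The per-trace verdict can be kept in a small structured record; the field names below simply mirror the dimensions listed above and are an assumed schema rather than the benchmark's actual output format.

    # Assumed per-trace score schema mirroring the evaluation dimensions above.
    from dataclasses import dataclass


    @dataclass
    class TraceScore:
        sql_correct: bool          # SQL queries retrieved the right data
        calculation_correct: bool  # financial arithmetic is accurate
        tools_appropriate: bool    # the agent picked sensible tools at each step
        task_success: bool         # final answer matches the reference


    def success_rate(scores: list[TraceScore]) -> float:
        """Fraction of traces where the agent completed the task end to end."""
        return sum(s.task_success for s in scores) / len(scores) if scores else 0.0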

To ensure evaluation reliability, we validated our automated scoring system against expert human annotations on a representative sample, achieving high inter-annotator agreement. The benchmark includes various task complexities, from basic financial data retrieval to complex multi-document analysis requiring sophisticated reasoning chains.
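One plausible way to quantify that agreement on a validation sample is Cohen's kappa between the judge's verdicts and the expert labels; the statistic and the labels below are illustrative, since the exact validation protocol is not detailed here.

    # One plausible agreement check between judge verdicts and expert labels
    # (the text above does not specify the exact statistic used).
    from sklearn.metrics import cohen_kappa_score

    judge_labels = [1, 1, 0, 1, 0, 1]   # hypothetical task_success verdicts from the LLM judge
    expert_labels = [1, 1, 0, 1, 1, 1]  # hypothetical expert annotations on the same traces

    kappa = cohen_kappa_score(judge_labels, expert_labels)
    print(f"Cohen's kappa: {kappa:.2f}")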

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.