Finance Reasoning
Overview
This benchmark improves on Snorkel Finance, which tested agents on tool calling for financial queries but whose queries required only limited reasoning to answer.
With the Financial Reasoning dataset, our aim was to create question-answer pairs that require models to reason in order to answer correctly. An example financial query is "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"
As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can answer high-level questions grounded in information from 10-K filings. To do this, we converted the tables in the 10-K documents into a relational database. Agents therefore need to reason about what information is required to answer the question, use the database tools to look up the correct tables, make accurate SQL calls (often several in succession), and combine the results into a final response.
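To make this setup concrete, the sketch below shows the general shape of such database tools, exposed to an agent via LangChain's tool decorator over a SQLite database. The database path, schema, and tool names here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of the kind of database tools the agent is given.
# The path and schema are hypothetical, for illustration only.
import sqlite3
from langchain_core.tools import tool

DB_PATH = "10k_filings.db"  # hypothetical path to the relational 10-K database


@tool
def list_tables() -> str:
    """List the tables extracted from the 10-K filings."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return ", ".join(name for (name,) in rows)


@tool
def run_sql(query: str) -> str:
    """Run a SQL query against the 10-K database and return the result rows."""
    with sqlite3.connect(DB_PATH) as conn:
        try:
            rows = conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            return f"SQL error: {exc}"
    return "\n".join(str(row) for row in rows)
```

An agent answering the AT&T example above would typically chain several such calls: first listing tables, then querying postretirement benefit obligations and interest expense, and finally combining the results into a written analysis.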
Our question-answer pairs were carefully co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure they are high quality: representative of real-world financial analyst questions, accurate, and requiring sufficient reasoning to answer. This is a challenging task, requiring an average of 12 steps of reasoning and tool use.
Model Comparison
Data Sample
The current version of this benchmark contains 79 expert-co-created QA pairs, with a plan to release a larger version in the near future. You can find the dataset on HuggingFace. Here's an example interaction from this dataset.

Evaluation Methodology
We evaluate the agentic traces for final-answer correctness and completeness with Claude 3.7 Sonnet, which is given access to the ground-truth answer. We use LangChain's integrations for all reported models in the simulation engine. We use a timeout of 15 minutes and a limit of 100 turns per trace, which we have found to work well with most models.
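As a rough illustration of this LLM-as-judge step (not the exact prompt or grading code used in the benchmark), a grader might look like the sketch below; the model identifier, prompt wording, and PASS/FAIL rubric are assumptions.

```python
# Hypothetical judge sketch: grades an agent's final answer against the
# ground truth. Prompt wording and pass/fail rubric are assumptions.
from langchain_anthropic import ChatAnthropic

judge = ChatAnthropic(model="claude-3-7-sonnet-20250219", temperature=0)

JUDGE_PROMPT = """You are grading a financial analyst agent.
Question: {question}
Ground-truth answer: {ground_truth}
Agent's final answer: {final_answer}

Is the agent's answer correct and complete with respect to the ground truth?
Reply with a single word: PASS or FAIL."""


def grade_trace(question: str, ground_truth: str, final_answer: str) -> bool:
    """Return True if the judge marks the agent's final answer as PASS."""
    response = judge.invoke(
        JUDGE_PROMPT.format(
            question=question,
            ground_truth=ground_truth,
            final_answer=final_answer,
        )
    )
    return response.content.strip().upper().startswith("PASS")
```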
NOTE: We report accuracy over all traces, including those in which the AI agent failed to provide a final task solution, whether due to recursion/API errors in LangGraph or errant agent behavior that forced the conversation to conclude prematurely.