Archived

Finance Reasoning

A benchmark, co-created with Snorkel's financial expert network, that tests agents on financial reasoning questions requiring tool-calling and planning.

Overview

This benchmark improves on Snorkel Finance, which tested agents on tool-calling for financial queries but whose questions required only limited reasoning to answer.

With the Financial Reasoning dataset, our aim was to create question-answer pairs that required models to reason in order to answer them correctly. An example query: "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"

Leaderboard

Rank	Model	Score
1	Grok 4	53.1%
2	GPT-5.4	52%
3	Claude Sonnet 3.7	51.89%
4	gpt-5	51%
5	Claude Sonnet 4	49.37%
6	Claude Opus 4	48.1%
7	Gemini 3 Pro	46.84%
8	gpt-5-mini	46.8%
9	o4 mini	45.57%
10	Claude Opus 4.1	45.56%
11	GPT-4.1	44.3%
12	o3	43.04%
13	Grok 3	41.8%
14	Grok 4 Fast Reasoning	40.51%
15	NVIDIA Nemotron Super 49B v1.5	35.443%
16	Kimi-K2-Thinking	35%
17	Gemini 2.5 Pro	34.6%
18	Nova Premier	34.17%
19	Gemini 2.5 Flash	32%
20	gpt-oss-120b	31.6%
21	o3-mini	30.37%
22	gpt-5-nano	26.6%
23	Qwen 3 235B	17.7%
24	Magistral Medium	13.92%
25	Nova Pro	12.65%
26	Mistral Large	10.12%

Data sample

The current version of this benchmark contains 79 expert-co-created QA pairs, with a plan to release a larger version in the near future. You can find the dataset on Hugging Face. Here is an example interaction from this dataset:

User

For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?

Agent

I'll help you analyze AT&T's post-retirement benefit obligations and their interest burden to understand the company's long-term liability management in 2024. Let me start by exploring the available data.

TOOL CALL

get_descriptions({
"company_name": "at_t"
})

Tool :

["us_gaap_ScheduleOfComponentsOfIncomeTaxExpenseBenefitTableTextBlock", "us_gaap_ScheduleOfDebtTableTextBlock", '...']

Agent

Let me look for tables related to pension and postretirement benefits:

Tool :

...

Agent

Let me also check for plan obligations information:

TOOL CALL

get_table_info({
  "company_name": "at_t",
  "table_name":
  "_ScheduleOfPlanObligationsInExcessOfPlanAssetsTableTextBlock"
})

Tool :

{"description": "the_following_table_presents_the_change_in_the_projected_benefit_obligation_for_the_years_ended_december_31","column_names": ["benefit_obligation_component","pension_benefits_2024","pension_benefits_2023",...]}

Agent

This is exactly what I need! Let me query this table to get the interest cost on post-retirement benefits...

Tool :

[{"assumption_type":"weighted-average_discount_rate_for_determining_benefit_obligation_at_december_31","postretirement_benefits_2024":...}]

Agent

Based on my analysis of AT&T's post-retirement benefit obligations and their interest burden, I can provide you with a comprehensive assessment:

Analysis Summary:

AT&T's postretirement benefit obligations represent a relatively modest portion of the company's overall interest burden, indicating effective long-term liability management in 2024. The postretirement benefit interest cost of $310 million accounts for approximately 4.6% of AT&T's total interest expense of $6,759 million, while the total postretirement benefit obligation of $6,339 million represents only 4.7% of the company's total long-term debt of approximately $136 billion. The company has successfully reduced its postretirement benefit obligations from $6,693 million to $6,339 million year-over-year, demonstrating proactive liability management, while benefiting from higher discount rates (5.60% in 2024 vs 5.00% in 2023) that help reduce the present value of future obligations, ultimately positioning these liabilities as well-controlled and not a significant financial burden relative to the company's overall debt structure.

12 tool calls • Scored: correct & complete
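
The percentages in the final answer above can be re-derived from the figures it quotes. The following is a minimal sketch of that arithmetic, not part of the benchmark itself; the variable names and the `pct` helper are illustrative, and the long-term debt figure uses the rounded "approximately $136 billion" from the answer.

```python
# Figures quoted in the agent's final answer above (USD millions).
interest_cost = 310             # postretirement benefit interest cost
total_interest_expense = 6_759  # total interest expense
obligation_2024 = 6_339         # postretirement benefit obligation, 2024
obligation_2023 = 6_693         # postretirement benefit obligation, 2023
long_term_debt = 136_000        # "approximately $136 billion", rounded


def pct(part, whole):
    """Return part / whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


interest_share = pct(interest_cost, total_interest_expense)
debt_share = pct(obligation_2024, long_term_debt)
obligation_change = obligation_2024 - obligation_2023

print(interest_share)     # share of total interest expense, in percent
print(debt_share)         # obligation relative to long-term debt, in percent
print(obligation_change)  # year-over-year change, USD millions
```

Both ratios match the 4.6% and 4.7% stated in the answer, and the obligation falls by $354 million year-over-year.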

Dataset

The evaluation dataset is publicly available on Hugging Face. It includes question-answer pairs co-created with Snorkel's financial expert network, covering multi-step reasoning tasks grounded in 10-K filings.

Methodology

Evaluator

Claude Sonnet 3.7, given access to the ground-truth answer, scores each final answer for correctness and completeness.

Timeout

15 minutes per trace, 100 turns maximum.
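
One way such a budget could be enforced is sketched below. This is an assumption for illustration, not the benchmark's actual harness: `run_trace`, the `step` callback, and the status labels are hypothetical names, and the real simulation engine may check its limits differently.

```python
import time

MAX_TURNS = 100          # turn limit stated above
MAX_SECONDS = 15 * 60    # 15-minute limit stated above


def run_trace(step, max_turns=MAX_TURNS, max_seconds=MAX_SECONDS,
              clock=time.monotonic):
    """Call step(turn) until it returns an answer or a budget runs out.

    `step` is a hypothetical callback that performs one agent turn and
    returns the final answer, or None if the agent needs another turn.
    """
    start = clock()
    for turn in range(1, max_turns + 1):
        # Check the wall-clock budget before starting each turn.
        if clock() - start > max_seconds:
            return {"status": "timeout", "turns": turn - 1, "answer": None}
        answer = step(turn)
        if answer is not None:
            return {"status": "answered", "turns": turn, "answer": answer}
    return {"status": "turn_limit", "turns": max_turns, "answer": None}


# An agent that answers on its 12th turn finishes within both budgets.
result = run_trace(lambda turn: "final answer" if turn == 12 else None)
print(result["status"], result["turns"])
```

Passing the clock as a parameter keeps the sketch testable without waiting for real time to elapse.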

Integration

All reported models run in the simulation engine through LangChain's integrations.

Scoring Note

Accuracy is reported across all traces, including those in which the agent failed to provide a final task solution because of recursion or API errors in LangGraph, or because errant behavior forced the conversation to conclude prematurely.
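
To make the effect of this choice concrete, here is a minimal sketch of accuracy computed over all traces, where a trace without a final answer counts as incorrect. The `Trace` record and its field names are hypothetical, chosen for illustration; they are not the benchmark's data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    question_id: str
    final_answer: Optional[str]   # None when the run ended without an answer
    judged_correct: bool = False  # evaluator verdict for answered traces


def accuracy(traces):
    """Fraction of ALL traces judged correct; unanswered traces count as wrong."""
    if not traces:
        raise ValueError("no traces to score")
    correct = sum(
        1 for t in traces if t.final_answer is not None and t.judged_correct
    )
    return correct / len(traces)


traces = [
    Trace("q1", "answer text", judged_correct=True),
    Trace("q2", "answer text", judged_correct=True),
    Trace("q3", "answer text", judged_correct=False),
    Trace("q4", None),  # e.g. a recursion or API error ended the run
]

overall = accuracy(traces)
answered_only = accuracy([t for t in traces if t.final_answer is not None])
print(overall, answered_only)
```

In this toy set the overall score is 2 of 4, while scoring only answered traces would give 2 of 3, so including failed traces lowers the reported number.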

Behind the benchmark

As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can find answers to high-level questions based on information in 10-K filings. To do this, we converted information from tables in 10-K documents into a relational database. Agents must reason about what information is required, use database tools to look up the correct tables, make accurate SQL calls, often in succession, and combine the results to produce a final response.
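
The workflow described above can be sketched with an in-memory SQLite database. This is an illustrative approximation under stated assumptions, not the benchmark's implementation: `get_descriptions` and `get_table_info` mirror the tool names in the data sample, but their bodies here are guesses, while `run_sql`, the `DATABASES` mapping, and the table and column names are hypothetical. The two values come from the example answer (USD millions).

```python
import sqlite3

# Hypothetical stand-in for the benchmark database: one in-memory SQLite
# database per company, with illustrative table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE benefit_costs (component TEXT, postretirement_benefits_2024 REAL);
    CREATE TABLE interest_expense (item TEXT, amount_2024 REAL);
    INSERT INTO benefit_costs VALUES ('interest_cost', 310);
    INSERT INTO interest_expense VALUES ('total_interest_expense', 6759);
""")
DATABASES = {"at_t": conn}


def get_descriptions(company_name):
    """List the tables available for a company."""
    rows = DATABASES[company_name].execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def get_table_info(company_name, table_name):
    """Return the column names of one table."""
    if table_name not in get_descriptions(company_name):
        raise ValueError(f"unknown table: {table_name}")
    cols = DATABASES[company_name].execute(
        f'PRAGMA table_info("{table_name}")'
    ).fetchall()
    return {"column_names": [col[1] for col in cols]}


def run_sql(company_name, query):
    """Hypothetical query tool: run SQL and return rows as dictionaries."""
    cur = DATABASES[company_name].execute(query)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


# Step 1: discover tables. Step 2: inspect columns. Step 3: query twice.
tables = get_descriptions("at_t")
columns = get_table_info("at_t", "benefit_costs")["column_names"]
cost = run_sql("at_t", "SELECT postretirement_benefits_2024 AS v "
                       "FROM benefit_costs WHERE component = 'interest_cost'")[0]["v"]
total = run_sql("at_t", "SELECT amount_2024 AS v FROM interest_expense "
                        "WHERE item = 'total_interest_expense'")[0]["v"]

# Step 4: combine the two query results into the final figure.
share = round(100 * cost / total, 1)
print(tables, columns, share)
```

The final step is the one that needs reasoning: neither table contains the ratio, so the agent has to decide which two values to fetch and how to combine them.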

Question-answer pairs have been carefully co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure they are high quality, accurate, representative of real-world financial analyst questions, and demanding enough to require genuine reasoning. This is a challenging task, requiring an average of 12 steps of reasoning and tool use.