Released April 04, 2026
Archived

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network, to test agents on financial reasoning questions through tool-calling and planning.
Overview

This benchmark improves on Snorkel Finance, which tested agents on tool-calling for financial queries but whose queries required only limited reasoning to answer.

With the Finance Reasoning dataset, our aim was to create question-answer pairs that require models to reason in order to answer them correctly. An example query: "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"

Leaderboard

Rank  Model                           Score
1     Grok 4                          53.1%
2     GPT-5.4                         52%
3     Claude Sonnet 3.7               51.89%
4     gpt-5                           51%
5     Claude Sonnet 4                 49.37%
6     Claude Opus 4                   48.1%
7     Gemini 3 Pro                    46.84%
8     gpt-5-mini                      46.8%
9     o4 mini                         45.57%
10    Claude Opus 4.1                 45.56%
11    GPT-4.1                         44.3%
12    o3                              43.04%
13    Grok 3                          41.8%
14    Grok 4 Fast Reasoning           40.51%
15    NVIDIA Nemotron Super 49B v1.5  35.443%
16    Kimi-K2-Thinking                35%
17    Gemini 2.5 Pro                  34.6%
18    Nova Premier                    34.17%
19    Gemini 2.5 Flash                32%
20    gpt-oss-120b                    31.6%
21    o3-mini                         30.37%
22    gpt-5-nano                      26.6%
23    Qwen 3 235B                     17.7%
24    Magistral Medium                13.92%
25    Nova Pro                        12.65%
26    Mistral Large                   10.12%

Data sample

The current version of this benchmark contains 79 expert-co-created QA pairs, and we plan to release a larger version in the near future. You can find the dataset on Hugging Face. Here is an example interaction from the dataset.
User
For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?
Agent
I'll help you analyze AT&T's post-retirement benefit obligations and their interest burden to understand the company's long-term liability management in 2024. Let me start by exploring the available data.
TOOL CALL
get_descriptions({
  "company_name": "at_t"
})
Tool :
["us_gaap_ScheduleOfComponentsOfIncomeTaxExpenseBenefitTableTextBlock", "us_gaap_ScheduleOfDebtTableTextBlock", '...']
Agent
Let me look for tables related to pension and postretirement benefits:
Tool :
...
Agent
Let me also check for plan obligations information:
TOOL CALL
get_table_info({
  "company_name": "at_t",
  "table_name": "_ScheduleOfPlanObligationsInExcessOfPlanAssetsTableTextBlock"
})
Tool :
{"description": "the_following_table_presents_the_change_in_the_projected_benefit_obligation_for_the_years_ended_december_31","column_names": ["benefit_obligation_component","pension_benefits_2024","pension_benefits_2023",...]}
Agent
This is exactly what I need! Let me query this table to get the interest cost on post-retirement benefits...
Tool :
[{"assumption_type":"weighted-average_discount_rate_for_determining_benefit_obligation_at_december_31","postretirement_benefits_2024":...}]
Agent

Based on my analysis of AT&T's post-retirement benefit obligations and their interest burden, I can provide you with a comprehensive assessment:

Analysis Summary:

AT&T's postretirement benefit obligations represent a relatively modest portion of the company's overall interest burden, indicating effective long-term liability management in 2024. The postretirement benefit interest cost of $310 million accounts for approximately 4.6% of AT&T's total interest expense of $6,759 million, while the total postretirement benefit obligation of $6,339 million represents only 4.7% of the company's total long-term debt of approximately $136 billion. The company has successfully reduced its postretirement benefit obligations from $6,693 million to $6,339 million year-over-year, demonstrating proactive liability management, while benefiting from higher discount rates (5.60% in 2024 vs 5.00% in 2023) that help reduce the present value of future obligations, ultimately positioning these liabilities as well-controlled and not a significant financial burden relative to the company's overall debt structure.

12 tool calls • Scored: correct & complete
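
The percentages in the final answer above can be re-derived from the figures it quotes. The short check below is only a sketch: the variable names are ours, and the long-term debt figure uses the answer's own approximation of $136 billion rather than an exact filing value.

```python
# Figures quoted in the agent's final answer (USD millions). The long-term debt
# value is the answer's approximation ("approximately $136 billion").
postretirement_interest_cost = 310
total_interest_expense = 6_759
postretirement_obligation_2024 = 6_339
postretirement_obligation_2023 = 6_693
long_term_debt_approx = 136_000


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


interest_share = pct(postretirement_interest_cost, total_interest_expense)
obligation_share = pct(postretirement_obligation_2024, long_term_debt_approx)
obligation_change = postretirement_obligation_2024 - postretirement_obligation_2023

print(f"Interest cost as share of total interest expense: {interest_share}%")
print(f"Obligation as share of long-term debt: {obligation_share}%")
print(f"Year-over-year change in obligation: {obligation_change} million")
```

Both ratios match the 4.6% and 4.7% stated in the answer, and the obligation falls by $354 million year over year.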

Dataset

The evaluation dataset is publicly available on Hugging Face. It includes question-answer pairs co-created with Snorkel's financial expert network, covering multi-step reasoning tasks grounded in 10-K filings.

Methodology

Evaluator
Claude Sonnet 3.7, with access to the ground truth answer, scoring for final answer correctness and completeness.
Timeout
15 minutes per trace, 100 turns maximum.
Integration
LangChain's integration for all reported models in the simulation engine.
Scoring Note
Accuracy is reported across all traces, including those in which the agent failed to provide a final answer because of recursion or API errors in LangGraph, or because errant behavior forced the conversation to end prematurely.
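
One way to read the scoring note is that a trace with no final answer stays in the denominator and counts as incorrect. The sketch below illustrates that reading only; the `Trace` record, its field names, and the sample data are hypothetical and are not taken from the benchmark's harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """One agent run on one question (hypothetical record layout)."""
    question_id: str
    final_answer: Optional[str]   # None if the run ended without an answer
    judged_correct: bool = False  # evaluator verdict; ignored when there is no answer


def accuracy(traces):
    """Fraction of ALL traces judged correct; unanswered traces count as wrong."""
    if not traces:
        raise ValueError("no traces to score")
    correct = sum(
        1 for t in traces if t.final_answer is not None and t.judged_correct
    )
    return correct / len(traces)


# Made-up traces for illustration only.
traces = [
    Trace("q1", "some answer", judged_correct=True),
    Trace("q2", "some answer", judged_correct=False),
    Trace("q3", None),  # e.g. a recursion-limit or API error ended the run
    Trace("q4", "some answer", judged_correct=True),
]
print(f"{accuracy(traces):.1%}")
```

Under this convention the four made-up traces score 2 of 4. Dropping the failed trace from the denominator would instead give 2 of 3, so the choice matters when comparing models with different failure rates.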

Behind the benchmark

As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can answer high-level questions using information in 10-K filings. To do this, we converted the tables in 10-K documents into a relational database. Agents must reason about what information is required, use database tools to look up the correct tables, make accurate SQL calls, often in succession, and combine the results into a final response.
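
The lookup-then-query pattern described above can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table name, the column layout, and the `run_sql` helper are hypothetical, and `get_descriptions` and `get_table_info` only borrow their names from the sample trace (the real tools also take a `company_name` argument). The numbers are the ones quoted in the sample answer.

```python
import sqlite3

# Hypothetical stand-in for one 10-K table converted to a relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE benefit_obligation ("
    "component TEXT, "
    "postretirement_benefits_2024 REAL, "
    "postretirement_benefits_2023 REAL)"
)
conn.executemany(
    "INSERT INTO benefit_obligation VALUES (?, ?, ?)",
    [
        ("interest_cost", 310.0, None),  # 2023 value is not given in the sample
        ("benefit_obligation_at_end_of_year", 6339.0, 6693.0),
    ],
)


def get_descriptions(conn):
    """List the available table names (simplified, hypothetical tool)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def get_table_info(conn, table_name):
    """Return the column names of one table (simplified, hypothetical tool)."""
    if table_name not in get_descriptions(conn):
        raise ValueError(f"unknown table: {table_name}")
    rows = conn.execute(f'PRAGMA table_info("{table_name}")')
    return {"column_names": [row[1] for row in rows]}


def run_sql(conn, query):
    """Run a read query and return all rows (hypothetical tool)."""
    return conn.execute(query).fetchall()


# Step 1: discover tables. Step 2: inspect columns. Step 3: query and combine.
tables = get_descriptions(conn)
info = get_table_info(conn, tables[0])
rows = run_sql(
    conn,
    "SELECT postretirement_benefits_2024, postretirement_benefits_2023 "
    "FROM benefit_obligation "
    "WHERE component = 'benefit_obligation_at_end_of_year'",
)
end_2024, end_2023 = rows[0]
change = end_2024 - end_2023
print(tables, info["column_names"], change)
```

Each step depends on the output of the previous one, which is why a single question can require many tool calls in succession.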

Question-answer pairs have been carefully co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure that they are high quality, accurate, representative of real-world financial analyst questions, and demanding of sufficient reasoning. This is a challenging task, requiring an average of 12 steps of reasoning and tool use.
