Finance Reasoning
Overview
This benchmark improves on Snorkel Finance, which tested agents on tool calling for financial queries but whose queries required only limited reasoning to answer.
With the Financial Reasoning dataset, our aim was to create question-answer pairs that require models to reason in order to answer correctly. An example financial query is "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"
As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can answer high-level questions grounded in information from 10-K filings. To do this, we converted the tables in the 10-K documents into a relational database. Agents therefore need to reason about what information is required to answer the question, use the database tools to look up the correct tables, make accurate SQL calls (often several in succession), and combine the results into a final response.
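To make this setup concrete, the sketch below shows the general shape of such database tools, exposed to an agent via LangChain's tool decorator over a SQLite database. The database path, schema, and tool names here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of the kind of database tools the agent is given.
# The path and schema are hypothetical, for illustration only.
import sqlite3
from langchain_core.tools import tool

DB_PATH = "10k_filings.db"  # hypothetical path to the relational 10-K database


@tool
def list_tables() -> str:
    """List the tables extracted from the 10-K filings."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return ", ".join(name for (name,) in rows)


@tool
def run_sql(query: str) -> str:
    """Run a SQL query against the 10-K database and return the result rows."""
    with sqlite3.connect(DB_PATH) as conn:
        try:
            rows = conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            return f"SQL error: {exc}"
    return "\n".join(str(row) for row in rows)
```

An agent answering the AT&T example above would typically chain several such calls: first listing tables, then querying postretirement benefit obligations and interest expense, and finally combining the results into a written analysis.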
Our question-answer pairs were carefully co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure they are high quality: representative of real-world financial analyst questions, accurate, and requiring sufficient reasoning to answer. This is a challenging task, requiring an average of 12 steps of reasoning and tool use.
Model Comparison
Data Sample
The current version of this benchmark contains 79 expert-co-created QA pairs, with a plan to release a larger version in the near future. You can find the dataset on HuggingFace. Here's an example interaction from this dataset.

Evaluation Methodology
We evaluate the agentic traces for final-answer correctness and completeness with Claude 3.7 Sonnet, which is given access to the ground-truth answer. We use LangChain's integrations for all reported models in the simulation engine. We use a timeout of 15 minutes and a limit of 100 turns per trace, which we have found to work well with most models.
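As a rough illustration of this LLM-as-judge step (not the exact prompt or grading code used in the benchmark), a grader might look like the sketch below; the model identifier, prompt wording, and PASS/FAIL rubric are assumptions.

```python
# Hypothetical judge sketch: grades an agent's final answer against the
# ground truth. Prompt wording and pass/fail rubric are assumptions.
from langchain_anthropic import ChatAnthropic

judge = ChatAnthropic(model="claude-3-7-sonnet-20250219", temperature=0)

JUDGE_PROMPT = """You are grading a financial analyst agent.
Question: {question}
Ground-truth answer: {ground_truth}
Agent's final answer: {final_answer}

Is the agent's answer correct and complete with respect to the ground truth?
Reply with a single word: PASS or FAIL."""


def grade_trace(question: str, ground_truth: str, final_answer: str) -> bool:
    """Return True if the judge marks the agent's final answer as PASS."""
    response = judge.invoke(
        JUDGE_PROMPT.format(
            question=question,
            ground_truth=ground_truth,
            final_answer=final_answer,
        )
    )
    return response.content.strip().upper().startswith("PASS")
```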
NOTE: We report accuracy over all traces, including those in which the AI agent failed to provide a final task solution, whether due to recursion/API errors in LangGraph or errant agent behavior that forced the conversation to conclude prematurely.