Bhavishya Pohani

Applied Research Scientist (NLP), Snorkel AI

Bhavishya Pohani is an Applied Research Scientist at Snorkel AI, focusing on large language model fine-tuning and agentic systems. Before Snorkel, he built deep learning systems at Chubb Insurance.

The latest from Bhavishya

Building FinQA: An Open RL Environment for Financial Reasoning Agents
Blog

TL;DR: We built FinQA — a financial question-answering environment with 290 expert-curated questions across 22 public companies, now available on OpenEnv. Agents use MCP tools to discover schemas, write constrained SQL queries, and answer multi-step questions from real SEC 10-K filings. Most open-source models struggle with this kind of multi-step tool use, and even frontier closed-source models, while more accurate,…

Mar 30, 2026
Benchmarking Agents in Insurance Underwriting Environments
As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not...
Research Paper
Accepted to CAIS 2026

Jan 31, 2026
Snorkel Team
Evaluating multi-agent systems in enterprise tool use
Blog

In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is…

Oct 09, 2025
