Chris Glaze

Blog

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026 • Chris Glaze

Blog

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507, delivering a model that beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks – at 1/60th the size. A full breakdown of the results is published on the rLLM blog. The key insight? Just focus on…

Feb 18, 2026 • Chris Glaze

Benchmarking Agents in Insurance Underwriting Environments

As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not...

Research Paper

Accepted to CAIS 2026

Jan 31, 2026 • Snorkel Team

Blog

The science of rubric design

Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.

Sep 11, 2025 • Charles Dickens, Chris Glaze

Blog

Building the benchmark: inside our agentic insurance underwriting dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

Jul 10, 2025 • Chris Glaze, Fred Sala

Blog

Evaluating AI agents for insurance underwriting

In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals…

Jun 26, 2025 • Chris Glaze

Blog

How does the Snorkel Flow label model work?

The Snorkel Flow label model plays an instrumental role in driving the enterprise value we create. Here’s a peek at how it works.

Jun 18, 2024 • Chris Glaze

Blog

Walking safely before building flying saucer seatbelts: introducing Enterprise Alignment

Snorkel takes a step on the path to enterprise superalignment with new data development workflows for enterprise alignment

May 20, 2024 • Alex Ratner, Tom Walshe, Chris Glaze, Fred Sala, Paroma Varma, Hoang Tran

Blog

Building better enterprise AI: incorporating expert feedback in system development

Enterprises that aim to build valuable GenAI applications must view them at the systems level. LLMs are just one part of an ecosystem.

Jan 30, 2024 • Chris Glaze
