The Snorkel AI Blog

Introducing the Snorkel Agentic Coding Benchmark

Today, we’re sharing details about the Snorkel Agentic Coding benchmark—a comprehensive evaluation suite designed to test whether agents can handle the full complexity of software engineering work.

Kobie Crawford

January 9, 2026

2026: The year of environments

We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger—and we couldn’t be more grateful to the research community for pushing these ideas forward.

Snorkel Team

December 10, 2025

Part V: Future direction and emerging trends

Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.

Justin Bauer

December 5, 2025

The self-critique paradox: Why AI verification fails where it’s needed most

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The promise vs. the reality…

Armin Parchami

November 26, 2025

A chat with the Terminal-Bench team

Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.

Kobie Crawford, Fred Sala

November 19, 2025

Intelligence per watt: A new metric for AI’s future

Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.

Kobie Crawford

November 12, 2025

Terminal-Bench 2.0: Raising the bar for AI agent evaluation

Terminal-Bench 2.0 launches today, marking a major leap in AI agent evaluation. Snorkel AI contributed key research and task design to this release.

Kobie Crawford

November 7, 2025

Snorkeling in RL environments

We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.

Armin Parchami

November 4, 2025

Introducing SnorkelSpatial

A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs. Large language models (LLMs) are showing remarkable results on solving complex reasoning problems across domains—from mathematical proofs and logical puzzles to graduate-level science and engineering questions. On the other hand, their spatial reasoning capabilities are less understood, even though such reasoning underlies many everyday tasks. We…

Harit Vishwakarma

October 24, 2025

Scaling trust: rubrics in Snorkel’s quality process

Snorkel’s “Trusted Scale” philosophy. Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles…

Derek Pham

October 16, 2025

Evaluating multi-agent systems in enterprise tool use

In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is…

Bhavishya Pohani

October 9, 2025

Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…

Kobie Crawford, Jeong Shin, Tom Walshe

September 30, 2025

Parsing isn’t neutral: why evaluation choices matter

Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…

Justin Bauer

September 26, 2025

The science of rubric design

Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.

Charles Dickens, Chris Glaze

September 11, 2025

The right tool for the job: An A-Z of rubrics

Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.

Tom Walshe, Armin Parchami

September 2, 2025

Data quality and rubrics: how to build trust in your models

Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.

Armin Parchami

July 29, 2025

Building the benchmark: inside our agentic insurance underwriting dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

Chris Glaze, Fred Sala

July 10, 2025

Evaluating AI agents for insurance underwriting

In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals…

Annotation, LLMs

Chris Glaze

June 26, 2025

LLM observability: key practices, tools, and challenges

LLM observability is crucial for monitoring, debugging, and improving large language models. Learn key practices, tools, and strategies of LLM observability.

Snorkel Team

June 23, 2025

Anthropic Claude + AWS: revolutionizing pharma data analytics with Snorkel AI

Explore how Anthropic Claude + AWS help pharmaceutical companies leverage AI for enhanced data insights and revenue growth.

Healthcare, LLMs, Partners

Shan Kandaswamy (AWS), Matt Casey

June 4, 2025

Data-centric development of an enterprise AI agent with Snorkel

See how we can use two new products, Snorkel Evaluate and Expert Data-as-a-Service, to evaluate and develop a specialized agentic AI system for an enterprise use case.

Alex Ratner

May 29, 2025

Building the data development platform for specialized AI

Announcing two new products on our AI Data Development Platform that together create a complete solution for enterprises to specialize AI systems with expert data at scale.

Alex Ratner

May 29, 2025

LLM-as-a-judge for enterprises: evaluate model alignment at scale

Discover how enterprises can leverage LLM-as-Judge systems to evaluate generative AI outputs at scale, improve model alignment, reduce costs, and tackle challenges like bias and interpretability.

Annotation, Evaluation, LLMs

Matt Casey, Tom Walshe

March 26, 2025

Why GenAI evaluation requires SME-in-the-loop for validation and trust

It’s critical that enterprises can trust and rely on GenAI evaluation results, and for that, SME-in-the-loop workflows are needed. In my first blog post on enterprise GenAI evaluation, I discussed the importance of specialized evaluators as a scalable proxy for SMEs. It simply isn’t practical to task SMEs with performing manual evaluations: it can take weeks if not longer, unnecessarily…

Evaluation, GenAI

Shane Johnson

March 20, 2025

Research spotlight: is long chain-of-thought structure all that matters when it comes to LLM reasoning distillation?

We’re taking a look at the research paper, LLMs can easily learn to reason from demonstration (Li et al., 2025), in this week’s community research spotlight. It focuses on how the structure of reasoning traces impacts distillation from models such as DeepSeek R1. What’s the big idea regarding LLM reasoning distillation? The reasoning capabilities of powerful models such as DeepSeek…

Benchmarks, Distillation, Fine-Tuning, GenAI, LLM

Shane Johnson

March 19, 2025

Why enterprise GenAI evaluation requires fine-grained metrics to be insightful

GenAI needs fine-grained evaluation for AI teams to gain actionable insights.

Evaluation, GenAI

Shane Johnson

March 18, 2025

What is specialized GenAI evaluation, and why is it so critical to enterprise AI?

Specialized GenAI evaluation ensures AI assistants meet business requirements, SME expertise, and industry regulations—critical for production-ready AI.

Evaluation, GenAI

Shane Johnson

March 5, 2025

LLM alignment techniques: 4 post-training approaches

Ensure your LLMs align with your values and goals using LLM alignment techniques. Learn how to mitigate risks and optimize performance.

Alignment, LLMs

Tom Walshe

March 4, 2025

Research spotlight: Is intent analysis the key to unlocking more accurate LLM question answering?

Learn how ARR improves QA accuracy in LLMs through intent analysis, retrieval, and reasoning. Is intent the key to smarter AI? Explore ARR results!

ARR, chain-of-thought

Shane Johnson

February 27, 2025

Why enterprises should embrace LLM distillation

Unlock possibilities for your enterprise with LLM distillation. Learn how distilled, task-specific models boost performance and shrink costs.

Data Development, Data Labeling, Foundation Models, LLMs

Shane Johnson

February 18, 2025
