
Resource library

Explore our complete library of resources including blogs, benchmarks, research papers, and more.

Blog

Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

Announcing a $3M commitment to launch Open Benchmarks Grants
September 30, 2025
Blog

Closing the Evaluation Gap in Agentic AI

February 11, 2026
Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

March 31, 2026
Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

March 30, 2026
Blog

The science of rubric design

September 11, 2025
Blog
NEW
The Standard for Agents You Can Trust: Lessons from the Federal Front Lines

In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on…

Jun 05, 2026
Blog
NEW
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control….

Jun 04, 2026
Blog
NEW
Benchtalks #2: The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.
Highlights
More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com.
More from John Yang: Publications and writing at john-b-yang.github.io.
Snorkel…

Jun 03, 2026
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys, including at major U.S. law firms, with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality...
Research Paper

May 26, 2026
Blog
Building AI-Native Systems for Federal Infrastructure: A Conversation with Rezaur Rahman

Christopher Sniffen recently sat down with Rezaur Rahman — CIO / CISO / CAIO at the Advisory Council on Historic Preservation — for a conversation on what it actually takes to build frontier AI for federal infrastructure. They get into the limits of frontier models on geospatial reasoning, mechanistic interpretability for applied AI, the trick that makes vision models useful…

May 14, 2026
Blog
Code World Models and AutoHarness for LLM Agents

At our latest Snorkel AI Reading Group, Carter Wendelken of Google DeepMind walked us through two related papers he presented at ICLR: Code World Models for General Game Playing and AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. Both ask the same question from opposite ends: when you want an LLM to act reliably in a complex, possibly…

May 14, 2026
Blog
Why coding agents need better data, evals, and environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 11, 2026
Blog
Understanding Olmix: A Framework for Data Mixing Throughout Language Model Development

At our latest Snorkel AI Reading Group, Mayee Chen (Stanford, Hazy Research) stopped by our San Francisco office to walk us through Olmix: A Framework for Data Mixing Throughout LM Development — work she contributed to during her internship at Ai2 on OLMo 3. Olmix tackles one of the messiest, least-documented levers in LLM pre-training: how to set the ratios…

May 01, 2026
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...
Research Paper
Accepted to MLSys 2026

