Blog

Ideas, updates, and practical guidance from the Snorkel team.

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

February 11, 2026
All articles
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control…

Jun 04, 2026
Benchtalks #2: The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

Highlights

More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com.
More from John Yang: Publications and writing at john-b-yang.github.io.
Snorkel…

Jun 03, 2026
Building AI-Native Systems for Federal Infrastructure: A Conversation with Rezaur Rahman

Christopher Sniffen recently sat down with Rezaur Rahman — CIO / CISO / CAIO at the Advisory Council on Historic Preservation — for a conversation on what it actually takes to build frontier AI for federal infrastructure. They get into the limits of frontier models on geospatial reasoning, mechanistic interpretability for applied AI, the trick that makes vision models useful…

May 14, 2026
Code World Models and AutoHarness for LLM Agents

At our latest Snorkel AI Reading Group, Carter Wendelken of Google DeepMind walked us through two related papers he presented at ICLR: Code World Models for General Game Playing and AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. Both ask the same question from opposite ends: when you want an LLM to act reliably in a complex, possibly…

May 14, 2026
Why coding agents need better data, evals, and environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 11, 2026
Understanding Olmix: A Framework for Data Mixing Throughout Language Model Development

At our latest Snorkel AI Reading Group, Mayee Chen (Stanford, Hazy Research) stopped by our San Francisco office to walk us through Olmix: A Framework for Data Mixing Throughout LM Development — work she contributed to during her internship at Ai2 on OLMo 3. Olmix tackles one of the messiest, least-documented levers in LLM pre-training: how to set the ratios…

May 01, 2026
Benchmarks should shape the frontier, not just measure it

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks drive how the field allocates research effort, the bar for benchmarks has risen as well. Here, we share what’s now table stakes for useful benchmarks, and what separates the ones…

Apr 07, 2026
Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

To kick off our inaugural Benchtalks, a series dedicated to the researchers building these measurement toolkits, Snorkel AI co-founder Vincent Sunn Chen sat down with Alex Shaw, Founding MTS at Laude Institute and co-creator of Terminal-Bench and Harbor.

Highlights

More on Terminal-Bench: See the leaderboard and the catalog of tasks at tbench.ai.
Explore Harbor: Learn how to scale your agent…

Mar 31, 2026
Building FinQA: An Open RL Environment for Financial Reasoning Agents

TL;DR: We built FinQA — a financial question-answering environment with 290 expert-curated questions across 22 public companies, now available on OpenEnv. Agents use MCP tools to discover schemas, write constrained SQL queries, and answer multi-step questions from real SEC 10-K filings. Most open-source models struggle with this kind of multi-step tool use, and even frontier closed-source models, while more accurate,…

Mar 30, 2026
