Blog

Ideas, updates, and practical guidance from the Snorkel team.

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

February 11, 2026
All articles
Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 30, 2026
Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026
Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026
Agentic AI evaluation: Closing the gap with better benchmarks and data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to close that…

Jun 23, 2026
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based…

Jun 18, 2026
The Art and Science of Building AI Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with…

Jun 16, 2026
Cua-Bench: benchmarking computer-use agents on professional software

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

The standard for agents you can trust: Lessons from the federal front lines

In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on…

Jun 05, 2026
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control…

Jun 04, 2026