Continual Learning Bench by Berkeley & Snorkel

RESOURCES

Blog

Ideas, updates, and practical guidance from the Snorkel team.

Research

Benchtalks #3: We taught AI everything except how to learn

Featuring Parth Asawa (Continual Learning Bench)

Vincent Sunn Chen

June 25, 2026

NEW

Claude Opus 5: Performance and Error Analysis on Frontier Coding Tasks

Anthropic’s Claude Opus 5 recently debuted as the second model overall on the current Senior SWE-bench leaderboard, behind Fable 5. It also achieves the highest score of any evaluated model on the benchmark’s Bug & Performance Investigation category, reinforcing the rapid progress frontier coding models continue to make on increasingly realistic software engineering tasks. Just as notable, Opus 5 reaches…

Jul 27, 2026 •

Ankit Aich

Learn more about Claude Opus 5: Performance and Error Analysis on Frontier Coding Tasks

Senior SWE-Bench: Evaluating Coding Agents Like Senior Engineers

At our latest Snorkel AI Reading Group, Henry Ehrenberg presented Senior SWE-Bench, an open-source, Harbor-compatible benchmark for evaluating coding agents on realistic, senior-level software engineering work. Its 100 tasks, with 50 public and 50 kept private to mitigate contamination, are sourced from real pull requests across 12 production repositories and cover complex features, migrations, bugs, and performance issues. Senior SWE-Bench…

Jul 16, 2026 •

Snorkel Team

Learn more about Senior SWE-Bench: Evaluating Coding Agents Like Senior Engineers

Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

We’ve evaluated Grok 4.5 on Snorkel’s GDPval+ dataset, Snorkel’s expert-created dataset of professional workplace reasoning tasks from across the economy. To compare performance against other frontier models, we ran the evaluation alongside GPT 5.5 and Claude Opus 4.8. Overall, Grok 4.5 demonstrated the strongest overall performance. Dataset GDPval+ is part of the Snorkel Data Series (SDS), Snorkel’s portfolio of expert-curated…

Jul 08, 2026 •

Jacob Fleisig

Learn more about Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 30, 2026 •

Snorkel Team

Learn more about Agents’ Last Exam: AI Benchmarking for Real Work

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026 •

Chris Glaze

Learn more about Continual learning and evaluating how AI agents learn across sequences of tasks

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026 •

Vincent Sunn Chen

Learn more about Benchtalks #3: We taught AI everything except how to learn

Agentic AI evaluation: Closing the gap with better benchmarks and data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to close that…

Jun 23, 2026 •

Snorkel Team

Learn more about Agentic AI evaluation: Closing the gap with better benchmarks and data

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based…

Jun 18, 2026 •

Snorkel Team

Learn more about JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

The Art and Science of Building AI Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with…

Jun 16, 2026 •

Snorkel Team

Learn more about The Art and Science of Building AI Benchmarks That Shape the Field

1 2 … 38

Join our newsletter

For expert advice, the latest research, and exclusive events.

By submitting this form, I acknowledge I will receive email updates from Snorkel AI, and I agree to the Terms of Use and acknowledge that my information will be used in accordance with the Privacy Policy.

Blog

Benchtalks #3: We taught AI everything except how to learn

Join our newsletter

How do you want to work with Snorkel?