We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger—and we couldn’t be more grateful to the research community for pushing these ideas forward.
This post explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.
TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The promise vs. the reality…
Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.
Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.
Terminal-Bench 2.0 launches today, marking a major leap in AI agent evaluation. Snorkel AI contributed key research and task design to this release.
We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.
A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs
Large language models (LLMs) are showing remarkable results on complex reasoning problems across domains—from mathematical proofs and logical puzzles to graduate-level science and engineering questions. Yet their spatial reasoning capabilities are less well understood, even though such reasoning underlies many everyday tasks. We…
Snorkel’s “Trusted Scale” philosophy
Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles…