Blog

Ideas, updates, and practical guidance from the Snorkel team.

Research

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

Vincent Sunn Chen

February 11, 2026

Building FinQA: An Open RL Environment for Financial Reasoning Agents

TL;DR: We built FinQA — a financial question-answering environment with 290 expert-curated questions across 22 public companies, now available on OpenEnv. Agents use MCP tools to discover schemas, write constrained SQL queries, and answer multi-step questions from real SEC 10-K filings. Most open-source models struggle with this kind of multi-step tool use, and even frontier closed-source models, while more accurate,…

Mar 30, 2026 • Bhavishya Pohani

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507. The resulting model beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks at 1/60th the size. A full breakdown of the results is published on the rLLM blog. The key insight? Just focus on…

Feb 18, 2026 • Chris Glaze

Coding agents don’t need to be perfect, they need to recover

Error analysis of 8 models on Agentic Coding tasks. Successful completion of complex tasks doesn’t come from models always being right; it comes from models being resilient when things go wrong. To get a deeper understanding of model behavior in agentic environments, our team analyzed all of the errors found in the full traces of tasks from our Agentic Coding…

Feb 13, 2026 • Ramya Ramakrishnan

Closing the Evaluation Gap in Agentic AI

Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real — backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will find hesitation. The reason: our ability to…

Feb 11, 2026 • Vincent Sunn Chen

SlopCodeBench: Measuring Code Erosion as Agents Iterate

SlopCodeBench reveals how AI coding agents degrade code quality over time—measuring “slop,” technical debt, and architectural erosion across iterations.

Jan 20, 2026 • Kobie Crawford

Introducing the Snorkel Agentic Coding Benchmark

Today, we’re sharing details about the Snorkel Agentic Coding benchmark—a comprehensive evaluation suite designed to test whether agents can handle the full complexity of software engineering work.

Jan 09, 2026 • Kobie Crawford

2026: The year of environments

We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger—and we couldn’t be more grateful to the research community for pushing these ideas forward.

Dec 10, 2025 • Snorkel Team

Part V: Future direction and emerging trends

Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.

Dec 05, 2025 • Justin Bauer

The self-critique paradox: Why AI verification fails where it’s needed most

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The promise vs. the reality…

Nov 26, 2025 • Armin Parchami
