Continual Learning Bench
A benchmark evaluating whether AI systems genuinely improve from prior experience. Unlike static benchmarks that treat every task as independent, Continual Learning Bench measures performance across sequential, stateful task sequences, rewarding systems that accumulate and apply knowledge over time.


Most benchmarks make a core assumption: models are stateless. Once they complete a task, they move on to the next as if the first never happened. In practice, deployed systems keep encountering new information and operate in sequential environments, where they should improve with experience.
Continual Learning Bench is a benchmark of expert-validated task sequences across real-world domains (software engineering, data science, strategic modeling) where tasks are not independent, systems are expected to change during evaluation, and performance depends on what the system has seen before.
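To make the stateful versus stateless distinction concrete, here is a minimal sketch of how a sequence might be scored both ways. Everything in it is hypothetical and for illustration only: the `Task` and `EchoMemorySystem` classes, the exact-match reward, and the definition of gain as stateful reward minus stateless reward are assumptions, not the benchmark's actual harness or scoring rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical task: a prompt with a single correct answer."""
    prompt: str
    answer: str


@dataclass
class EchoMemorySystem:
    """Toy system that can only answer prompts it has already seen."""
    memory: Dict[str, str] = field(default_factory=dict)

    def solve(self, task: Task) -> str:
        return self.memory.get(task.prompt, "unknown")

    def observe(self, task: Task) -> None:
        # Feedback after the attempt: remember the revealed answer.
        self.memory[task.prompt] = task.answer


def run_sequence(tasks: List[Task],
                 make_system: Callable[[], EchoMemorySystem],
                 stateful: bool) -> float:
    """Return cumulative reward over the sequence (assumed exact-match reward)."""
    system = make_system()
    total = 0.0
    for task in tasks:
        if not stateful:
            system = make_system()  # stateless baseline: forget everything
        total += 1.0 if system.solve(task) == task.answer else 0.0
        system.observe(task)
    return total


# A short sequence in which later tasks repeat earlier ones.
tasks = [Task("a", "1"), Task("b", "2"), Task("a", "1"), Task("b", "2")]

stateful_reward = run_sequence(tasks, EchoMemorySystem, stateful=True)
stateless_reward = run_sequence(tasks, EchoMemorySystem, stateful=False)
gain = stateful_reward - stateless_reward  # assumed definition of gain
```

In this toy setup the stateful run earns reward on the repeated tasks, while the reset-every-task baseline never does, so the whole difference comes from carried-over experience.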
Leaderboard
| Rank | System | Avg Reward | Avg Gain | Avg Cost |
|---|---|---|---|---|
| 1 | | 0.223 | 0.254 | $30.43 |
| 2 | | 0.201 | 0.201 | $18.39 |
| 3 | | 0.19 | 0.239 | $38.6 |
| 4 | | 0.151 | 0.202 | $18.34 |
| 5 | | 0.102 | 0.195 | $49.62 |
| 6 | | 0.08 | 0.078 | $14.28 |
| 7 | | 0.08 | 0.164 | $7.6 |
| 8 | | 0.066 | 0.146 | $27.21 |
| 9 | | 0.046 | 0.086 | $62.75 |
| 10 | | 0.035 | 0.182 | $31.53 |
| 11 | | -0.002 | 0.094 | $13.32 |
| 12 | | -0.056 | 0.062 | $15.23 |
How it works
Stateful system vs. stateless baseline
Per-task breakdown
| System | Mean Cum. Reward | Mean Cum. Gain | Cost | Runs |
|---|---|---|---|---|
| ICL · GPT-5.4 | 46.198 ± 1.001 | 26.437 ± 1.001 | $1.93 ± $0.05 | 5 |
| Claude Code · Sonnet 4.6 | 44.282 ± 1.449 | 24.522 ± 1.449 | $10.40 ± $2.27 | 5 |
| ICL · Claude Sonnet 4.6 | 36.584 ± 1.262 | 16.825 ± 1.262 | $3.60 ± $0.17 | 5 |
| ICL Notepad · Claude Sonnet 4.6 | 35.993 ± 2.414 | 16.233 ± 2.414 | $2.99 ± $0.27 | 5 |
| Mem0 · GPT-5.4 | 33.794 ± 2.986 | 14.033 ± 2.986 | $1.39 ± $0.07 | 5 |
| ICL · Claude Opus 4.7 | 33.572 ± 3.082 | 13.813 ± 3.082 | $7.58 ± $0.42 | 5 |
| ICL · Gemini 3 Flash | 33.039 ± 0.879 | 13.279 ± 0.879 | $0.68 ± $0.02 | 5 |
| ICL · Gemini 3.1 Pro Preview | 33.033 ± 1.136 | 13.273 ± 1.136 | $3.84 ± $0.17 | 5 |
| Codex · GPT-5.4 | 32.828 ± 0.000 | 13.068 ± 0.000 | $3.15 ± $0.00 | 1 |
| ICL Notepad · GPT-5.4 | 31.915 ± 2.122 | 12.153 ± 2.122 | $1.02 ± $0.05 | 5 |
| ICL Notepad · Gemini 3.1 Pro Preview | 29.122 ± 3.011 | 9.362 ± 3.011 | $2.80 ± $0.53 | 5 |
| ACE · GPT-5.4 | 19.778 ± 0.009 | 0.017 ± 0.009 | $3.96 ± $0.33 | 5 |
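One way to read the two score columns in the per-task table is to subtract gain from reward for each row. The sketch below does that with the values copied from the table. Treating the result as a shared stateless baseline is an inference from the numbers, not something the page states.

```python
# (mean cumulative reward, mean cumulative gain) copied from the table above.
rows = {
    "ICL · GPT-5.4": (46.198, 26.437),
    "Claude Code · Sonnet 4.6": (44.282, 24.522),
    "ICL · Claude Sonnet 4.6": (36.584, 16.825),
    "ICL Notepad · Claude Sonnet 4.6": (35.993, 16.233),
    "Mem0 · GPT-5.4": (33.794, 14.033),
    "ICL · Claude Opus 4.7": (33.572, 13.813),
    "ICL · Gemini 3 Flash": (33.039, 13.279),
    "ICL · Gemini 3.1 Pro Preview": (33.033, 13.273),
    "Codex · GPT-5.4": (32.828, 13.068),
    "ICL Notepad · GPT-5.4": (31.915, 12.153),
    "ICL Notepad · Gemini 3.1 Pro Preview": (29.122, 9.362),
    "ACE · GPT-5.4": (19.778, 0.017),
}

# Reward minus gain for each system (assumed to be the implied baseline).
implied_baseline = {name: reward - gain for name, (reward, gain) in rows.items()}

for name, value in implied_baseline.items():
    print(f"{name:40s} {value:.3f}")

spread = max(implied_baseline.values()) - min(implied_baseline.values())
print(f"spread across systems: {spread:.3f}")
```

Every row lands at roughly 19.76, which is consistent with gain being measured against a single common baseline on this task, with small differences attributable to rounding in the published figures.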
Task suite 1.0
Methodology
Resources
Acknowledgments
The benchmark is led by researchers at UC Berkeley Skylab, UW-Madison, and Snorkel AI via the Open Benchmarks Grants program. Snorkel is actively collaborating on baseline human performance calibration for select tasks.

