
Benchtalks #3: We taught AI everything except how to learn

Featuring Parth Asawa (Continual Learning Bench)

June 25, 2026

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, the first standardized benchmark for measuring whether AI systems actually learn from experience over time, developed in collaboration with Snorkel AI.

Highlights

  • Coding agents re-read your entire codebase every session. Every single time. Any time you start a fresh session, a coding agent ls-es through your codebase and cats all the files just to understand what it looks like. Then you start a new session and it does the same thing all over again. After one or two tasks, a human builds a rough mental model of the dependencies and code structure and stops re-exploring. As Asawa puts it: “That capability just doesn’t exist in coding agents today.”
  • The field got so good at scaling, it forgot the original question. Pre-training on the internet gave models a massive jump in capabilities, so the field quietly put the learning question aside. Continual Learning Bench exists to bring it back. Asawa’s framing: “Rather than teaching the model how to learn, we just taught it a huge amount of capabilities.” His definition of the actual goal: sample-efficient online learning that’s stable over long horizons — not just recalling training data, but improving from experience the way a human does on the job.
  • In-context learning still beats fancier memory systems. One memory system had 8× the cost and ⅓ the gain. Cumulative reward alone can’t tell you whether a system is learning or was just smarter to begin with: a better base model can score higher even if it learns nothing. Continual Learning Bench’s gain metric fixes this: run every system twice, stateful and stateless, and the difference is how much it actually learned. At launch, best-in-class systems hit ~25% normalized gain. In-context learning leads the leaderboard. As Asawa notes: “A lot of these context management systems were probably designed without a clear thing to optimize for. They didn’t have an obvious signal. That probably influenced some of the design failures we’re seeing.”
  • Fable outperformed Opus and Sonnet on Continual Learning Bench. Asawa’s long-run bet is on something different. When Anthropic released Fable, they used Continual Learning Bench to show the model outperformed systems using Opus or Sonnet as a backbone — early validation that the benchmark is measuring something real at the frontier. Their approach: context management, training the model to distill general-purpose rules and manage its own memory more cleverly. Asawa sees it as one direction among many. His own bet: “The ceiling of parametric learning is going to be higher than context management. Context management just won’t scale as well or be as stable.”


Episode Transcript

What is continual learning?

VINCENT: Welcome back to Benchtalks. Today I’m with Parth Asawa, a PhD student at Berkeley advised by Matei Zaharia and Joey Gonzalez. He leads a bunch of really exciting research on continual learning. Welcome to Benchtalks, Parth.

PARTH: Thanks for having me, Vincent. Excited to have this discussion.

VINCENT: For anyone not living in this space, how would you describe continual learning, and why should they care that models are struggling with it today?

PARTH: To take a step back: when I’m talking about continual learning, to me that means sample-efficient online learning that’s stable over long horizons.

Continual learning is not a new concept we discovered with language models. It’s been around for over a decade. There’s a lot of work in the classical neural network space. But with language models specifically, there’s been a very obvious use case that’s popped up. We train language models today in a process where we curate a bunch of data, come up with a nice set of architectures and algorithms, put the model in a box for a month, train it, and then it comes out as a frozen model. We have data up until a certain point, but after that the model is static. It doesn’t inject new information on the fly. It doesn’t learn about world events in real time. It can’t update to particular people. It can’t run experiments and update its hypotheses. The limit of the adaptation it can do is within its context window.

So continual learning, in my definition, is studying how you can teach models to update themselves in a stable manner over long periods of time, and doing so in as sample-efficient a way as humans do, or potentially even better.

VINCENT: This is definitely in the zeitgeist right now. Why do you think it’s become top of mind?

PARTH: It picked up a lot in popularity around the end of 2025. We were seeing a lot of gains from scaling, and people who hadn’t believed in scaling laws before saw these crazy levels of progress in domains like code. It became a lot more real and tangible. Once people started acknowledging that this is real, it became a broader question: okay, what comes next? And I think the answer to that very naturally involved something like continual learning. People were saying: we have the models, we can train them to be really good up to a point, but if you want them to keep improving after that, continual learning is arguably necessary.

A lot of people are thinking about it from the applications side too. In applications you want models to improve from feedback from humans or from interactions in the environment, and that wasn’t necessarily happening. So the scaling laws got models to a point where people said okay, they can definitely do these applications. Now you have to answer for the long-tail use cases. That’s why I’d hypothesize it became a lot more exciting to people.

VINCENT: What I’m hearing is there’s a real difference between capability and learning in these models. How do you see that difference, and how should researchers be thinking about them distinctly?

PARTH: If you think about the initial questions we were asking in machine learning, teaching machines to learn like humans, the learning capability was the big thing we were trying to get right. But we stumbled onto a path where we found out scaling works really well. Pre-training on the internet gives you this really nice access to large amounts of data and that gave us a huge jump in capabilities. So we put the learning question aside for a bit. Rather than teaching the model how to learn, we just taught it a huge amount of capabilities.

But the key aspect of the learning-how-to-learn question is: can the model do it on its own continuously, even after initial training? In an ideal world these wouldn’t even be separate questions. From a pure point of view, you’d have one big training phase that teaches the model how to learn (meta-learning), and then after that there’s no pre-training, mid-training, post-training, or this Frankenstein approach we have today. It’s all just training. We’re in a weird spot where a lot of people who work on continual learning are focused on the post-hoc version, trying to inject new knowledge after the model has been trained. But from first principles, ideally you just teach the model how to learn and then it does all the learning on its own.


Why work on a benchmark?

VINCENT: There seem to be a lot of different ways to tackle this research question. You’ve done work on both methods and benchmarks. Why prioritize the benchmark?

PARTH: I started my research really focused on the sample-efficient learning problem in continual learning. A lot of people cared about catastrophic forgetting, but to me it felt like actually being able to learn in a sample-efficient way was the first-order problem. So I worked on things related to synthetic data, combining synthetic data with self-distillation.

But a theme in the work I was doing, and the work I was reading from others, was that all the evaluations were bespoke. I can probably bet that if you picked apart any three or four papers claiming to do continual learning in language models over the past few years, all the evaluations are different. And one of the historical trends in machine learning is that benchmarks are a way to advance or accelerate field progress: people agree on a set of definitions and capabilities they want to see, you have a standardized way of measuring it, and a lot more people can optimize against that. We didn’t have that in continual learning. We didn’t even agree on a definition.

So the benchmark was selfishly created in the sense that I wanted something I could optimize against, and that other people could point to and say, yeah, this is what we think continual learning should look like. One of the highest-leverage things I can do besides doing research on my own is being able to communicate my research. A standardized evaluation serves that purpose. I’m excited to see it grow, but I’m also excited to get back to methods now that we have something to optimize against.


Continual Learning Bench launch and reception

VINCENT: We had the pleasure of working with you on this. What has the reception been like since we launched it a few weeks ago?

PARTH: The initial reception was honestly way better than I expected. People were excited. Nobody had put out a benchmark for continual learning before. You have an opinionated way of evaluating it, it’s well motivated, it makes sense. So the initial reaction was great.

I had a bunch of follow-up conversations with application-layer companies but also researchers at labs. What are you thinking about, where does the benchmark go from here, if we want to do continual learning in our specific domain what might that look like? There’s still a wide design space and a long way to go. Evals haven’t necessarily caught up for a lot of companies. But there’s a lot of interest, and the question now is how to turn that momentum into real progress in the space.


Anthropic and the Fable release

VINCENT: Even in the last week or two, Anthropic researchers used Continual Learning Bench to evaluate new memory systems in the Claude family. What did you think of their approach?

PARTH: That was super cool. I thought it was great that even some of the frontier labs were thinking about using our benchmark for how they evaluate capabilities. In the Fable release specifically, there was a blog post about what the model looked like when managing its own context while interacting in these environments over sequences of tasks. Their approach was largely in the context management systems direction, with a model managing its own memory in a particular way. What they showed was that when they had Fable in their system, it outperformed systems using Opus or Sonnet as a backbone. That’s pretty cool, and it’s validation for the benchmark.

VINCENT: One of the things I found super interesting in that post was they talked about distilling more general-purpose rules and interweaving that into the memory system in a more clever way as one of the reasons Fable was more effective.

PARTH: Really cool to see different systems emerge as a result of better measurement techniques. Training models to get better at managing their own context and using that as a mechanism toward continual learning is definitely one of the directions. I think there are many other approaches in the design space, but it’s cool that people are looking at these tasks as a way to actually measure that ability to learn.


Benchmark design: what makes a good task

VINCENT: On the benchmark itself, how do you make real the types of tasks and all the nuances that make it an effective measurement of continual learning ability?

PARTH: The biggest difference from a traditional language modeling benchmark is this. In a traditional benchmark, like command-line tasks for TerminalBench or software engineering tasks in SWE-bench, you have an independent set of tasks. You run the model on one, then the other, maybe in parallel, and you get a score at the end based on aggregate performance. You’re measuring point capability.

In a continual learning benchmark, you’re no longer seeking to measure point capability, and you no longer want your tasks to be independent. Instead, you’re measuring performance over a sequence of tasks. These sequences have to be constructed so that continual learning capabilities actually yield better performance than non-continual-learning capabilities. Rather than constructing a bunch of independent tasks, you’re constructing a sequence of maybe 20 or 50 tasks that the model goes through in a stateful manner. You let the model do task two conditioned on having done task one, and you expect a gain in performance as a result of that conditioning.

So you have to construct tasks where there’s room for improvement from having done prior tasks. There’s a shared latent structure between tasks, something the model learns about the environment that it can exploit to get better over time. And it has to be something where, if all the tasks were completely random and different, you wouldn’t expect improvement. There’s some art in this. I don’t think it’s an exact science.
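To make the stateful-versus-independent distinction concrete, here is a minimal Python sketch. It is not the benchmark’s harness: Task, ToyAgent, run_stateful, and run_stateless are hypothetical names, the step counts are invented, and the “agent” does nothing more than remember which environment it has already explored. Performance is counted in steps, so lower is better.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass(frozen=True)
class Task:
    env_key: str        # the shared latent structure, e.g. one repository
    explore_steps: int  # steps needed when the environment is unfamiliar


@dataclass
class ToyAgent:
    """Hypothetical agent whose only 'learning' is remembering explored environments."""
    known_envs: Set[str] = field(default_factory=set)

    def solve(self, task: Task) -> int:
        """Return the number of steps spent on this task."""
        if task.env_key in self.known_envs:
            return 1                       # structure already known: skip exploration
        self.known_envs.add(task.env_key)  # state carried into later tasks
        return task.explore_steps


def run_stateful(make_agent: Callable[[], ToyAgent], tasks: List[Task]) -> List[int]:
    agent = make_agent()                   # one agent sees the whole sequence
    return [agent.solve(t) for t in tasks]


def run_stateless(make_agent: Callable[[], ToyAgent], tasks: List[Task]) -> List[int]:
    return [make_agent().solve(t) for t in tasks]  # fresh agent for every task


same_repo = [Task("repo_a", 60)] * 5
unrelated = [Task(f"repo_{i}", 60) for i in range(5)]

print(run_stateful(ToyAgent, same_repo))   # [60, 1, 1, 1, 1]
print(run_stateless(ToyAgent, same_repo))  # [60, 60, 60, 60, 60]
print(run_stateful(ToyAgent, unrelated))   # [60, 60, 60, 60, 60]
```

With unrelated tasks the stateful run is no better than the stateless one, which is the property described above: without shared structure there is nothing to carry forward.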

VINCENT: What I’m hearing is there’s a component of latent structure that needs to be captured, natural dependencies between tasks that real-world environments actually reflect. What are some specific tasks that demonstrate this?

PARTH: The one that resonates most with computer science researchers is the programming task. Take a coding agent in your particular repository. A behavior you might notice: any time you start a fresh session, it’ll run 60 commands, ls-ing through your codebase, cat-ing all the files, just to understand what it looks like. And then you start a new session and it does the same thing all over again.

That’s obviously wasteful. If I were a human in that codebase, after one or two tasks I’d have a rough mental model of the dependencies and code structure. I wouldn’t need to re-explore everything from scratch. That capability just doesn’t exist in coding agents today.

So one of the benchmark tasks gives the model a series of tasks in the same repository and asks: did the model get more efficient at completing them over time? Efficiency is measured as the number of commands required to reach a correct solution.

We did a similar thing for natural language to SQL (NL-to-SQL). If you’re a data engineer working with a bunch of massive databases with many tables, the first time you’re dropped in you have to learn how the tables link together. But over time you understand the schema, you know what queries you’d need to run, and you can do it in fewer steps. We also added things like database migrations in the middle. Schemas change or get new columns, and you don’t want models to overfit to one particular representation. Part of the task tests whether models are stable when there’s drift in the environment.
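The drift requirement can be sketched in a few lines of Python. This is an illustrative toy, not the benchmark’s NL-to-SQL environment: ToyDatabase, SchemaMemory, and the version counter used to signal a migration are all assumptions made for the example.

```python
from typing import Dict, List, Optional


class ToyDatabase:
    """Hypothetical database that bumps a version number on every migration."""

    def __init__(self, schema: Dict[str, List[str]]) -> None:
        self._schema = {table: list(cols) for table, cols in schema.items()}
        self.version = 1

    def describe(self, table: str) -> List[str]:
        return list(self._schema[table])

    def add_column(self, table: str, column: str) -> None:
        self._schema[table].append(column)
        self.version += 1                  # the migration signal


class SchemaMemory:
    """Remembers table layouts, but discards them when the schema version changes."""

    def __init__(self) -> None:
        self._seen_version: Optional[int] = None
        self._columns: Dict[str, List[str]] = {}
        self.lookups = 0                   # how often the live database was inspected

    def columns(self, db: ToyDatabase, table: str) -> List[str]:
        if db.version != self._seen_version:
            self._columns.clear()          # drift: old beliefs may be wrong
            self._seen_version = db.version
        if table not in self._columns:
            self.lookups += 1
            self._columns[table] = db.describe(table)
        return self._columns[table]


db = ToyDatabase({"orders": ["id", "total"]})
memory = SchemaMemory()
memory.columns(db, "orders")
memory.columns(db, "orders")               # answered from memory, no new lookup
db.add_column("orders", "currency")
print(memory.columns(db, "orders"), memory.lookups)  # ['id', 'total', 'currency'] 2
```

The memory saves a lookup while the schema is stable and pays for exactly one more after the migration. A system that kept the old layout would be cheaper but wrong, which is the overfitting the task is meant to catch.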

And there’s a poker-style task, which resonated with a lot of people. If there’s an amateur playing with a fixed policy (say, they call every single time), a human would realize this pretty quickly and start exploiting them. But if you run models independently in separate sessions, they never learn to exploit players with fixed policies. These are all examples of shared latent structure that exists in the environment (whether it’s code structure, database schema, data distribution, or the opponent’s policy) that a continually learning model would learn to exploit.
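The poker example comes down to estimating a fixed policy from observations and acting on the estimate. The sketch below is a deliberately simplified toy, not the benchmark’s poker task: OpponentModel and should_bluff are hypothetical, the pot and bet sizes are invented, and the only decision modeled is whether a bluff is profitable, using the standard break-even condition that the opponent must fold more often than bet / (pot + bet).

```python
class OpponentModel:
    """Running estimate of how often one opponent folds to a bet."""

    def __init__(self) -> None:
        self.calls = 0
        self.folds = 0

    def observe(self, called: bool) -> None:
        if called:
            self.calls += 1
        else:
            self.folds += 1

    def fold_rate(self) -> float:
        # Add-one smoothing: with no observations the estimate is 0.5.
        return (self.folds + 1) / (self.calls + self.folds + 2)


def should_bluff(model: OpponentModel, pot: float, bet: float) -> bool:
    """A bluff breaks even when fold_rate == bet / (pot + bet)."""
    return model.fold_rate() > bet / (pot + bet)


stateful = OpponentModel()
for _ in range(10):                        # the opponent calls every single time
    stateful.observe(called=True)

fresh_session = OpponentModel()            # stateless: nothing carried over

print(should_bluff(stateful, pot=2, bet=1))       # False: estimated fold rate is 1/12
print(should_bluff(fresh_session, pot=2, bet=1))  # True: prior of 0.5 exceeds 1/3
```

The model that carried its observations forward stops bluffing into a player who never folds. The fresh session has no evidence and repeats the same mistake.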

VINCENT: One of the most important things we had to iterate on was getting the realism right: consulting real domain experts, iterating on reasonable preconditions and spec changes, and getting the metrics right.

PARTH: Exactly. Working with domain experts is one of the biggest parts of building this benchmark. Even if we’re not experts in poker or data science ourselves, being able to consult with them and iterate on what realistic learning would look like, what realistic performance might look like, and asking: is there a mechanism to actually improve? That’s one of the most important things to validate. If yes, then we can credibly say the benchmark is measuring continual learning.

It was interesting to get intuition for: what would you expect from someone onboarding into this environment versus someone who’s done it for years? What should they be good at, and how do you measure that?


The gain metric

VINCENT: Speaking of metrics, let’s talk about gain. We spent a lot of time internally defining and debating this. What is it, and why did we land on it?

PARTH: Gain is an interesting metric because you don’t see it in traditional language modeling benchmarks. There’s no notion of stateful versus stateless performance. You have a task, you measure it, and you’re done.

Let me start with reward. For every task instance, you need some way of measuring performance: how efficient you were at the code question, what your profit was on the poker task, how close your prediction was. The simplest formulation of a continual learning metric is cumulative reward across the sequence of tasks: how well did you do in total? You then compare systems on that.

The challenge is if you use cumulative reward in isolation, there’s a confound. Say I have GPT-7 and GPT-6 with no difference in their continual learning capabilities. GPT-7 is just smarter overall. When you run them on these sequences and measure cumulative reward, GPT-7 might just do better as a function of having better baseline capability. It might take fewer steps on the database question, be better at data science in general. We try to design tasks so that you’re always required to adapt online over time no matter how good your initial capability is, but we can’t fully separate out that base capability. So if you measured cumulative reward alone, a better base model could show up as a better continual learner even if it isn’t.

To isolate actual learning ability, we introduced gain. The gain metric separates out how much of your performance improvement was from learning while doing the task versus how much was just your baseline capability.

Mechanically: for every system we run it two ways. One is stateful, where the model is allowed to condition on all prior instances and you measure its reward at every step. The other is stateless. We reset the system completely, wipe its memory, and run each instance independently, like a traditional benchmark. The gain at any instance is: stateful reward minus stateless reward. Cumulative gain across the sequence is how much the system learned.
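As a worked illustration of the arithmetic just described, here is a short Python sketch. The function names and all reward values are made up for the example, and it computes raw gain only; how the leaderboard normalizes gain is not specified here, so no normalization is attempted.

```python
from typing import List, Sequence


def per_instance_gain(stateful: Sequence[int], stateless: Sequence[int]) -> List[int]:
    """Gain at each instance: stateful reward minus stateless reward."""
    if len(stateful) != len(stateless):
        raise ValueError("both runs must cover the same task sequence")
    return [s - b for s, b in zip(stateful, stateless)]


def cumulative_gain(stateful: Sequence[int], stateless: Sequence[int]) -> int:
    """Sum of per-instance gains across the sequence."""
    return sum(per_instance_gain(stateful, stateless))


# Hypothetical rewards on a 0-100 scale for a four-task sequence.
learner_stateful = [40, 50, 60, 70]    # improves as it sees more tasks
learner_stateless = [40, 40, 40, 40]   # memory wiped before every task
strong_stateful = [80, 80, 80, 80]     # stronger base model that learns nothing
strong_stateless = [80, 80, 80, 80]

print(sum(learner_stateful), cumulative_gain(learner_stateful, learner_stateless))  # 220 60
print(sum(strong_stateful), cumulative_gain(strong_stateful, strong_stateless))     # 320 0
```

In this toy data the stronger model wins on cumulative reward (320 versus 220) while showing zero gain, which is the confound the gain metric is meant to expose.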

Of course gain doesn’t capture overall capability, so reward, gain, and cost all need to be plotted together. You should want a system that uses minimal compute but gets maximum gain and reward. You can’t reduce everything to a single number.
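One common way to compare systems on several axes at once, without collapsing them into one number, is to keep only the non-dominated ones (a Pareto front). The sketch below is an assumption about how such a comparison could be done, not something the benchmark is stated to use; the Result type and the three example systems with their numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    name: str
    reward: float  # higher is better
    gain: float    # higher is better
    cost: float    # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = a.reward >= b.reward and a.gain >= b.gain and a.cost <= b.cost
    better = a.reward > b.reward or a.gain > b.gain or a.cost < b.cost
    return no_worse and better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


results = [
    Result("system_a", reward=10, gain=3, cost=1),
    Result("system_b", reward=9, gain=1, cost=8),   # worse than system_a on all axes
    Result("system_c", reward=12, gain=2, cost=5),  # more reward, less gain, more cost
]

print([r.name for r in pareto_front(results)])  # ['system_a', 'system_c']
```

Neither system_a nor system_c beats the other on all three axes, so both stay on the front. Choosing between them is a judgment about how much extra reward is worth in gain and cost.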


What good looks like

VINCENT: How do you think about what a genuinely good system looks like?

PARTH: It’s hard to define when it’s multidimensional. Personally I weight gain heavily, and if you keep improving on reward there’s some correlation to gain; these things aren’t fully separated. But when you see a good system, you’ll have some base level of capability tied to the model backbone you’re working with, and then you want to see how far your methods can push performance higher across sequences of tasks.

We put out baseline systems in the initial release, various ways people have tried context management or in-context learning today. There’s definitely headroom for improvement in all the tasks. What I look for: given similar model capabilities, can you improve cumulative reward, improve cumulative gain, and ideally not blow up your cost doing so?

VINCENT: At launch, 25% normalized gain was roughly where the best models were landing. I think in-context learning remains one of the most effective baselines. It’s still number one on the leaderboard, even outperforming fancier context or memory management systems. There was one that had 8x the cost but only a third of the gain overall. What does that say about the agent-memory direction right now?

PARTH: Context management systems are relatively low-hanging fruit. A lot of people can iterate on them, they’re easy to work on, low cost. But the challenge is a lot of this work happened before there were well-established benchmarks for what continual learning should look like. People didn’t know how to measure it, so these systems might have been designed without a clear thing to optimize for. They knew they wanted to use context and remember things and improve over time, but there wasn’t an obvious signal. That probably influenced some of the design failures we’re seeing.

To be clear, I think they can be fixed. If you know what you’re optimizing for and have real signals from benchmarks, you can improve the flaws we see. But that’s why they might not have done as well as some in-context learning systems to begin with. And parametric systems are a whole different design space. They could look completely different.


Failure modes

VINCENT: In the failure mode analysis in the paper, a lot of these models really struggled to update their beliefs. They received feedback but were pretty rigid in how they proceeded despite many instances of new evidence. What did that look like in practice?

PARTH: This tells you a lot about where improvements in context management still need to happen. We saw cases where models were really rigid in their beliefs early on. They picked up on a signal they thought was real, but it wasn’t the actual learning signal in the benchmark, and then they overfit to it. Even when there’s drift in the environment, they really struggled to update their prior knowledge.

Take the database example. The models figured out the schema and the idiosyncrasies of the data, and even when explicitly told there’s been a database migration, they struggled to pick up that things could have possibly changed. It’s not a capability that’s been optimized for. Some models were really struggling with: things have changed in the environment, I need to update my beliefs, this challenges my mental model of the world, so I should update.

In the context management space, I think these brittle failure modes will get ironed out over the course of 2026. More stable systems will emerge. But in the parametric learning world, we’re way earlier. We don’t even fully know what the failure modes are for parametric systems. A lot of people like to talk about catastrophic forgetting. I like to talk about the inability to even learn in a sample-efficient manner in the first place. Put those together and be stable over long horizons. That’s the goal, and there’s much more space to go there.


Most promising systems and parametric methods

VINCENT: What are the most promising systems for continual learning in your view, and how do you think about parametric methods?

PARTH: Maybe a slightly hot take, but I’m personally very optimistic on parametric systems for continual learning. In the short term we’re going to see a lot of progress in context management because it’s easier to iterate on, the lower-hanging fruit. But if you want to see the highest possible gains, the highest possible improvements from experience, I think that’s going to look like a parametric learning system. The intuition: being able to update the base policy as a result of experience influences your future exploration, influences your priors and representations. Compressing information down is also somewhat of a proxy for intelligence. The design space for methods that can parametrically update your base policy is very wide and hard to get right, but the ceiling is going to be higher than context management systems. Context management just won’t scale as well or be as stable.

My current hypotheses point toward investing more in changing the architectures of these models so they are better at continual learning from the ground up. Right now we train language models to a certain point and then say, okay, we want to add new knowledge post-hoc. It’s not clear that the model architectures or training methods up to that point lead to a model that’s conducive to continual learning afterward.

If I were designing from first principles, I’d probably change the architecture to introduce something like a bridge from the context back into the weights. That doesn’t even exist in any architecture right now. The KV cache (the model’s short-term memory) is ephemeral, transient, not persistent. You could imagine architectures more conducive to a meta-learning phase where you spend time teaching the model how to actually learn, and then after that, the learning is just training. No pre-training, post-training, or the Frankenstein approach we have today. I’m really excited about these different architectures and the different training methods and data that would go with them.

VINCENT: As a data researcher, I find it interesting to follow where the systems are going. When we worked on TerminalBench, a lot of our approach was building the training curriculum, driving easier tasks, helping models hill-climb effectively, with each gradation of difficulty requiring slightly different methods.

PARTH: The data is probably going to look pretty different for continual learning. Pre-training data is all the knowledge we put on the internet. But humans are maybe born with an innate ability to know how to learn, and that data doesn’t really exist on the internet. Maybe it’s textbooks or teacher-student interactions, but it’s different. Humans know how to do this learning inherently. The loss functions and data needed to actually induce that behavior in language models are potentially going to look very different. Very exciting from a data research perspective.


Open science and AI safety

VINCENT: I’d love to change gears to talk about a blog post you and Joey Gonzalez put out recently about a false dichotomy in AI today, between unsafe open models and too much power consolidated in a few closed labs. Tell me more about that and what you see as the third path.

PARTH: This is a really pressing and urgent problem. We’ve been talking about it amongst ourselves in the lab for months. We’re heading to a world where people are putting everyone into two camps. Camp one: safety is really important, so shut off open source and open weights, consolidate everything into closed APIs, have safeguards and authentication protocols to more safely distribute capabilities. As we saw with recent events, even that might not be enough assurance from a regulatory perspective. But that’s one point of view.

The other point of view: if you believe in the safety line of work, one of the consequences is a consolidation of power among a few players. Consolidation not just in having the frontier models, but in having the ability to determine who accesses them, having the expertise to influence policy despite conflicts of interest, and having enormous economic power that comes with owning the frontiers of intelligence. Consolidation of power is a real threat to democracy. In a democracy you really want knowledge and information and power to be distributed to elected officials, not concentrated in private companies.

We were growing really concerned that everyone was saying you’re in one of these two camps, they’re both evils, pick the lesser. But we’re creative people. We can think of better solutions.

One path we put forward: what might it look like to have third-party institutions backed by the government, sharing resources with the frontier labs, responsible for training models at the frontier and doing the science? That doesn’t mean releasing open weights. It means building up the expertise to understand what model training looks like, what model evaluation looks like, what alignment science looks like. That informs regulation. Without it, you get bad regulation, a lot of irrational actions from regulators who are uninformed about what’s going on.

You might think: isn’t this what OpenAI was supposed to be? Arguably yes. But the problem was they needed a financial mechanism to support the scale of research they wanted to do, and bringing in for-profit introduces mixed incentives. Back then the government might not have been willing to bet on unproven technology. But we’re not there now. The technology is very proven. The government should step in, support these institutions, and I think it’s actually in the interest of frontier labs too. If you want real informed policy on your models, you need informed people advising policy makers.

VINCENT: What do you think individual researchers can do to drive more of a response to the problems you laid out?

PARTH: Things can be bottom-up and top-down. You need both. If you’re working at a lab, push internally for what open-sourcing different aspects of the work might look like, push the frontiers of science forward. Simultaneously, the labs can’t do this alone. Academics need to get involved, think about what the design of these institutions might look like, how to get involved in advising policy makers, how to think about what AI’s impact on society will look like beyond the scope of the specific research you’re doing.

In grad school especially, it’s easy to get tunnel vision. But thinking through the bigger impacts and figuring out how your research can support that, or how you can get involved in policy design, will determine a lot of how the future of AI goes. There aren’t many people involved right now. It’s a very open space to do a lot of good and impactful work.

I’m also a big personal proponent of open science, making work more legible to the broader community. Benchmarks are a big part of this. There are obviously flaws with only measuring model capabilities through public benchmarks, but it’s one of the best ways to standardize and bring empiricism to where capabilities are going, rather than the one-off vague posts you see on certain social media sites.

And there are projects like Marin, which Percy Liang and the Open Athena team are working on. Literally every component and change being made in building these models from scratch is being documented in the open. That’s one of the most exciting ways to bring visibility into how models are built to the broader public. Open science doesn’t have to just mean open weights if safety is your concern. There’s a lot more we can do.

VINCENT: Even in your own work, “How to Train Your Advisor” reads almost like a technical answer to your own blog. Reclaiming control over black-box models you don’t own. Do you see the research and the policy writing as the same project?

PARTH: I actually didn’t think of advisor models from a policy or open science perspective when I was working on it. I came at it from the lens of customization and specialization. We want the ability to adapt models to our use cases. And in some sense what we were doing was democratizing that ability, because you can’t train the frontier models yourself. If you think there are benefits to parametric learning over your data, training smaller open-source policies to work with frontier models is a way to do that. Framing it as democratizing access to parts of science is a better frame than how I framed it initially.

Continual learning is also one of those things where maybe the end world doesn’t look like single shared frontier models and everyone managing their context. Maybe there are individual models for each of us, deconsolidating power a bit, giving it back to users because you own the intelligence and it’s extremely tailored to you. So I think there’s alignment between the technical research and the policy writing, even if I didn’t come at it that way initially.


Lightning round

VINCENT: Quick lightning round. Quick takes are fine even if unsubstantiated. What’s your timeline for continual learning, or the “takeoff” as some people describe it?

PARTH: I’ll separate those out. For continual learning, in the sense of making systems meaningfully better, I think we’ll see a lot of progress in context management in 2026 and early 2027. Parametric continual learning, which is where I believe the actual end-state of continual learning will be, is much more in the research phase right now. But maybe in 2027 we’ll see really cool parametric solutions. Potentially sooner, but it might take that long. Betting on a research timeline is always hard.

For recursive self-improvement or takeoff, I have much wider error bars. Could be 2028, could be sooner. Closing the loop is much harder to predict than people tend to assume.

VINCENT: Best thing researchers can do to get involved in the deeper implications of their work, policy and societal issues with AI?

PARTH: Get involved in the regulatory discussion. Start writing comments and responses to what’s going on in regulation. Try to advise policymakers. Have conversations with people at frontier labs or in government about what better institutions might look like. And maybe push your own research in that direction: if AI is going in a particular direction, what should I research technically to give more power back to users?

VINCENT: A benchmark you wish existed but doesn’t?

PARTH: Two answers. One: personalization benchmarks. Any benchmark where you’re trying to adapt to a user requires a good user simulator, but user simulation is incredibly nuanced. You can’t reduce users to a few axes. They’re so high-dimensional in how they act. I think personalization benchmarks are really hard to build well and we need more of them.

Two: benchmarks that look not just at AI in isolation, but at humans and AI working together, what human-AI uplift actually looks like. AI in education is a great example, or AI in the workforce. We do a lot to measure AI point capabilities, but AI is interacting with humans and part of that should be improving humans, making them better at learning or doing what they do. There’s very little in that space, and it’s a very open design space.

VINCENT: And how can people contribute to Continual Learning Bench?

PARTH: Two big things: tasks and systems. Join the Discord. We talk about both there and in the docs.

On tasks, we’re always looking for more domain-specific tasks. Legal domain, healthcare domain, what does continual learning look like there? We’re also looking for more long-horizon tasks. The initial set was reasonable in horizon length and not too expensive to run, but continual learning in the end state should be stable over extremely long horizons. More tasks that probe that end state would be very interesting.

On systems: actually benchmarking parametric methods with open-source models. In real sample-efficient learning settings at the frontier, how well do they do? We’d love to see more work there.

Please reach out, join the Discord, and come talk to us.

VINCENT: Thanks so much, Parth. Really excited about these new directions.

PARTH: Thanks for having me. This was super fun.

Vincent Sunn Chen
Research Fellow & Founding Team

Vincent Sunn Chen is a Research Fellow on the founding team at Snorkel AI. His work centers on systems for high quality AI evaluation & data development with experts in the loop. He currently leads the Open Benchmarks Grants, a $3M commitment to funding benchmarks and infrastructure for frontier agents. Prior to Snorkel, Vincent was a researcher at the Stanford AI Lab, where he studied the foundations of data-centric AI systems.
