During our Enterprise LLM Summit, Snorkel AI co-founder Alex Ratner sat down with Stanford Computer Science Professor Percy Liang for a conversation about what he and his colleagues are doing at the Stanford Center for Research on Foundation Models (CRFM), and about why Liang views this as such an exciting time for evaluation in machine learning and in AI generally. The transcript that follows has been edited for clarity and brevity.
Alex Ratner: I’d like to start with the term “foundation model.” We now see people using a number of different terms (large language model, foundation model, generative AI) synonymously. Sometimes they just use ChatGPT as a moniker for one of these large pre-trained, self-supervised models.
Why did you and the CRFM group pick “foundation model”? Were you at all surprised by the robust debate over the terminology? Lastly, how does that terminology align with your perspective on CRFM’s objective?
Percy Liang: When we founded the center, it involved me going around Stanford University and seeing who was interested in this phenomenon that was happening post-GPT-3. It was clear that this was going to be a big paradigm shift. Maybe we didn’t anticipate it would happen so quickly, but we knew that this was going to go somewhere. And we felt that it was a phenomenon that was deeper than just language models.
Technically, a language model is just a model over language. Often (but not necessarily) it’s associated with auto-regressive language models, where you predict the next word. But of course, there are vision models and other modalities. We felt “language model” undersold the potential of these models. So we coined the term “foundation model.”
A foundation model is trained on broad data and can be adapted lightly to a wide range of downstream tasks. We thought a lot about the name and the definition. There is a whole section on naming in the report we wrote, but the important part is that it behaves as a foundation. Instead of people building bespoke models and bespoke datasets in a vertical sense, you have this foundation, which gets built once based on a ton of capital. Then this model can be adapted to a wide range of different tasks, including question answering, customer service use cases, information extraction, et cetera. To us, that signified a paradigm shift in how AI systems were built.
In short, “foundation models” is a class that contains large language models. But it also contains visual language models, things like CLIP, and so on.
AR: That makes a lot of sense. I like the foundation metaphor. There is often some “house building” required on top. Obviously a lot of what we’ve done at Snorkel, Stanford, the University of Washington, and many other places is about that house building on top. How do you adapt it or fine-tune it or otherwise customize it?
For specific enterprise settings (i.e. a group that has its own data and objectives) how do you adapt and customize the model? That’s the house on top of the foundation.
PL: That’s a really important point. If, for example, you think about ChatGPT or people who are trying to build AGI, it’s a whole stack.
“Foundation model” illustrates that we’re not building the whole stack. We’re building the foundation. You can’t move into a house if you don’t have the rest of the house, but once you have a strong foundation, the house is much easier to build.
People want to build different houses in different ways, and the idea is that you should be able to customize the style of the model that you want. Every enterprise has different data, and customization is a key part of the paradigm.
AR: Many enterprises and other real-world practitioners are finding out now that you can’t figure out how to do customization, or identify where you need it, without some evaluation metric that’s fine-grained enough.
What are some of your intuitions on where that “house building” is most needed? Where do these models not work out of the box?
PL: These models really excel in the prototyping phase. Rather than going to a meeting and making a month-long plan to build some prototype, you prompt it and in five minutes you have a working prototype. The power of being able to do that can’t be overstated. It helps you to brainstorm and to think of things that you might not have on your own.
Prototyping obviously is very different from building an actual robust system. As models get better, of course, you start pushing your capabilities out until you can build something that’s not just a prototype, but actually a working system. In the limit, for an application that has many users and custom data, and in which you want fine-grained control, I think you have to customize and fine-tune on that data. Otherwise, you’re losing a lot of potential.
It is a progression, and in certain cases you can get quite far out-of-the-box. But if you’re generating user feedback, there has to be a way to use it, because the system is not going to read your mind. It can’t be perfect.
AR: It makes a lot of sense. There may be universal grammars in various data modalities out there. But for what you want to do—what your users want, what your enterprise wants—there’s no mind reading.
I really like that metaphor of “first mile” acceleration, and then “last mile” tuning and development. We’re also seeing a trend where you’ll use one of these massive generalist models to do the first mile, and then you’ll not only fine-tune but also distill or shrink the model into smaller, cheaper, lower-latency specialists.
This naturally leads us to evaluation. If you’re going to do your first-mile explorations, you can’t responsibly ship something to production and you can’t find out what tuning you need in order to traverse that last mile unless you have a sense of how the model is performing. A big part of what you’ve been focused on is the HELM [Holistic Evaluation of Language Models] project. We’d love to hear a little bit about that.
PL: Yes, there is so much to say. Let’s see what we can do.
We released HELM about a year ago. It stands for holistic evaluation of language models. We’re trying to develop a standardized way of evaluation. We started with language models, but now we’re moving to multimodal models as well. The challenge is: how do you evaluate a language model? It’s a generalist system. It’s not like you’re evaluating a spam classifier and you’re getting just a notion of accuracy or AUC. You have to imagine and cover the space of possible use cases, which is challenging. On top of that, we focused on not just accuracy but also bias, robustness, calibration, efficiency, and all these additional factors. It’s holistic because we try to look at all of these different dimensions and scenarios, across about 30 separate models, and evaluate systematically.
It’s on the website as a community resource—all the predictions, the code, everything. We’ve been updating it over time because a model comes out every week or so and we are trying frantically to keep up.
You should think about HELM as a framework for evaluation, where you can come with a particular evaluation dataset, which could be custom, or you come with your model, which could be a standard model or a custom model that you fine-tune. We then do the prompting and the finagling to produce numbers and reports. We’ve seen people and companies using it for their own purposes, and that’s quite exciting. I hope that it will grow into a much more standard platform for the evaluation of foundation models.
I should mention that we announced today that we’re working with MLCommons to develop safety evaluations on top of HELM, which is really exciting because that’s an area of utmost concern.
We should think about evaluating language models in terms of both upstream and downstream. The language model is upstream, and the evaluations give you a sense of what the models are capable of. This is not necessarily the metric that you would evaluate if you were looking at the product, but I think it’s valuable to evaluate upstream metrics.
Later, when you get product experience, you look at whatever tasks and product-specific metrics and try to correlate them with the upstream metrics. Then you have (potentially) a good guide for understanding accuracy on MedQA. I know that this is not perfect, but maybe it’s a weak indicator that this model is actually better at medical knowledge, which is relevant for my domain.
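As a rough illustration of the upstream/downstream comparison described above, here is a minimal sketch that computes a Pearson correlation between an upstream benchmark score and a downstream product metric across several models. The model names and every number are hypothetical, invented only for this example, and the helper `pearson` is not part of HELM.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
# Upstream: accuracy on a public benchmark. Downstream: a product satisfaction score.
upstream_accuracy = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.70}
downstream_csat = {"model_a": 3.1, "model_b": 3.6, "model_c": 4.1}

models = sorted(upstream_accuracy)
r = pearson(
    [upstream_accuracy[m] for m in models],
    [downstream_csat[m] for m in models],
)
print(f"upstream/downstream correlation across {len(models)} models: {r:.3f}")
```

With only a handful of models, a correlation like this is at best the weak indicator described above, not proof that the upstream benchmark predicts product quality.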
AR: I like that you make it clear that HELM is a framework. There’s a benchmark that people think about, but HELM is really a framework for building and running benchmarks—which is quite powerful given that a lot of the real production and last-mile development is in using custom datasets and custom objectives.
In terms of upstream and downstream, I imagine a scenario in which downstream is user CSAT or feedback scores, and upstream are these generic, public benchmarks. Do you think there’s a middle ground—especially for enterprises that have their own sub-tasks or sub-datasets—that is missing for enterprise evaluation?
PL: What we’re evolving HELM into, besides the framework, is having this notion of “Helmlets” (or “Helmets,” we haven’t decided on the exact name), where we take slices of the use cases. So, for example, coding, or the medical domain, or the legal domain, or the financial domain, or different languages. We have one where some folks from Hong Kong University have developed a Chinese evaluation, which we’ve integrated into HELM. And Professor of Computer Science Bo Li has developed the DecodingTrust benchmark, which is being integrated into HELM and which captures certain aspects of trustworthiness.
So we think about these “sub-domains,” if you will, which can capture the specific things that a user might be interested in. I think we’re going to start thinking about the customer service domain as well, working with some companies on that.
We hope to populate HELM with a wide variety of benchmarks that enterprises would care about. Then the task is less about coming up with a new benchmark and evaluating but just curating. A user’s approach would be: “I care about this, I care about that, and I care about that. Now, rank the models.”
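The "I care about this, I care about that, now rank the models" workflow can be sketched as a weighted average over whichever benchmarks a user selects. This is a minimal sketch under stated assumptions: the function `rank_models`, the benchmark names, and all scores are hypothetical, and this is not HELM's own aggregation method or API.

```python
def rank_models(scores, weights):
    """Rank models by the weighted mean of the benchmarks a user cares about.

    scores:  {model_name: {benchmark_name: score}}
    weights: {benchmark_name: non-negative weight}; benchmarks that are
             not listed in weights are ignored.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one benchmark needs a positive weight")
    ranked = []
    for model, per_benchmark in scores.items():
        missing = [b for b in weights if b not in per_benchmark]
        if missing:
            raise KeyError(f"{model} has no score for {missing}")
        value = sum(w * per_benchmark[b] for b, w in weights.items()) / total
        ranked.append((model, value))
    # Highest weighted score first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores in [0, 1], invented for illustration only.
scores = {
    "model_a": {"coding": 0.80, "medical": 0.50, "legal": 0.60},
    "model_b": {"coding": 0.60, "medical": 0.90, "legal": 0.70},
}

# A user who cares mostly about the medical slice and somewhat about legal.
weights = {"medical": 2.0, "legal": 1.0}

ranking = rank_models(scores, weights)
for model, value in ranking:
    print(f"{model}: {value:.3f}")
```

A weighted mean is only one possible choice. It assumes the selected benchmarks are scored on comparable scales, which would need to be checked before trusting the ranking.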
AR: A lot of people have likened this ecosystem to a “family tree” of models. You’d have a lot of generalist model lineages, but then you’d increasingly see this family tree of specialized models: specialized for the domain, the sub-domain, the specific group, the specific task, et cetera.
I can imagine evaluations would naturally track that. You’d have the generic ones, then the domain-specific ones, then successively more fine-grained ones as you get more precise.
It’s cool to see how that’s going to get supported. And I love the naming! Every good academic project has to have a properly goofy name.
PL: One more comment I want to make is that, in the era of foundation models, it’s an exciting time for evaluation. Before, in machine learning, you had to get a dataset, you had to hire a whole data team and annotate, and then you could divide and train. The ability to do few-shot learning or even zero-shot learning means that you can focus on just evaluation and you can get domain experts who will sit down and tell you what they want. Then you can rely on the general abilities or raw strength of the models to deliver something interesting.
So, I’m optimistic that we’ll see a lot more interesting evaluations coming online as these models get stronger. That’s exciting because that’s what we always wanted in machine learning: evaluation rather than being stuck with these synthetic or semi-synthetic datasets of the past, which everyone complains about.
AR: Yes. It goes back to your point about lowering the barrier for the first-mile exploration. You always want to be doing test-driven or evaluation-driven development. Start by focusing on where the model is performing poorly. And now, you can get to that point much faster. The ability to get to the evaluation sooner and spend more time there, then do your development, your adaptation, your fine-tuning, etc., driven by that evaluation, seems very exciting as a better development paradigm.
PL: Yes. Totally. We agree.
AR: Percy, thank you so much for spending the time with us and with this group today. You’re doing some awesome work there and the whole community is grateful for it and we really appreciate getting your thoughts on it today.
PL: Thanks for inviting me, Alex.