An exciting chat between Alex Ratner, co-founder and CEO, Snorkel AI, and Simran Arora, ML researcher, Stanford University, about new research from Hazy Research on how prompting methods enable a 6B parameter model to outperform the 175B parameter GPT-3.

Snorkel AI co-founder and CEO Alex Ratner recently interviewed several Snorkel researchers about their published academic papers. In the video above, Alex talks with Simran Arora about the work she did on improving the effectiveness of foundation models using an approach centered on “Ask Me Anything”-style questions. Simran spoke at Snorkel’s Foundation Model Virtual Summit on January 17. Below follows a transcript of their conversation, lightly edited for readability.

Alex Ratner: You have a whole bunch of work both out there in the literature and in the works. But I thought we would stay on the AMA paper, “Ask Me Anything: A simple strategy for prompting language models.” So maybe just to kick it off, would you mind telling us a little bit about the method that you’ve developed here, but also some of the motivations, the bigger-picture contextualization, and some of the awesome results that you got?

Simran Arora: Yeah, for sure. So, AMA is a recent work. Just to provide a little bit of background, foundation models are amazing. We’re super excited by their potential. With these models, both ML and non-ML experts can express their goals to the model via natural language specifications of the task, which are commonly referred to as prompts.

The ability to prompt foundation models—to perform a range of tasks with no additional training requirements—allows us to really rapidly prototype new ideas and build apps in hours that might have previously taken years. But writing these natural language prompts can be quite a brittle process. Small modifications to the prompt—which can include the types of demonstrations of our task, or even small things like formatting of the prompt—can just cause very large variations in the types of predictions that the model will make.

And so it’s really painstaking to try to get prompts that are of high quality, and when you go to even smaller foundation model variants, that brittleness only increases. To mitigate this high degree of effort in prompting, we propose this method called Ask Me Anything (or AMA for short), which at a high level applies multiple decent but ultimately noisy, imperfect prompts to each inference example and aggregates over their predictions using weak supervision to produce the final result. In the work, we report both on what makes for a decent, effective prompt and on how to reliably aggregate over the multiple prompts.
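The high-level recipe described above can be sketched in a few lines. This is a toy illustration, not the paper’s implementation: the prompt templates and the `lm_predict` stub are hypothetical stand-ins (a real system would query an open-source LM, and AMA aggregates with a weak-supervision label model that learns per-prompt accuracies rather than the plain majority vote used here).

```python
from collections import Counter

def lm_predict(prompt: str) -> str:
    """Hypothetical stand-in for a language model call.

    A toy keyword heuristic keeps this sketch runnable without a model;
    in practice this would query an open-source LM with the prompt.
    """
    return "yes" if "great" in prompt.lower() else "no"

# Several "decent but noisy" prompt templates for the same toy sentiment
# task. These templates are illustrative, not taken from the paper.
PROMPT_TEMPLATES = [
    "Is the following review favorable? Answer yes or no.\nReview: {x}\nAnswer:",
    "Review: {x}\nQuestion: Did the reviewer like it? Answer yes or no.\nAnswer:",
    'Would you say this is a favorable review: "{x}"? yes or no:',
]

def ama_predict(x: str) -> str:
    """Apply every prompt to the input and aggregate the noisy votes.

    AMA uses a weak-supervision label model over the votes; an unweighted
    majority vote is a simple stand-in for that aggregation step.
    """
    votes = [lm_predict(template.format(x=x)) for template in PROMPT_TEMPLATES]
    return Counter(votes).most_common(1)[0][0]

print(ama_predict("A great, delightful film from start to finish."))
```

The key idea is that no single prompt needs to be perfect: each template casts a vote, and the aggregation step smooths over the per-prompt noise.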

Just to highlight the headline results, AMA surpasses the few-shot prompted 175-billion-parameter GPT-3 model from OpenAI on 15 popular benchmark tasks with an open-source 6-billion-parameter model.

Beyond just the comparison between the 6 billion and the 175 billion model, one other thing that we really wanted to highlight is we studied the benefits of AMA on 14 open-source language models, and found significant boosts across the board. 

This is really important because, despite the promise of foundation models, there is a major scalability challenge. Performance tends to get better with larger foundation models, but models at the 175-billion-parameter scale are costly for most researchers and many organizations to host locally. Models that are only accessible via APIs, like the OpenAI models, are also expensive, and they’re hard to use if you have private data—like a private patient medical record that you can’t just send to an API. So, we’re really excited about getting small open-source foundation models to perform better, so that users don’t have to sacrifice performance for accessibility or privacy.

AR: That’s an awesome framing and objective. One thing that I wanted to jump in on for a bit is the advent of prompt engineering and how you view that. I’ll draw a quick parallel back to some of the pre-prompting, pre-large-language-model weak supervision work that you’re deeply familiar with and that we worked on in the lab. Some of that came from a transition in how we viewed training data labeling—from an ad hoc thing that you did, or someone else did, to an actual engineering task—and then to asking how we partially automate, assist, accelerate, or optimize that training data engineering task.

Chris [Ré] would say he’s bored of it by now, but some of us are still pushing in that direction. Now we have a similar variant around these prompts that unlock the zero-shot capabilities of foundation models, where first it was just kind of ad hoc, and now we’re actually hearing people talk about it as an engineering activity—one, to your point, that’s quite brittle and difficult to get to work. Would it be fair to position your work as saying “yes, there is gonna be prompt engineering, and we can accelerate and partially automate it with principled techniques like the chaining, weak supervision modeling, and templatization” that you contributed here? Or do you view it a different way?

SA: I’ve looked at a range of applications where users have wanted to use prompting for different data exploration tasks and data management tasks, and there are a set of tools that are more principled and can consistently provide improvements.

For example, there is great calibration-style work from Berkeley, and the weak-supervision-style approach rests on some fundamental underlying principles. With prompt engineering, another trend is that, with instruction fine-tuning, models are getting better at following instructions with even less effort.

Even with ChatGPT, some people find it to be very verbose or to have other quirks that you don’t necessarily like. Even as new, somewhat better models come out, there is still a need to guide them and engineer them in specific ways.

So, I do see this need for prompt engineering continuing, even as we have these different varieties of models come out and we see progress on foundation models. 

AR: Yeah. Makes a lot of sense. In other words, it likely is something that we’re gonna be seeing more of, but it’s probably not gonna be someone crowdsourcing on Twitter the right way to phrase a prompt. It’s gonna probably be more of building on top of principled approaches for tuning, optimizing, and composing them like the AMA contribution. 

It’s super exciting. I’m sure we’re gonna see a ton more coming from you and the folks in the lab, with these prompt engineering, data-centric ideas applied to foundation models. Super exciting area. 

And Simran, thank you so much for taking the time. 

SA: Thanks, and I just wanted to close out with a quick shout-out to the amazing weak supervision work that you and others at Snorkel have done, which we really built on in AMA, and also to my great collaborators Avanika, Mayee, Laurel, and others from Hazy Research.

AR: Awesome. Well, thank you so much and I’m really excited to see everything that all of you have coming out soon and to see your talk at the summit.

You can register for a live demo of Snorkel Flow on February 16, which will feature the platform’s new FM capabilities.