Prompting and weak supervision to build better, smaller models
Snorkel AI co-founder and CEO Alex Ratner recently interviewed several Snorkel researchers about their published academic papers. In the video above, Alex talks with Ryan Smith, Senior Applied Scientist at Snorkel, about the work he did on using foundation models such as large language models to build compact, deployable, and effective models. Below follows a transcript of their conversation, lightly edited for readability.
Alex Ratner: Lots of cool work going on in your neck of the woods. Obviously, I’m a little biased. But I think, objectively, it’s a very defensible statement. Today, I’ll focus on one paper that you and the team recently posted called “Language Models in the Loop: Incorporating Prompting into Weak Supervision.” At a high level, it’s a pretty cool example of using foundation models to train other models rather than directly trying to plug them into production—which is a pretty interesting new line of work that you all posted back in May. Why don’t you tell us a little bit about it?
Ryan Smith: Yeah, that was a really cool experience. Right before we started work on that, I gave a little ML Whiteboard talk here at Snorkel; it’s on YouTube, and I think you can find it. It basically goes over some prompting methods, and at the end there was a little cutaway segment about how, in the future, we could put this into the weak supervision pipeline, and about where we thought these zero- and few-shot methods would end up as they got more robust.
Right after that, we had a talk with some of the research scientists at Snorkel, and we decided to run some experiments around it. It started with us saying: “Okay, let’s use a few zero- and few-shot methods as labeling functions in a weak supervision pipeline, just to see if this will boost our performance. Is there signal to be had here?”
We ended up using the same model, T0++ from BigScience, and we were able to ask it a bunch of different questions encoding complementary domain knowledge. Each of those questions ended up boosting performance in a really cool way.
We ended up with a model that was better than if you had coded in those rule-based decisions yourself. As for the motivation, I would say there are a few reasons, but two prominent ones. One is that in your typical zero- or few-shot setting, you’re not really able to adapt those models to subsets of your data, or to edge cases that you find once your model is in production. When you instead put those models into a weak supervision pipeline, you get a lot more flexibility: you still keep the good parts of those models (where they do well, they still do well), but you’re not glued to their failure modes. You can write other labeling functions to cover those cases and adapt from there.
The other big point is that this allows you to train a much smaller model for production, which I think is a huge win for everyone, and to cut down your deployment costs.
AR: There’s a lot of awesome stuff to unpack there. I’ll start backwards with the motivation. You mentioned this idea of being able to correct and refine and adapt these foundation models, beyond zero-shot or push-button methods. Second, rather than trying to deploy them directly, which most organizations that we work with can’t do, you’re using them to supervise smaller, deployable, specialist models. Is that a fair breakdown of the motivations?
RS: Absolutely. Yeah. Especially on that deployment point.
In my experience, once you’re in deployment, you’re seeing orders of magnitude more data; that’s where your main inference cost is going to be. So when you’re applying these models at training time, even a large dataset is still just a drop in the bucket compared to how much you would end up spending to serve one of them in production.
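As a minimal sketch of that trade-off, the snippet below trains a tiny scikit-learn pipeline on weakly labeled data; the `texts` and `weak_labels` here are hypothetical stand-ins for the output of a weak supervision pipeline, and the tiny model, not the foundation model, is what gets served.

```python
# Minimal sketch of the deployment win: the model that ships is a small
# scikit-learn pipeline trained on weak labels, not the foundation model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for weakly labeled training data (1 = positive class).
texts = ["click this link to win a prize!!!", "great video, thanks for posting"]
weak_labels = [1, 0]

small_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
small_model.fit(texts, weak_labels)

# Inference now costs a sparse feature lookup and a dot product per example,
# rather than a multi-billion-parameter forward pass.
print(small_model.predict(["check out my channel at this link"]))
```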
AR: And governance and other constraints as well. A lot of folks I know can’t just deploy GPT-3 to production, for good reasons around cost and inference time, but also because of our ability to put governance constraints around these models, understand what’s going on, and know what we can guarantee. So I really like this line of work that you and the team started: using foundation models to train smaller, deployable, specialist models rather than trying to deploy the foundation models directly.
Let’s unpack that second idea: rather than taking a push-button, zero-shot approach, you’re asking multiple questions, or prompts, of the foundation model, which allows you to tune, narrow, or refine what actually comes out of it. Maybe you can walk us through an example of what that would look like in a real application.
RS: Yeah. This was something that I think was kind of a surprise.
To go down to a lower-level view, imagine that you’re trying to classify comments in one of the datasets we used in the paper, YouTube comments, as either spam or not spam. You could straight up ask the model: “Okay, is this comment spam?” and the model will give you an answer, depending on how well it knows what the spam token actually refers to and how much it picks up from the comment itself.
What you could do instead is ask some leading questions, in the style of weak supervision. I, as a human with knowledge about what spam comments are likely to include, could ask something like: “Is this comment asking me to take an action, like clicking a link?”
The model might treat taking an action as totally uncorrelated with spam on its own, but now we’re able to make that connection ourselves, instead of relying on the weights of the neural network to have it coded in already. So you can ask a few different questions that are all related to spam: “Does this ask me to take an action?” “Does this ask me to listen to a song?” And a “yes” to a question like “Does this comment express a strong sentiment?” would indicate that it’s not spam. You start encoding your domain knowledge that way, which ended up being really cool.
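As a rough sketch of how those prompted labeling functions could be wired together, the snippet below uses Hugging Face’s transformers library and Snorkel’s LabelModel. The prompts, the yes/no vote mapping, and the stand-in model (google/flan-t5-base in place of the paper’s much larger T0++) are illustrative assumptions rather than the paper’s exact setup.

```python
# Sketch: each prompt acts as a labeling function that votes SPAM, NOT_SPAM,
# or abstains; a label model then combines the noisy votes into training labels.
import numpy as np
from transformers import pipeline
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

# Small instruction-tuned stand-in for T0++ so the sketch runs locally.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Each (question, vote) pair encodes one piece of human domain knowledge.
PROMPTED_LFS = [
    ("Is this comment asking the reader to take an action, like clicking a link?", SPAM),
    ("Is this comment asking the reader to listen to a song?", SPAM),
    ("Does this comment express a strong opinion about the video itself?", NOT_SPAM),
]

def prompted_vote(comment: str, question: str, vote_if_yes: int) -> int:
    """Ask the model a yes/no question about the comment and map 'yes' to a vote."""
    out = generator(f'{question}\nComment: "{comment}"\nAnswer yes or no.')
    answer = out[0]["generated_text"].strip().lower()
    return vote_if_yes if answer.startswith("yes") else ABSTAIN

comments = [
    "Check out my new channel and click the link below!",
    "This song still gives me chills every time.",
]
# L[i, j] holds labeling function j's vote on comment i.
L = np.array([[prompted_vote(c, q, v) for q, v in PROMPTED_LFS] for c in comments])

# The label model denoises and reweights the votes into one training label per
# comment (a real run would use far more comments than this toy set).
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=100, seed=42)
train_labels = label_model.predict(L)
```

The design point is that no single prompt decides anything: each one contributes a column of noisy, often-abstaining votes, and the label model arbitrates among them.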
As I alluded to, there were a bunch of hard-coded rules that we were mirroring for this domain knowledge. Wrench, a great open-source resource that one of your students developed, Alex, provides a benchmark of weak supervision tasks with hard-coded labeling function (LF) rules. We took those hard-coded rules, converted them to prompts for the model, and the resulting end model did better in two out of three cases, which was really cool to see.
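To give a flavor of that rule-to-prompt conversion, a hard-coded keyword rule and a prompted counterpart might look like the pair below, reusing `prompted_vote` from the previous sketch; both rules are illustrative, not the benchmark’s actual labeling functions.

```python
# Sketch: the same domain knowledge as a hard-coded keyword rule versus a
# prompt. The keyword rule fires only on an exact substring; the prompted
# version lets the foundation model generalize to paraphrases such as
# "take a look at my page". (Illustrative, not Wrench's actual rules.)

def keyword_lf(comment: str) -> int:
    return SPAM if "check out" in comment.lower() else ABSTAIN

def prompt_lf(comment: str) -> int:
    # Reuses prompted_vote() from the earlier sketch.
    return prompted_vote(comment, "Is this comment promoting another channel or video?", SPAM)
```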
AR: That’s super cool. So what you’re saying is: by using your domain expertise to make multiple targeted prompts, or questions, of these foundation models, and then combining them in smart ways using weak supervision, you can actually get better results in smaller, more deployable models than if you just took the push-button approach of using a foundation model directly.
We just did a couple of these mini-discussions right before this, and there’s a common theme that really comes down to: how do we combine these big foundation models with domain-expert knowledge to get more out of them? And in the case of your work: how do you then also get that into a deployable format, the smaller model that you can actually ship in an enterprise setting? So, super cool stuff.
Thank you so much, Ryan.
RS: Yeah, thank you.
You can register for a live demo of Snorkel Flow on February 16, which will feature the platform’s new foundation model (FM) capabilities.