How Foundation Models bolster programmatic labeling
Snorkel AI co-founder and CEO Alex Ratner recently interviewed several Snorkel researchers about their published academic papers. In the video above, Alex talks with Mayee Chen about the work she did on improving the effectiveness of programmatic labeling through foundation models on both NLP and vision tasks. Below follows a transcript of their conversation, lightly edited for readability.
Mayee Chen: Okay. Hi everyone. My name’s Mayee Chen. I am a PhD student in the computer science department at Stanford, advised by Chris Ré working on some broad themes of understanding data-centric AI, weak supervision and theoretical machine learning.
Alex Ratner: I’m so excited to speak with you today. Today we’ll focus on the LIGER paper—I think that the title is “Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision.” This is an awesome paper, one of these intersectional studies of how you take two major ideas, around how you can be more data efficient and how you can actually accelerate data-centric AI, and put them together in a really smart and principled way that actually has impact.
This was, I think, a UAI paper, if I’m not mistaken.
MC: Yes, that’s correct.
AR: So maybe you could tell us a little bit about the paper and the work.
MC: Yeah, sure. So I think the motivation here is that foundation models are like a very new and exciting paradigm.
These are large pre-trained models such as GPT-3, BERT, CLIP, and DALL-E, and they’re very exciting because they allow you to construct models out of the box with very limited labeled data or labeled examples. But the challenge there is that it’s not really clear how to apply these foundation models when we don’t have any labeled data at all.
Here we draw upon a lot of excellent recent work done in weak supervision, which is a nice programmatic way of producing labeled data from unlabeled data using weak sources like heuristics, crowd workers, or external knowledge bases. So, we’re interested in looking at the intersection of these two concepts: foundation models and weak supervision.
The nice thing here is—if you think about it—they offer complementary sources of signal. So, foundation models are pre-trained on huge corpora of data, and they have a lot of general information from the web or from these datasets. On the other hand, weak supervision practitioners are writing or specifying functions that can have very tailored, high-precision but low-coverage signals. We can combine them and get the best of both worlds. That’s what this paper looks at, and I can go into more detail.
AR: First of all, I just think it’s an awesome idea. These are complementary ideas. One idea that I think many of us have thought about with this kind of programmatic labeling or weak supervision work is this idea of trying to bridge expert knowledge with statistical or data-driven knowledge.
You have something that a domain expert can say that’s very targeted and specialized at the task you’re trying to accomplish. Probably the foundation model is not specialized or fine-tuned for that task when you look at real-world things. But this expert knowledge, or we call it a labeling function, often is going to be very brittle.
Whereas you have these foundation models that are extremely good at generalizing. But they’re very general, right? They’re not targeted at the specific task. So using the one to complement the other or bridging between them, that’s one way that I had thought about your work and some of the bigger objectives here.
What’s an example of where you might want to use this kind of approach where you’re mixing programmatic labeling with foundation models?
MC: That’s a great question. So what you just said, I really agree with. We need additional information to often adapt foundation models to particular tasks.
In our setting, rather than looking at the full foundation model itself, which can be very difficult or expensive to fully fine-tune, we’re just saying: “suppose we only have access to the foundation model’s embeddings.” And you can usually get those via some API for all these big models.
The ingredients we have are the foundation model’s embedding space—to think about it a bit more theoretically—and then we have all these labeling functions, these weak sources of signal on our unlabeled data that are offering a vote on each data point for what it should be.
Our framework is very helpful in these settings where traditional weak supervision tends to fail a bit. Our approach to this problem of combining foundation model embeddings and weak supervision was to say: “what are the main modeling challenges in weak supervision? And can we use this nice embedding space to solve them?”
I’ll briefly go over the two challenges. The first one is that weak supervision typically models the accuracy parameters for each weak source at a very coarse level. This means that for each weak source, we give it a score that’s a single scalar. And when we’re combining that source’s vote with other sources, we just give it that scalar weight regardless of what sort of data point you’re seeing.
This isn’t going to really capture variations in your dataset. If you think your dataset actually consists of two categories—or two subgroups—that have very different label distributions, then learning an accuracy parameter on your entire dataset and not the subgroups is going to be a little bit imprecise. We tackle that by learning clusters in the foundation model’s embedding space and treating those clusters as the subgroups, and basically learning a weak supervision model on each of those clusters.
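To make the cluster-conditional idea concrete, here is a minimal sketch (not the authors’ code) of how it might look with off-the-shelf tools: cluster the foundation-model embeddings with k-means, then fit a separate Snorkel `LabelModel` within each cluster so that every source gets its own accuracy parameters per subgroup. The matrix layout, cluster count, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: cluster-conditional weak supervision using foundation-model
# embeddings. Illustrative only, not the LIGER implementation.
import numpy as np
from sklearn.cluster import KMeans
from snorkel.labeling.model import LabelModel

def fit_per_cluster_label_models(L, embeddings, n_clusters=2, cardinality=2, seed=0):
    """L          : (n_points, n_sources) labeling-function votes, -1 = abstain.
    embeddings : (n_points, d) foundation-model embeddings (e.g., CLIP or BERT)."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    probs = np.zeros((L.shape[0], cardinality))
    models = {}
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        lm = LabelModel(cardinality=cardinality, verbose=False)
        # Each cluster gets its own source-accuracy parameters, instead of
        # one coarse scalar accuracy per source over the whole dataset.
        lm.fit(L[idx], seed=seed)
        probs[idx] = lm.predict_proba(L[idx])
        models[c] = lm
    return probs, clusters, models
```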
The second challenge is that the weak sources tend to abstain a lot. Users will write very precise labeling functions, but they tend to only vote on very small sets of your unlabeled data. Oftentimes, the labeling functions will just abstain. This abstain rate can be really high, and if you have a point where all the sources you’ve specified abstain, then you’re in deep trouble.
So, we propose to do this sort of K-nearest-neighbors-type extension per source in the embedding space. We’re essentially propagating the votes to nearby points that have abstains. For instance, if we have a labeling function for sentiment that fires on the words “awful” and “terrible,” then it’s not going to catch the word “horrible.” But we know that “horrible” is very similar to “terrible.” These words are going to be embedded very close together, and our method will capture these extra keywords or similar concepts that might have slipped by when the user was writing these labeling functions.
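Here is a small sketch of that nearest-neighbor extension, again as an illustration rather than the paper’s implementation: for each source, points where it abstains inherit the vote of the nearest point in embedding space where it did fire, as long as that neighbor is within a distance threshold (the threshold value is an arbitrary placeholder).

```python
# Hedged sketch of extending each weak source via nearest neighbors in the
# foundation model's embedding space. Illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

ABSTAIN = -1

def extend_votes(L, embeddings, radius=0.5):
    """L: (n_points, n_sources) votes with -1 = abstain.
    embeddings: (n_points, d) foundation-model embeddings."""
    L_ext = L.copy()
    for j in range(L.shape[1]):
        fired = np.where(L[:, j] != ABSTAIN)[0]
        abstained = np.where(L[:, j] == ABSTAIN)[0]
        if len(fired) == 0 or len(abstained) == 0:
            continue
        nn = NearestNeighbors(n_neighbors=1).fit(embeddings[fired])
        dist, idx = nn.kneighbors(embeddings[abstained])
        # Only propagate when the nearest firing point is close enough;
        # e.g., "horrible" picks up the vote that fired on nearby "terrible".
        close = dist[:, 0] <= radius
        L_ext[abstained[close], j] = L[fired[idx[close, 0]], j]
    return L_ext
```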
AR: That makes a ton of sense. If I could try to read it back to you quickly: take a canonical ML problem. Is this email spam or not? If you have a really challenging version of this problem, and you want to apply a model the classical way, you have to just label a bunch of data by hand. Now, the first jump that puts us in this kind of weak supervision world is to write functions or programmatically supervise.
So let’s ask our email spam expert to say: “What pattern should we be looking for? What labeling functions?” And now this email spam expert comes and says: “okay, I’m looking for the words ‘wire transfer.’ If I see ‘wire transfer’ in the email, I think it’s more likely to be spam.”
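For readers who haven’t written one before, that “wire transfer” heuristic might look something like this as a labeling function in the open-source Snorkel library. This is just a sketch; the `.text` field and the label values are assumptions about how the emails are represented.

```python
# Sketch of the "wire transfer" heuristic as a Snorkel labeling function.
from snorkel.labeling import labeling_function

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_wire_transfer(email):
    # Vote "spam" if the phrase appears; otherwise abstain rather than guess.
    return SPAM if "wire transfer" in email.text.lower() else ABSTAIN
```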
And then we have these two challenges. Where is that a good labeling function? Where is it a bad one? Maybe we learn that it’s actually a really good indicator of spam in general email, but in certain emails, it’s not. It’s actually quite the opposite. It’s actually quite inaccurate as a signal of spam.
Second, we also want to handle more generality of the patterns. Like maybe it’s not just “wire transfer,” it’s “money order,” “cash…” there ends my knowledge of the financial transfer system. But is that a fair capture? When we take in these programmatic labeling inputs, how do we know where to trust them at a more granular level? And how do we generalize beyond the brittle pattern by which they’re expressed originally? You’ve been able to solve that using these embeddings from foundation models, with principled underpinnings, in this work.
Is that a fair read-back?
MC: Yeah, that’s completely correct.
Your example is really good. That’s basically what we do, and the nice thing about the principled aspect of it is that we have theoretical guarantees for our entire framework. The key thing they rely on is this notion that the embedding space is smooth with respect to the task you’re evaluating on.
We have this quantity we can measure that corresponds to how well aligned your foundation model’s embedding space is with the task you’re looking at. We took the metric and we looked at it for different sorts of embeddings. For some vision tasks, we looked at CLIP embeddings versus raw pixel space, and we confirmed that CLIP embeddings give us the best version of our framework, and they are also the smoothest with respect to the tasks we evaluate on.
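As a rough intuition for that alignment measure, one simple proxy is to ask how often a point’s nearest neighbor in embedding space shares its label on a small probe set: higher agreement suggests a smoother, better-aligned embedding space for the task. This is an illustrative stand-in, not the exact quantity defined in the paper.

```python
# Rough proxy for embedding-space "smoothness" with respect to a task.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_label_agreement(embeddings, labels):
    """embeddings: (n, d) array; labels: (n,) ground-truth labels on a small probe set."""
    nn = NearestNeighbors(n_neighbors=2).fit(embeddings)  # neighbor 0 is the point itself
    _, idx = nn.kneighbors(embeddings)
    return float(np.mean(labels[idx[:, 1]] == labels))

# e.g., compare CLIP embeddings against flattened raw pixels on the same probe set:
# neighbor_label_agreement(clip_emb, y) vs. neighbor_label_agreement(pixels.reshape(len(y), -1), y)
```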
It’s a very cool theoretical connection there.
AR: Yeah, that was super cool. Usually, the only knob you have in a lot of these bounds is sample complexity, or the number of data points, and here you actually have this kind of curvature or embedding-space parameter that you can actually map to real-world datasets. Super cool to have that tighter mapping to the theoretical bounds.
Mayee, thank you so much for taking the time today to talk about your awesome work.
MC: Yeah. Thank you so much for having me.
You can register for a live demo of Snorkel Flow on February 16, which will feature the platform’s new FM capabilities.