Contrastive Learning Boosts Foundation Model Specialization
Snorkel AI co-founder and CEO Alex Ratner recently interviewed several Snorkel researchers about their published academic papers. In the video above, Alex talks with Ananya Kumar about the work he did on improving the effectiveness of foundation models by using contrastive learning, image augmentations, and labeled subsamples. Below follows a transcript of their conversation, lightly edited for readability.
Alex Ratner: You have a ton of really exciting work that you’ve been putting out there, but I’m going to focus on just one paper today. So this is “Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation.” I’ll kick it off by asking if you could share a little bit about the paper and the motivation for studying this problem and the context (especially as it connects Foundation Models to different data domains and the way data plays a role in that), and then a little bit about what you did and what the results were.
Ananya Kumar: Sounds great. I think one aspect of these Foundation Models like BERT, GPT-3, or SimCLR that I found very cool is how robust they are. If you take a foundation model, like, say, SimCLR or CLIP, and you fine-tune it on data from one domain, like data from North America, and you test it on other domains where you’ve only seen unlabeled data—so you test it on new countries—it does remarkably well. It leads to state-of-the-art results on a lot of these real-world robustness data sets like satellite data and wildlife conservation. I was really interested in understanding: how does this happen? How does it lead to such good robustness? And does the data you pre-train on play a big role?
We studied this in a controlled setting where you have unlabeled data from a few domains and labeled data from one domain. You pre-train on all the unlabeled data. So you can imagine you pre-train on satellite data around the world and then you fine-tune on data from North America, and then you test it everywhere in the world where labels might be scarce.
We found that this works really well and it works very differently from conventional intuitions in the domain adaptation literature. It doesn’t merge the features or representations learned from different countries or domains. Instead, if the data satisfies some good properties, it disentangles the information from different domains and the class information you care about.
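To make the setup concrete, here is a minimal sketch (in PyTorch) of the protocol Ananya describes: an encoder pre-trained on unlabeled data from all domains is frozen, a task head is trained on labels from a single source domain, and the model is then evaluated on a target domain that was never labeled. The tiny encoder, shapes, and synthetic tensors below are illustrative stand-ins, not the architecture or data from the paper.

```python
import torch
import torch.nn as nn

# Stand-in for an encoder that was contrastively pre-trained on unlabeled
# data from *all* domains. Here it is frozen ("linear probing").
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                        nn.Linear(128, 64))
encoder.requires_grad_(False)

head = nn.Linear(64, 10)                      # task head, trained on source labels only
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Labeled data from the source domain (e.g. North America); synthetic here.
x_source = torch.randn(256, 3, 32, 32)
y_source = torch.randint(0, 10, (256,))

for _ in range(100):                          # fine-tune the head on source labels
    opt.zero_grad()
    loss = loss_fn(head(encoder(x_source)), y_source)
    loss.backward()
    opt.step()

# Evaluate on a target domain where we only ever saw unlabeled data.
x_target = torch.randn(256, 3, 32, 32)
y_target = torch.randint(0, 10, (256,))
with torch.no_grad():
    acc = (head(encoder(x_target)).argmax(dim=1) == y_target).float().mean()
print(f"target-domain accuracy: {acc.item():.3f}")
```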
AR: That’s super interesting. What does the work illuminate in terms of where this works and where it falls short? In other words, if you’re just looking at it from a practitioner’s lens, and say I have a model that’s pre-trained on all of the sub-distributions that I might care about in an unsupervised or self-supervised way, and I want to try this out-of-domain, zero-shot generalization, where can I expect it to work versus where should I expect that I’m actually going to have to label data on the new sub-distribution?
AK: So I think the highest-level point is you need unlabeled data from the domain you care about. If you don’t have any unlabeled data—if the data you’re testing on is just, I don’t know, Mars, and you don’t have any data from that—then you can’t do it.
The second thing we look at is this method called contrastive learning. The key player there is augmentations. You take augmentations of the same image, like tiny crops of the same image, and you map them to similar representations, while augmentations of different images get pushed apart.
You can view these augmentations as connecting different domains. You can imagine that there are images in the US and images in Asia that, if you apply augmentations, could look similar. The key property you need is some connectivity between these domains, induced by the augmentations.
Basically, if you have data from North America, and no matter how you augment it, it can never look anything like data from Asia, then you’re not gonna expect this method to work well. But if there is some connectivity, then you can expect it to work well. There are some more technical details, but that’s the gist of it.
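For readers who want to see the objective Ananya is describing, here is a minimal sketch of an NT-Xent-style contrastive loss: two augmented views of the same image are pulled together, and views of different images are pushed apart. The fixed crop, tiny encoder, and synthetic images are illustrative assumptions, not the augmentations or architecture from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style loss. z1, z2: (N, d) representations of two
    augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, d)
    sim = z @ z.T / temperature                          # pairwise similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))           # ignore self-similarity
    # The positive for row i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Illustrative "augmentations": crops that can make an image from one domain
# resemble an image from another, which is the connectivity the paper relies on.
images = torch.rand(8, 3, 32, 32)
crop = lambda x: x[:, :, 2:30, 2:30]                     # a crude fixed crop as a stand-in
view1 = crop(images)
view2 = crop(images + 0.05 * torch.randn_like(images))
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 28 * 28, 64))
loss = contrastive_loss(encoder(view1), encoder(view2))
print(loss.item())
```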
AR: So, one implication, just to echo back what you said, is that these kinds of approaches or this kind of framing is applicable when you have access to all the unlabeled data, but you’re bottlenecked on the labeling of the data, right? It’s not like you’re going from Earth to Mars or from Reddit to complex medical documents, but it’s more like you have both Reddit and PubMed in your unsupervised training set, but you only have labels for one of those subdomains.
Many settings have plentiful access to unlabeled data, and I think we’re gonna see much more of that as the architecture—especially in the Foundation Model world—opens up and we find ways to build smaller models and build them on-prem and build them on specialized data sets. You’re going to see probably more settings where you have access to the unlabeled data, and you’re just blocked on the labeling of the data. Is that what you have in mind?
AK: That’s exactly right. Yeah. We’re envisioning that you have tons and tons of unlabeled data—like satellite data from around the world, where satellites keep going around collecting data, or cameras around the world collecting wildlife conservation data, or data from all across the internet.
AR: And you have access to foundation models trained on this data. So you’ll have the unlabeled data represented in the training set. But you won’t necessarily have labeled data for the specific target task you’re trying to accomplish.
AK: Exactly.
AR: That, seems super relevant and definitely mirrors what at least we’re seeing from our perspective. And then the second element is super interesting.
We did a little work on this back in 2017. We thought about this idea of transformation functions for data augmentation. You know, can we automatically figure out the right number of degrees to rotate the digits to improve performance, or the right kinds of crops to apply? A lot of what this came down to was automated in terms of applying and tuning the augmentations, but it did require someone to engineer the types of transformation functions, the types of core augmentations, like you said. What types of building blocks, what types of augmentations, crops, rotations, shears, dictionary replacements, whatever it might be, are gonna be able to get you from one domain to the other?
It’s a little bit of a leading question, but in your experience carrying out this research, how much does this require or rely on engineering of these augmentations to, in your language, connect these spaces, versus how much of it can be done just out of the box?
AK: We mostly looked at manually choosing these augmentations, but currently we’re working on using these criteria to automatically optimize the augmentations you choose. So I think that’s a great direction for future work.
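As a purely illustrative example of what “manually choosing these augmentations” looks like in code, the snippet below builds a small torchvision augmentation pipeline; the specific operations and parameter values are assumptions for illustration, not the ones studied in the paper.

```python
import torch
from torchvision import transforms

# A hand-engineered augmentation pipeline: the *types* of transforms (crop,
# rotation, color jitter) and their parameter ranges are chosen manually.
# Whether two domains become "connected" depends on choices like these.
augment = transforms.Compose([
    transforms.RandomResizedCrop(size=32, scale=(0.6, 1.0)),  # crop strength
    transforms.RandomRotation(degrees=15),                    # rotation range
    transforms.ColorJitter(brightness=0.4, contrast=0.4),     # photometric shift
])

image = torch.rand(3, 32, 32)                    # stand-in for a real image
view1, view2 = augment(image), augment(image)    # two views for contrastive learning
```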
AR: Very cool. Very cool. Well, I’m sure you’ll talk about that much more in the summit talk coming up, on June 17. Ananya, thank you so much for joining us and for sharing more on your awesome work.
AK: Yeah. Thank you so much for having me, Alex.
You can register for a live demo of Snorkel Flow on February 16, which will feature the platform’s new FM capabilities.