With models like Stable Diffusion and ChatGPT going viral in recent months, interest in using foundation models for production AI development has skyrocketed. Snorkel recently hosted a Foundation Model Virtual Summit, where researchers from academia and industry discussed recent findings in this fast-moving space, and highlighted how data-centric workflows allow ML practitioners to translate the potential of foundation models into tangible, real-world benefits.

As we’ve seen Snorkel Flow’s latest Foundation Model capabilities applied to a diverse set of customer problems, we’ve highlighted a number of key benefits of combining foundation model outputs with weak supervision, including:

  • Faster model development
  • Fewer ground truth labels needed
  • Lower inference costs

In this post, we’ll discuss one additional benefit of using foundation model outputs in a data-centric workflow: the ability to quickly and easily combine outputs from many different foundation models in the same task.

Buttons on soundboard, our analogy for combining foundation models with weak supervision
Image credit: Anthony Roberts

A Growing Number of Foundation Models

The number of available open-source models has exploded. Hugging Face, a popular repository of open-sourced foundation models, now contains over 138,000 distinct models, and developers add more every day. These models are optimized for a variety of tasks (e.g., text classification, question answering, object detection) and have been trained on a wide range of datasets, from general sources like Wikipedia to domain-specific corpora like PubMed or legislation and court cases.

Without weak supervision, AI developers have a couple of options for using foundation model outputs:

  1. Identify a single foundation model best suited for their task.
  2. Develop a custom way to combine outputs from different foundation models.

With weak supervision, AI developers have a third option: they can create labeling functions from different foundation models and let Snorkel Flow’s label model determine the best way to aggregate and de-noise their outputs.
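To make the aggregation step concrete, here is a toy majority-vote aggregator over labeling function outputs (with -1 as the conventional abstain value). This is only a rough stand-in for illustration: Snorkel’s actual label model goes much further, estimating each labeling function’s accuracy and correlations in order to weight and de-noise votes.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(lf_votes):
    """Toy aggregator: pick the most common non-abstain label.

    Snorkel's label model is more sophisticated, learning each labeling
    function's accuracy and correlations; a plain majority vote merely
    illustrates the aggregation step.
    """
    votes = [v for v in lf_votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # every labeling function abstained on this example
    return Counter(votes).most_common(1)[0][0]
```

With this in hand, each foundation model below only needs to be wrapped as a labeling function; the aggregation layer handles the rest.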

Example: News Classification

To illustrate the value of combining multiple foundation models, let’s consider an application focused on classifying news articles into the following categories: Politics, Sports, Business, and Other.

Even for this straightforward task, a number of different foundation models could provide useful signal for programmatically labeling a large training set.

Foundation Model #1: bart-large-mnli

Bart-large-mnli is optimized for zero-shot classification—the ability to assign data to specific classes without providing any ground truth labels. In a previous post, we discussed research that showed how combining human-generated labeling functions (LFs) with bart-large-mnli LFs substantially improved model performance.

Bart-large-mnli takes in a passage of text and a set of potential classes, returning probabilities that an individual article belongs to each of the potential classes.
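One way to wrap this output in a labeling function is to vote for the top class only when the model is confident, and abstain otherwise. The sketch below operates on the dictionary format returned by Hugging Face’s zero-shot-classification pipeline ({"labels": [...], "scores": [...]}, sorted by descending score); the 0.7 threshold is an illustrative choice, not a recommendation.

```python
ABSTAIN = -1
CATEGORIES = ["Politics", "Sports", "Business", "Other"]

def bart_mnli_lf(result, threshold=0.7):
    """Vote for the top zero-shot class if the model is confident enough."""
    top_label, top_score = result["labels"][0], result["scores"][0]
    if top_score >= threshold:
        return CATEGORIES.index(top_label)
    return ABSTAIN  # not confident enough; let other LFs weigh in

# The result dict would come from something like:
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
#   result = clf(article_text, candidate_labels=CATEGORIES)
```

Abstaining on low-confidence predictions lets this labeling function contribute signal where the model is sure without propagating its mistakes elsewhere.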

A few examples show how this model produces reasonable results to jump-start the performance of our classifier:

From just a few examples, we see that bart-large-mnli can contribute to our model, but may fall short as a sole source of information. Its confidence scores are modest, and it failed to classify an article about trade as “Business.”

However, using weak supervision, we can create labeling functions with this information and continue finding additional useful patterns to improve the quality of our training set.

Foundation Model #2: t0pp

t0pp (t-zero-plus-plus) was trained on a large collection of natural language tasks spanning many domains, giving it a flexible, general set of capabilities. As a result of this training, it has memorized information about common entities useful for news classification, as the following sets of prompts and responses illustrate:

  • Prompt: Who is Joe Biden? Response: Vice President of the United States
  • Prompt: Who is Andy Murray? Response: British tennis player
  • Prompt: What is Apple? Response: A computer company

Like all foundation models, t0pp sometimes returns dated or inaccurate answers; Joe Biden, after all, is no longer Vice President of the United States. But it offers enough information to extract useful knowledge for our news classification task by posing it as a multiple-choice question, one of the tasks that t0pp was trained on. Here are a couple of examples:

  • Prompt: What is the following article about? Select from Politics, Sports, Business, and Other. {Text from the article Biden Promises Federal Government Will Assist Storm-Ravaged California}
    Response: Politics
  • Prompt: What is the following article about? Select from Politics, Sports, Business, and Other. {Text from the article Andy Murray wins another five-set epic}
    Response: Sports
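A t0pp labeling function built this way has two pure pieces worth separating out: building the multiple-choice prompt and mapping the model’s free-text answer back to a class. Both are sketched below with illustrative templates; the actual model call (Hugging Face’s text2text-generation pipeline on "bigscience/T0pp", an 11-billion-parameter model) is left as a comment because it is heavy to load.

```python
ABSTAIN = -1
CATEGORIES = ["Politics", "Sports", "Business", "Other"]

def build_prompt(article_text):
    """Pose the classification task as a multiple-choice question."""
    choices = ", ".join(CATEGORIES)
    return (f"What is the following article about? "
            f"Select from {choices}.\n\n{article_text}")

def parse_t0pp_answer(answer):
    """Map the model's free-text answer back to a class id, else abstain."""
    answer = answer.strip().rstrip(".")
    for i, cat in enumerate(CATEGORIES):
        if answer.lower() == cat.lower():
            return i
    return ABSTAIN

# The answer string would come from something like:
#   from transformers import pipeline
#   t0pp = pipeline("text2text-generation", model="bigscience/T0pp")
#   answer = t0pp(build_prompt(article_text))[0]["generated_text"]
```

Parsing defensively (and abstaining on anything that isn’t an exact category) keeps occasional off-script answers from polluting the training labels.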

Creating a labeling function from t0pp would provide additional signal on top of what bart-large-mnli provides for our model, improving our end model without having to spend any time determining how to best combine the outputs from the different models.

Foundation Model #3: distilbert-base-cased-distilled-squad

Distilbert-base-cased-distilled-squad is an extractive question answering model: given a question and a passage of text, the model will determine the piece of text within that passage that best answers the question, along with a score indicating how confident it is that the extracted text answers the question.

Although this model is more typically used for information extraction tasks (e.g., detect all companies mentioned in a news article), the flexible nature of labeling functions allows us to extract usable signal for our news classification task.

Suppose that after an initial round of model development using foundation models and other sources of knowledge, we detect a common error mode: articles discussing news about large companies are often classified as “Other,” even though they should be classified as “Business.” To remedy this error, we could create a labeling function using distilbert-base-cased-distilled-squad by posing the question “What is the company?” If the model returns a result with high confidence, our labeling function will vote to label the article as “Business,” since there is a strong chance that the article is discussing a particular company.

For example:

  • Asking “What is the company?” with the first few paragraphs of this recent business article returns “FTX” with high confidence (0.754).
  • Asking “What is the company?” with the first few paragraphs of this recent sports article returns a nonsensical answer of “Nike-tennis get-up” with a low confidence score (0.106).
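This error-targeting labeling function reduces to a confidence threshold on the question-answering output. The sketch below operates on the dictionary format returned by Hugging Face’s question-answering pipeline ({"score": float, "answer": str, ...}); the 0.5 threshold is illustrative and would be tuned against the observed error mode in practice.

```python
ABSTAIN = -1
BUSINESS = 2  # index of "Business" in [Politics, Sports, Business, Other]

def company_qa_lf(qa_result, threshold=0.5):
    """Vote Business when the QA model confidently extracts a company name."""
    if qa_result["score"] >= threshold:
        return BUSINESS
    return ABSTAIN  # low confidence: likely no company is central here

# The qa_result would come from something like:
#   from transformers import pipeline
#   qa = pipeline("question-answering",
#                 model="distilbert-base-cased-distilled-squad")
#   qa_result = qa(question="What is the company?", context=article_text)
```

On the two articles above, the confident “FTX” result (0.754) would vote Business, while the low-confidence “Nike-tennis get-up” answer (0.106) would abstain.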

When combined with results from other foundation models, using an extractive question-answering model in a targeted fashion can sometimes help address specific types of errors.

Foundation Model #4: bart-large-cnn

The bart-large-cnn model was fine-tuned on news articles—very relevant for a news classification application—to generate short summaries. Although summarization models aren’t typically used to help with classification, they can be useful for targeted model refinement.

Suppose that we discover another common error mode in our model: when news articles mention a sport incidentally (e.g., describing a celebrity as a “former football player”), our model sometimes mistakenly labels the article as “Sports” when it should be one of our other categories.

To address this, we could create a labeling function based on the output of bart-large-cnn: if the summary of the news article mentions a set of common sports (e.g., tennis, baseball, basketball, football), then vote to label the document as Sports. This approach takes advantage of how summarization models distill an article down to its primary focus, providing a way to tell the model when a sport is or is not the main subject of an article.

As an example, here’s the summary this model produces from a recent article about a tennis tournament:

“Sir Andy Murray beats Thanasi Kokkinakis in five sets at the Australian Open. Murray’s win, 4-6, 6-7 (4), 7-6 (5), 6-3, 7-5, began Thursday and finished Friday at 4:05 a.m. It was the third-latest recorded finish in the history of professional tennis.” (emphasis added)

A labeling function would allow us to use the presence of a known sport (tennis) in this summary to more confidently label this as a Sports article.
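The summary-based labeling function then reduces to a keyword check over the summary text. In the sketch below, the sport keyword list is a small illustrative sample, and the summary would come from bart-large-cnn via Hugging Face’s summarization pipeline (shown as a comment).

```python
ABSTAIN = -1
SPORTS = 1  # index of "Sports" in [Politics, Sports, Business, Other]
SPORT_KEYWORDS = {"tennis", "baseball", "basketball", "football", "soccer"}

def summary_sports_lf(summary):
    """Vote Sports only when a sport appears in the article's summary."""
    words = {w.strip(".,!?\u201c\u201d").lower() for w in summary.split()}
    if words & SPORT_KEYWORDS:
        return SPORTS
    return ABSTAIN  # sport mentioned only in passing, if at all

# The summary would come from something like:
#   from transformers import pipeline
#   summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
#   summary = summarizer(article_text)[0]["summary_text"]
```

Because the check runs on the summary rather than the full article, an incidental “former football player” mention in the body text never reaches the keyword match.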

Foundation Model #5: GPT-3

GPT-3 is best known as the original backbone of ChatGPT, but the underlying model is still available and useful. OpenAI trained GPT-3 on a diverse corpus of roughly 500 billion tokens sourced from five different data sets, and the off-the-shelf version performs remarkably well on a variety of tasks.

Building labeling functions with GPT-3 would closely mirror how you would use t0pp: feed it text followed by a request to assign the text to one of the specified categories. GPT-3 occasionally returns the wrong result, and can sometimes be sensitive to tiny changes in language, but it will be right more often than it’s wrong.
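A GPT-3 labeling function has the same shape as the t0pp one: build a constrained prompt, then map the completion back to a class. In the sketch below, the prompt template and parsing are illustrative; the API call is left as a comment (it requires an OpenAI key), and both the "text-davinci-003" model name and the legacy Completions endpoint reflect the GPT-3 era and are assumptions about the setup.

```python
ABSTAIN = -1
CATEGORIES = ["Politics", "Sports", "Business", "Other"]

def build_gpt3_prompt(article_text):
    """Constrain GPT-3 to answer with one of our categories."""
    choices = ", ".join(CATEGORIES)
    return (f"Classify the following news article as one of: {choices}.\n\n"
            f"Article: {article_text}\n\nCategory:")

def parse_gpt3_completion(completion):
    """Accept only an exact category name; otherwise abstain."""
    answer = completion.strip().rstrip(".").lower()
    for i, cat in enumerate(CATEGORIES):
        if answer == cat.lower():
            return i
    return ABSTAIN

# The completion string would come from something like:
#   import openai
#   resp = openai.Completion.create(model="text-davinci-003",
#                                   prompt=build_gpt3_prompt(article_text),
#                                   max_tokens=5, temperature=0)
#   completion = resp["choices"][0]["text"]
```

Setting temperature to zero and abstaining on anything off-script helps contain GPT-3’s occasional sensitivity to small changes in wording.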

A GPT-3 labeling function would provide additional signal, but it may not be worthwhile. In benchmark tests, t0pp achieved comparable performance to GPT-3. And, at one-sixteenth the size, t0pp runs faster and cheaper.

Aggregating Diverse Signals With Weak Supervision

Weak supervision’s strength has always been the ease with which it aggregates diverse signals: Snorkel Flow’s label model integrates pretty much any signal one could have about data, from keyword and regular expression patterns to GPT-3 outputs and knowledge bases.

The explosion of foundation models has only increased the value of weak supervision and Snorkel Flow. The platform lets users combine a variety of foundation models for a single task without needing to identify the best individual model or determine a custom way to combine them. AI developers can instead remain focused on improving the quality of their data, locating systematic errors, and finding the right sources of information to address those errors and improve their model’s performance.

Learn more

See what Snorkel can do to accelerate your data science and machine learning teams. Book a demo today.

Matt Hoffman is a machine learning engineer with more than a decade of experience in ML, research, data science, and analysis. Matt authored this article during his time as an MLE at Snorkel AI. Connect with Matt on LinkedIn.