Essential Guide to Weak Supervision

The founding team of Snorkel AI has spent over half a decade—first at the Stanford AI Lab and now at Snorkel AI—researching weak supervision (WS) and other techniques for breaking through the biggest bottleneck in AI: the lack of labeled training data.

This research has resulted in the Snorkel research project and 150+ peer-reviewed publications. Snorkel’s technology, which applies weak supervision, has been developed and deployed with Google, Intel, DARPA, Stanford Medicine, and more.

Snorkel Flow is a data-centric platform for building AI applications that we built to make weak supervision accessible and performant. Snorkel Flow is used by Fortune 500 enterprises such as Chubb and BNY Mellon, as well as several government agencies.

What is weak supervision and how does weak supervision work?

Weak supervision is an approach to machine learning in which high-level and often noisier sources of supervision are used to create much larger training sets much more quickly than could otherwise be produced by manual supervision (i.e. labeling examples manually, one by one).

If you have several high-level, scalable, but potentially noisy sources of signal, you can combine them into a single, stronger source of supervision. At Snorkel AI, we use labeling functions to do this.
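As a minimal sketch of what labeling functions can look like, the plain-Python example below applies three heuristics to a few messages. The spam/ham task, the function names, and the `ABSTAIN` convention are assumptions made for illustration; this is not the Snorkel Flow API.

```python
# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Heuristic: messages containing a link are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    # Heuristic: prize-style vocabulary suggests spam.
    lowered = text.lower()
    return SPAM if any(w in lowered for w in ("prize", "winner", "free")) else ABSTAIN

def lf_short_reply(text):
    # Heuristic: very short messages are usually ordinary replies.
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Return a label matrix: one row per text, one column per labeling function."""
    return [[lf(t) for lf in lfs] for t in texts]

texts = [
    "You are a WINNER, claim your prize at http://example.com",
    "ok see you then",
    "Can we move the meeting to Thursday afternoon instead?",
]
label_matrix = apply_lfs(texts)
for text, row in zip(texts, label_matrix):
    print(row, text)
```

Each function votes only where its heuristic applies and abstains elsewhere, so the third message receives no votes at all. Coverage gaps like this are normal and are one reason several noisy sources are combined.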

By observing when and where these different labeling functions agree or disagree with one another, you can automatically learn—in unsupervised ways—when, where, and how much to trust each of them. You can thus learn their areas of expertise, and the overall level of expertise, so that when you combine their votes you end up with the highest quality label possible for each data point.
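To make the idea of learning trust from agreement concrete, here is a small sketch of one possible approach: a method-of-moments estimate over three labeling functions. It assumes binary labels in {-1, +1}, no abstentions, labeling functions that err independently of one another given the true label, and accuracies above chance. The helper names are hypothetical, and this is not necessarily how Snorkel Flow's label models work.

```python
import math
import random

def simulate_votes(n, accuracies, seed=0):
    """Synthetic data: each labeling function is right with its own fixed probability."""
    rng = random.Random(seed)
    truth, votes = [], []
    for _ in range(n):
        y = rng.choice((-1, 1))
        truth.append(y)
        votes.append([y if rng.random() < p else -y for p in accuracies])
    return truth, votes

def mean_product(votes, i, j):
    # Average agreement between functions i and j: +1 when they agree, -1 when they differ.
    return sum(row[i] * row[j] for row in votes) / len(votes)

def estimate_accuracies(votes):
    """Estimate the accuracy of exactly three labeling functions without ground truth."""
    m01 = mean_product(votes, 0, 1)
    m02 = mean_product(votes, 0, 2)
    m12 = mean_product(votes, 1, 2)
    # Under the independence assumption, E[l_i * l_j] = a_i * a_j with a_i = 2 * acc_i - 1,
    # so each a_i can be solved for from the three pairwise agreement rates.
    a = [
        math.sqrt(abs(m01 * m02 / m12)),
        math.sqrt(abs(m01 * m12 / m02)),
        math.sqrt(abs(m02 * m12 / m01)),
    ]
    return [min(1.0, (1 + ai) / 2) for ai in a]

def weighted_vote(row, accuracies):
    """Combine votes, weighting each function by the log-odds of its accuracy."""
    score = 0.0
    for vote, p in zip(row, accuracies):
        p = min(max(p, 1e-3), 1 - 1e-3)  # keep the log-odds finite
        score += math.log(p / (1 - p)) * vote
    return 1 if score >= 0 else -1

truth, votes = simulate_votes(50_000, accuracies=[0.9, 0.8, 0.7])
estimated = estimate_accuracies(votes)  # the true labels are never consulted here
combined = [weighted_vote(row, estimated) for row in votes]
print([round(p, 2) for p in estimated])
```

The true labels are generated only so the estimates can be checked afterwards. The estimation step itself looks only at how often pairs of labeling functions agree.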

A diagram giving an overview of how weak supervision works.

When should weak supervision be used?

Weak supervision enables the creation of very large training sets very quickly. If your particular problem would be better addressed with 100,000 “pretty good” labels than with 100 “perfect” labels, it may be worth looking at higher-level interfaces for gathering more data.

Additionally, weak supervision is well suited to any situation in which you need to adapt and iterate regularly and rapidly. If the distribution of your data shifts frequently, as in an adversarial setting like fraud detection, or if your needs simply change often, weak supervision lets you do anything from adding novel classes to incorporating and reflecting new realities about your problem.

Weak supervision vs. rule-based classifiers

Weak supervision has some similarities to rule-based classifiers, and some very important differences. The obvious similarity is that the inputs to each look like rules (i.e., simple functions that output labels or predictions). The important difference is that a rule-based classifier stops there: the rules are the classifier. Such systems are generally brittle because they do not generalize to other examples, even ones that are very similar to those labeled by one or more rules.

With weak supervision, on the other hand, the rules (or “labeling functions”) are used to create a training set for a machine-learning-based model. That model can be much more powerful, utilize a much richer feature set, and take advantage of other state-of-the-art techniques in machine learning, such as transfer learning from foundation models. As a result, the model is generally much more robust than a corresponding rule-based classifier.

Each labeling function suggests training labels for multiple unlabeled data points, based on human-provided subject matter expertise. A label model (Snorkel Flow includes multiple variants optimized for different problem types) aggregates those weak labels into one training label per data point to create a training set. The ML model is trained on that training set and learns to generalize beyond just those data points that were labeled by labeling functions.
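The pipeline described above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the keyword rules, the example texts, a simple majority vote standing in for the label model, and a small naive Bayes classifier standing in for the ML model are all hypothetical choices, not what Snorkel Flow uses.

```python
import math
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def keyword_lf(keyword, label):
    """Build a labeling function that votes `label` when `keyword` appears as a word."""
    def lf(text):
        return label if keyword in text.split() else ABSTAIN
    return lf

LFS = [keyword_lf("free", SPAM), keyword_lf("prize", SPAM),
       keyword_lf("meeting", HAM), keyword_lf("attached", HAM)]

def majority_vote(text):
    """Stand-in label model: most common non-abstain vote, abstaining on ties."""
    votes = Counter(v for v in (lf(text) for lf in LFS) if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    (top, n), *rest = votes.most_common()
    return ABSTAIN if rest and rest[0][1] == n else top

def train_naive_bayes(texts, labels):
    """Count words per class; assumes both classes occur in the training set."""
    word_counts = {HAM: Counter(), SPAM: Counter()}
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = set(word_counts[HAM]) | set(word_counts[SPAM])
    return word_counts, Counter(labels), vocab

def predict(model, text):
    word_counts, doc_counts, vocab = model
    scores = {}
    for label in (HAM, SPAM):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

unlabeled = [
    "claim your free prize now",
    "free prize waiting click now",
    "win a free vacation click here",
    "meeting moved to monday morning",
    "agenda for monday meeting attached",
    "please review the attached report",
]
weak_labels = [majority_vote(t) for t in unlabeled]
train = [(t, y) for t, y in zip(unlabeled, weak_labels) if y != ABSTAIN]
model = train_naive_bayes([t for t, _ in train], [y for _, y in train])

unseen = "click here to claim your vacation"
rule_prediction = majority_vote(unseen)   # no keyword rule fires on this text
model_prediction = predict(model, unseen)
print(rule_prediction, model_prediction)
```

The rules alone abstain on the unseen message because it contains none of the four keywords. The trained model still labels it as spam, since it picked up other words that co-occurred with the weak spam labels.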

How Snorkel Flow makes weak supervision practical

Snorkel AI has applied weak supervision to many problems over the years and we have learned a lot about which features and workflows make it most accessible and practical for users. We built Snorkel Flow specifically with that experience in mind.

Snorkel Flow is a data-centric platform for building AI applications powered by weak supervision and other modern machine learning techniques. In Snorkel Flow, users manage data throughout the full AI lifecycle by writing simple programs (labeling functions) to label, manipulate, and monitor training data. These programmatic inputs are modeled and integrated using theoretically grounded statistical techniques, and are accessible to developers and non-developers alike via a no-code UI and a Python SDK.

Snorkel Flow provides you with the ability to easily express many different types of signal, whether that is importing existing labels or models that you already have and applying them, or allowing you to write new labeling functions that are rule- or heuristic-based. Snorkel Flow then gives you access to the label-model algorithms that we have developed. They automatically combine these different scalable (but potentially noisy) sources of supervision to create high-quality labels for each of your data points.

Lastly, Snorkel Flow equips you with ready-made infrastructure for the application of these functions. The platform guides you through the process and supplies integrated model training so that you can loop back and make adjustments yourself to your weak supervision sources as you go.

Use cases for weak supervision

Weak supervision can be applied to many problems. The Snorkel AI team has applied it to text data (long and short), conversations, time series, PDFs, images, videos, and more. So long as domain-relevant resources exist or labeling heuristics can be described, weak supervision can be applied. Some of the use cases include:

Text and document classification
Information extraction from unstructured text, PDF, HTML and more
Rich document processing
Structured data classification
Conversational AI and utterance classification
Entity linking
Image and cross-modal classification
Time series analysis
Video classification

How is weak supervision different from other approaches to machine learning?

Unsupervised learning

Unsupervised learning uses no labels. It is often useful for identifying structure in data, such as clusters, but it is not enough on its own to train a classifier.

Transfer learning

Transfer learning (TL) takes the knowledge you have gained in the pursuit of one task and applies it to another task. It is a very common approach found in nearly every pre-trained model that you want to fine-tune for your own purposes. But while pre-trained models are generally a good starting point, they are rarely an appropriate “end point” ready to perform well on your task out-of-the-box.

Semi-supervised learning

At first glance, semi-supervised learning (SSL) is quite similar to weak supervision. It uses a small amount of labeled data and a lot of unlabeled data to train a model. The primary difference, though, is that semi-supervised learning propagates knowledge (“based on what is already labeled, label some more”) whereas weak supervision injects knowledge (“based on your knowledge, label some more”). In a sense, semi-supervised learning “smooths the edges” of what you already know, as opposed to weak supervision, which discovers and addresses new uses.

Zero-shot learning

With zero-shot learning (ZSL), you apply a model to your task without giving it any task-specific labeled examples. The model is first trained on a task for which almost unlimited training data can be created; for example, for all the usable text on the internet, blank out words and predict them from context (this is called “language modeling”). In the process, the model learns lots of interesting things. It is often a surprisingly good approach out of the box, but rarely gets you through the “last mile” of your task.
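The reason almost unlimited training data is available for this kind of task is that the examples can be generated from raw text with no human labeling. The sketch below illustrates the idea by blanking out one word at a time. The function name and the `[MASK]` placeholder are assumptions for illustration; real systems typically mask a random subset of subword tokens rather than every whole word.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Turn one sentence into (context with a blank, missing word) training pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        # Replace the i-th word with the mask token and keep the rest as context.
        context = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(context), target))
    return examples

pairs = masked_examples("weak supervision creates training data")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

Every sentence yields as many training pairs as it has words, which is why this kind of pretraining scales with the amount of available text rather than with labeling effort.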

While none of the above techniques are a replacement for weak supervision, they are all fully compatible with it, and all are made available in Snorkel Flow.