What is Programmatic Labeling?
Programmatic labeling is an approach to labeling that breaks through the primary bottleneck limiting AI today: creating high-quality training sets in a way that is scalable, adaptable, and governable.
The primary difference between manual labeling and programmatic labeling is the type of input that the user provides. With manual labeling, user input comes in the form of individual labels, created one by one. With programmatic labeling, users instead create labeling functions, which capture labeling rationales and can be applied to vast amounts of unlabeled data and aggregated to auto-label large training sets. This approach leads to a number of benefits over manual labeling:
Scalability: Once you have written a labeling function, no additional human effort is required to label your data—be it thousands or millions of data points—resulting in training datasets that are orders of magnitude larger and/or faster to create than those produced via manual labeling.
Adaptability: When requirements change, data drifts or new error modes are detected, training sets need to be relabeled. With a manual labeling process, this means manually reviewing each affected data point a second, third, or tenth time, multiplying the cost of both time and money to develop and maintain a high-quality model. When you produce your labels programmatically, on the other hand, recreating all your training labels is as simple as adding or modifying a small, targeted number of labeling functions and re-executing them, which can now occur at computer speed, not human speed.
Governability: When labeling by hand, users leave no record of their thought process behind the labels they provide, making it difficult to audit what their labeling decisions were—both in general and on individual examples. This presents a challenge for compliance, safety, and quality control. With programmatic labeling, every training label can be traced back to specific inspectable functions. If bias or other undesirable behavior is detected in your model, you can trace that back to its source (one or more labeling functions) and improve or remove them, then regenerate your training set programmatically in minutes.
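The traceability claim above can be made concrete. When labels are produced programmatically, the votes of every labeling function on every data point form a label matrix, so tracing any training label back to its sources is a simple lookup. A minimal sketch in plain Python (the function names, label values, and matrix contents are all illustrative; -1 denotes an abstain):

```python
import numpy as np

# Each row of L holds the votes of every labeling function on one
# data point; -1 means that function abstained. Names are illustrative.
lf_names = ["lf_contains_link", "lf_known_sender", "lf_short_message"]
L = np.array([
    [ 1, -1,  0],   # data point 0
    [-1,  1,  1],   # data point 1
    [ 1,  1, -1],   # data point 2
])

def trace_label(L, i):
    """Return which labeling functions voted (did not abstain) on point i."""
    return [lf_names[j] for j in range(L.shape[1]) if L[i, j] != -1]

print(trace_label(L, 0))  # ['lf_contains_link', 'lf_short_message']
```

If one of these functions turns out to encode a bias, it can be edited or removed and the label matrix regenerated, which is exactly the audit-and-repair loop described above.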
What is a Labeling Function?
A labeling function is an arbitrary function that takes in a data point and either proposes a label or abstains. Nothing is assumed about the logic inside the function, which makes it a very flexible interface for incorporating domain knowledge from lots of different formats and sources.
Labeling functions can be quite diverse, ranging from simple heuristics such as looking for a specific keyword or phrase in a text field to more complex functions that wrap other models, perform a database lookup, or utilize embeddings.
Labeling functions do not need to be comprehensive or cover your entire dataset. For many applications, just having a few labeling functions per class is sufficient to create a training set with the information your model needs to perform well on your task as it generalizes beyond the labeling functions. Learn more.
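In code, a labeling function can be as simple as a plain Python function that returns a label or a special abstain value. A minimal sketch (the spam task, label values, and heuristics here are illustrative assumptions, not a prescribed API):

```python
# Label constants; -1 conventionally means "abstain".
ABSTAIN, SPAM, NOT_SPAM = -1, 1, 0

def lf_contains_link(text):
    """Heuristic: propose SPAM if the text contains a URL, else abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    """Heuristic: propose NOT_SPAM for short, greeting-like messages."""
    return NOT_SPAM if len(text) < 20 and "hi" in text.lower() else ABSTAIN

print(lf_contains_link("check out http://example.com"))  # 1 (SPAM)
print(lf_contains_link("see you at lunch"))              # -1 (abstain)
```

Note that each function only fires where its rationale applies and abstains everywhere else; coverage comes from combining many such functions, not from any single one.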
Types of Labeling Functions
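The types mentioned earlier, from simple keyword heuristics to lookups against external resources to wrappers around other models, all share the same interface. A hedged sketch of each (the sentiment task, label values, lookup table, and stand-in model are illustrative assumptions):

```python
ABSTAIN, POSITIVE, NEGATIVE = -1, 1, 0

# 1. Keyword heuristic
def lf_keyword(text):
    return POSITIVE if "excellent" in text.lower() else ABSTAIN

# 2. Lookup against an external resource (a toy in-memory table here,
#    standing in for a database or knowledge base)
NEGATIVE_TERMS = {"refund", "broken", "terrible"}

def lf_lookup(text):
    return NEGATIVE if any(t in text.lower() for t in NEGATIVE_TERMS) else ABSTAIN

# 3. Wrapping another model (a placeholder scoring function here,
#    standing in for a real pretrained model's prediction)
def toy_sentiment_score(text):
    return 0.9 if "love" in text.lower() else 0.5

def lf_model(text):
    return POSITIVE if toy_sentiment_score(text) > 0.8 else ABSTAIN
```

Because every type reduces to "data point in, label or abstain out", they can all be applied and aggregated with the same machinery.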
Programmatic Labeling vs. Manual Labeling
Most training data created today is labeled manually, whether internally or through crowdsourced services. However, organizations face the following key challenges with manually labeled training data:
- Manual labeling is painfully slow. Even with an unlimited labor force or budget, it can take person-months or years to deliver the necessary training data and train models of production-grade quality.
- Annotation eats up a significant portion of the AI development budget. Manual labeling is inherently expensive: its cost scales linearly at best, and it is often error-prone. Data science teams rarely have adequate annotation budgets.
- Complex datasets need subject matter expertise. Much training data can only be labeled by highly trained subject matter experts (SMEs), e.g., doctors, legal analysts, or network technicians, who often also need to be well-versed in a specific organization's goals and datasets. SMEs are scarce, and their time is too expensive to spend labeling each data point by hand.
- Adapting applications often requires relabeling from scratch. Most organizations deal with constant change in input data, upstream systems and processes, and downstream goals and business objectives, rendering existing training data obsolete and forcing enterprises to relabel it constantly.
- AI applications are hard to govern. Most organizations need to be able to audit how their data is labeled and, consequently, what their AI systems learn from. Even when outsourcing labeling is an option, performing essential audits on hand-labeled data is nearly impossible.
Programmatic labeling implemented with Snorkel Flow enables enterprises to:
- Reduce the time and cost of labeling through automation.
- Translate subject matter expertise into training data more efficiently with labeling functions.
- Adapt applications to data drift or new business needs with a few push-button actions or code changes.
- Make training data easier to audit and manage, and build governable AI applications.
Programmatic Labeling Using Snorkel Flow
For over half a decade, the Snorkel AI team has applied the concept of programmatic labeling to a wide array of real-world problems and learned much along the way. Based on dozens of deployments of this technology, we built Snorkel Flow, a data-centric AI platform, with the goal of making this (and other state-of-the-art ML techniques) intuitive, performant, and accessible.
Snorkel Flow facilitates full workflows based on programmatic labeling. Within the platform, you can:
- Explore your data at varying granularities (e.g., individually or as search results, embedding clusters, etc.)
- Write no-code Labeling Functions (LFs) using templates in a GUI or custom code LFs in an integrated notebook environment
- Auto-generate LFs based on small labeled data samples
- Use programmatic active learning to write new LFs for unlabeled or low-confidence data point clusters
- Receive prescriptive feedback and recommendations to improve existing LFs
- Execute LFs at massive scale over unlabeled data to auto-generate weak labels
- Auto-apply best-in-class label aggregation strategies intelligently selected from a suite of available algorithms based on your dataset’s properties
- Train out-of-the-box industry standard models over the resulting training sets with one click in platform, or incorporate custom models via Python SDK
- Perform AutoML searches over hyperparameters and advanced training options
- Engage in guided and iterative error analysis across both model and data to improve model performance
- Deploy final models as part of larger applications using your chosen production serving infrastructure
- Monitor model performance overall and on specific dataset slices of interest
- Adapt easily to new requirements or data distribution shifts by adjusting labeling functions and regenerating a training set in minutes.
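The core apply-and-aggregate steps in the workflow above can be sketched in plain Python. Snorkel Flow selects among far more sophisticated aggregation algorithms based on your dataset's properties; the majority vote below is only a minimal stand-in, and the toy labeling functions and data are illustrative assumptions:

```python
import numpy as np

ABSTAIN = -1

def apply_lfs(lfs, data):
    """Build the label matrix: one row per data point, one column per LF."""
    return np.array([[lf(x) for lf in lfs] for x in data])

def majority_vote(L, n_classes=2):
    """Stand-in for the label-aggregation step: majority vote over
    non-abstaining LFs; abstain when no LF votes on a point."""
    labels = []
    for row in L:
        votes = row[row != ABSTAIN]
        if len(votes) == 0:
            labels.append(ABSTAIN)
        else:
            labels.append(int(np.bincount(votes, minlength=n_classes).argmax()))
    return labels

# Toy LFs and unlabeled data (illustrative):
lfs = [
    lambda x: 1 if "http" in x else ABSTAIN,   # link -> class 1
    lambda x: 0 if len(x) < 15 else ABSTAIN,   # short -> class 0
]
data = ["visit http://a.io now please", "see you soon", "hello"]
L = apply_lfs(lfs, data)
print(majority_vote(L))  # [1, 0, 0]
```

Because the expensive human input lives in the labeling functions rather than the labels, re-running this pipeline after editing an LF regenerates the whole training set at computer speed, which is the adaptability property described earlier.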
Approaches for Dealing with Limited Training Data Other Than Programmatic Labeling
Outsourcing / Crowdsourcing
Model-Assisted Labeling
Active Learning
When Should Programmatic Labeling Be Used?
- When training labels need to be collected quickly.
- When lots of training data is required for best results (most models improve with more data, but especially deep learning models).
- When labeling requires domain expertise that is difficult to crowdsource.
- When privacy concerns prevent data from being shared externally.
- When the task experiences frequent shifts in schema or product needs (and therefore requires frequent relabeling).
- When model governance or transparency is a priority. If you detect a bias or error mode, programmatic labeling enables you to quickly identify what labeling sources are responsible and rapidly address the issue.
Use Cases of Programmatic Labeling
Programmatic labeling can be applied to most supervised learning problems. The Snorkel team has applied it to text data (long and short), conversations, time series, PDFs, images, videos, and more. The “labeling function” abstraction is flexible enough that the same workflow and framework applies in all cases—just swap out how you view your data, what types of labeling functions you use, and whichever model architecture is most appropriate for your data type.
Some of the use cases include:
- Text and document classification
- Information extraction from unstructured text, PDFs, HTML, and more
- Rich document processing
- Structured data classification
- Conversational AI and utterance classification
- Entity linking
- Image and cross-modal classification
- Time series analysis
- Video classification
Where to Learn More about Programmatic Labeling
Other Resources
Accelerate AI with Programmatic Labeling
It’s clear that programmatic labeling can be instrumental not only in automating the labeling process while keeping humans in the loop, but also in accelerating AI development overall.
But where do you start? Rather than dealing with annotation guides and contracts for crowdsourced labeling, see Snorkel Flow in action.
With Snorkel Flow, Fortune 1000 organizations such as Chubb, BNY Mellon, Genentech, and more have built accurate and adaptable AI applications fast by putting the power of programmatic labeling to use.