The founding team of Snorkel AI has spent over half a decade—first at the Stanford AI Lab and now at Snorkel AI—researching programmatic labeling and other techniques for breaking through the biggest bottleneck in AI: the lack of labeled training data. This research has resulted in the Snorkel research project and 150+ peer-reviewed publications. Snorkel’s programmatic labeling technology has been developed and deployed with Google, Intel, DARPA, Stanford Medicine, and more.

Snorkel Flow is a data-centric platform for building AI applications that we built to make programmatic labeling accessible and performant. Snorkel Flow is used by Fortune 500 enterprises such as Chubb and BNY Mellon, as well as several government agencies.

What is Programmatic Labeling?

Programmatic labeling is an approach to data labeling that breaks through the primary bottleneck limiting AI today: creating high-quality training sets in a way that is scalable, adaptable, and governable.

The primary difference between manual labeling and programmatic labeling is the type of input that the user provides. With manual labeling, user input comes in the form of individual labels, created one by one. With programmatic labeling, users instead create labeling functions, which capture labeling rationales and can be applied to vast amounts of unlabeled data and aggregated to auto-label large training sets. This approach leads to a number of benefits over manual labeling.

Benefits of programmatic labeling

Scalability: Once you have written a labeling function, no additional human effort is required to label your data, be it thousands or millions of data points. The resulting training datasets are orders of magnitude larger and/or faster to create than those produced via manual labeling.

Adaptability: When requirements change, data drifts, or new error modes are detected, training sets need to be relabeled. With a manual labeling process, this means manually reviewing each affected data point a second, third, or tenth time, multiplying the cost in both time and money of developing and maintaining a high-quality model. When you produce your labels programmatically, on the other hand, recreating all your training labels is as simple as adding or modifying a small, targeted number of labeling functions and re-executing them, which happens at computer speed, not human speed.

Governability: When labeling by hand, users leave no record of their thought process behind the labels they provide, making it difficult to audit what their labeling decisions were—both in general and on individual examples. This presents a challenge for compliance, safety, and quality control. With programmatic labeling, every training label can be traced back to specific inspectable functions. If bias or other undesirable behavior is detected in your model, you can trace that back to its source (one or more labeling functions) and improve or remove them, then regenerate your training set programmatically in minutes.
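A minimal sketch of that traceability, assuming plain Python and hypothetical labeling functions (this is not Snorkel Flow's API): store, alongside each data point, the names of the functions that voted on it, so a suspect label can be traced to its source and regenerated once the function is fixed or removed.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # hypothetical label constants


def lf_says_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_says_refund(text):
    return NEGATIVE if "refund" in text.lower() else ABSTAIN


def label_with_provenance(texts, lfs):
    """Record, for each data point, which labeling function cast which vote."""
    records = []
    for text in texts:
        votes = {lf.__name__: lf(text) for lf in lfs}
        fired = {name: vote for name, vote in votes.items() if vote != ABSTAIN}
        records.append({"text": text, "votes": fired})
    return records


texts = ["Great product", "I want a refund", "Great, another refund form"]
records = label_with_provenance(texts, [lf_says_great, lf_says_refund])

# Audit: which labeling functions contributed to the third data point?
print(records[2]["votes"])

# If lf_says_great turns out to be misleading (e.g., it fires on sarcasm),
# drop it and regenerate every record by re-running the remaining functions.
regenerated = label_with_provenance(texts, [lf_says_refund])
print(regenerated[2]["votes"])
```

Because the labels are derived from code rather than stored as one-off human decisions, the audit trail and the regeneration step both come essentially for free.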

What is a labeling function?

A labeling function is an arbitrary function that takes in a data point and either proposes a label or abstains. Nothing is assumed about the logic inside the function, which makes it a very flexible interface for incorporating domain knowledge from lots of different formats and sources.
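As a minimal sketch of this interface in plain Python (not Snorkel Flow's actual API; the names `ABSTAIN`, `HAM`, `SPAM`, and `lf_contains_free_money` are hypothetical), a labeling function is just a callable that returns either a label or an abstain marker:

```python
from typing import Callable

# Hypothetical label constants for illustration; not part of any Snorkel API.
ABSTAIN = -1
HAM = 0
SPAM = 1

# A labeling function maps one data point (here, a text string) to a label or ABSTAIN.
LabelingFunction = Callable[[str], int]


def lf_contains_free_money(text: str) -> int:
    """Propose SPAM when a telltale phrase appears; otherwise abstain."""
    return SPAM if "free money" in text.lower() else ABSTAIN


print(lf_contains_free_money("Claim your FREE MONEY today"))  # proposes SPAM
print(lf_contains_free_money("Meeting moved to 3pm"))         # abstains
```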

Labeling functions can be quite diverse, ranging from simple heuristics such as looking for a specific keyword or phrase in a text field to more complex functions that wrap other models, perform a database lookup, or utilize embeddings.

Labeling functions do not need to be comprehensive or cover your entire dataset. For many applications, just having a few labeling functions per class is sufficient to create a training set with the information your model needs to perform well on your task as it generalizes beyond the labeling functions.
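To illustrate partial coverage, here is a hedged sketch that applies a handful of labeling functions to unlabeled text and aggregates their votes. It uses a simple majority vote as a stand-in for the aggregation step; more sophisticated aggregation strategies are typically used in practice. All function names and example data are hypothetical.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1  # hypothetical label constants


def lf_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


LFS = [lf_prize, lf_many_exclamations, lf_mentions_meeting]


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN if none or tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie: leave the point unlabeled
    return ranked[0][0]


def label_dataset(texts, lfs):
    """Apply every labeling function to every data point, then aggregate."""
    return [majority_vote([lf(t) for lf in lfs]) for t in texts]


unlabeled = [
    "You won a prize!!! Click now!",
    "Agenda for tomorrow's meeting",
    "Lunch on Friday?",
]
labels = label_dataset(unlabeled, LFS)
coverage = sum(label != ABSTAIN for label in labels) / len(labels)
print(labels, coverage)
```

The third message gets no label because every function abstains on it. It is simply left out of the training set, and the downstream model is expected to generalize to messages like it.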

Applying programmatic labeling to train ML models using Labeling Functions

Types of labeling functions

Labeling functions can come from a wide array of sources, including writing new heuristics (rules, patterns, etc.) or wrapping existing knowledge resources (e.g. models, crowdworker labels, ontologies, etc.). Below are some simple examples of the kinds of labeling functions that you could write for an email spam detector:
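As an illustrative sketch in plain Python (hypothetical names and data; not Snorkel Flow's API), one labeling function of each kind might look like this:

```python
import re

ABSTAIN, HAM, SPAM = -1, 0, 1  # hypothetical label constants

# Hypothetical knowledge resource: senders the organization already trusts.
TRUSTED_SENDERS = {"alice@example.com", "it-helpdesk@example.com"}

URL_SHORTENER = re.compile(r"https?://(bit\.ly|tinyurl\.com)/\S+", re.IGNORECASE)


def lf_keyword_wire_transfer(email):
    """Heuristic: a keyword that often appears in scam emails."""
    return SPAM if "wire transfer" in email["body"].lower() else ABSTAIN


def lf_pattern_short_url(email):
    """Pattern: links hidden behind URL shorteners."""
    return SPAM if URL_SHORTENER.search(email["body"]) else ABSTAIN


def lf_lookup_trusted_sender(email):
    """Knowledge resource: look the sender up in an existing allowlist."""
    return HAM if email["sender"] in TRUSTED_SENDERS else ABSTAIN


def toy_spam_score(body):
    """Stand-in for an existing model; a real one would be loaded, not hand-written."""
    return min(1.0, body.count("$") / 5)


def lf_wrap_existing_model(email):
    """Wrapped model: vote only when the model is confident, else abstain."""
    return SPAM if toy_spam_score(email["body"]) >= 0.9 else ABSTAIN


email = {
    "sender": "promo@unknown.example",
    "body": "Send a wire transfer: http://bit.ly/x1 $$$$$",
}
all_lfs = (lf_keyword_wire_transfer, lf_pattern_short_url,
           lf_lookup_trusted_sender, lf_wrap_existing_model)
votes = [lf(email) for lf in all_lfs]
print(votes)
```

Each function encodes a different source of knowledge, yet all share the same "label or abstain" interface, so they can be applied and aggregated uniformly.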

Programmatic labeling vs. manual labeling

Most training data created today is manually labeled, whether the labeling is done internally or carried out by crowdsourced services. However, organizations face the following key challenges with manually labeled training data:

  • Manual labeling is painfully slow. Even with an unlimited labor force or budget, it can take person-months/years to deliver necessary training data and train models with production-grade quality.
  • Annotation eats up a significant portion of the AI development budget. Manual labeling is inherently expensive as it scales linearly at best and is often error-prone. Data science teams rarely have adequate annotation budgets.
  • Complex data sets need subject matter expertise. Most training data requires highly trained subject matter experts (SMEs) to label, e.g., doctors, legal analysts, or network technicians, who often need to be well-versed in a specific organization’s goals and datasets. However, SMEs are in limited supply, and having them label each data point manually is expensive.
  • Adapting applications often requires relabeling from scratch. Most organizations have to deal with constant change in input data, upstream systems and processes, and downstream goals and business objectives, any of which can render existing training data obsolete. As a result, enterprises must relabel training data constantly.
  • AI applications are hard to govern. Most organizations need to be able to audit how their data is being labeled, and consequently, what their AI systems are learning from. Even when outsourcing labeling is an option, performing essential audits on hand-labeled data is a near impossibility.

Programmatic labeling implemented with Snorkel Flow enables enterprises to:

  • Reduce time and costs associated with labeling with automation.
  • Translate subject matter expertise into training data more efficiently with labeling functions.
  • Adapt applications to data drift or new business needs with a few push-button actions or code changes.
  • Make training data easier to audit and manage, and build governable AI applications.

Programmatic labeling using Snorkel Flow

The Snorkel AI team has applied programmatic labeling to a variety of real-world problems and learned much from our experience. We built Snorkel Flow, a data-centric AI platform, based on dozens of deployments of this technology. We aim to make this (and other state-of-the-art ML techniques) intuitive, performant, and accessible.

A few areas where Snorkel Flow facilitates full workflows based on programmatic labeling:

  • Explore your data at varying granularities (e.g., individually or as search results, embedding clusters, etc.)
  • Write no-code Labeling Functions (LFs) using templates in a GUI or custom code LFs in an integrated notebook environment
  • Auto-generate LFs based on small labeled data samples
  • Use programmatic active learning to write new LFs for unlabeled or low-confidence data point clusters
  • Receive prescriptive feedback and recommendations to improve existing LFs
  • Execute LFs at massive scale over unlabeled data to auto-generate weak labels
  • Auto-apply best-in-class label aggregation strategies intelligently selected from a suite of available algorithms based on your dataset’s properties
  • Train out-of-the-box, industry-standard models over the resulting training sets with one click in the platform, or incorporate custom models via the Python SDK
  • Perform AutoML searches over hyperparameters and advanced training options
  • Engage in guided and iterative error analysis across both model and data to improve model performance
  • Deploy final models as part of larger applications using your chosen production serving infrastructure
  • Monitor model performance overall and on specific dataset slices of interest
  • Adapt easily to new requirements or data distribution shifts by adjusting labeling functions and regenerating a training set in minutes

Alternative approaches

Outsourcing / crowdsourcing

You can make manual labeling somewhat more “scalable” by parallelizing it across many annotators. This usually means outsourcing to annotators for hire, either through companies that specialize in these services or through marketplaces where annotators self-select. The difficulty is in finding annotators with domain expertise and managing any privacy constraints that limit who can view the data to be labeled. Outsourcing also shares the limitations of in-house manual labeling: high costs, poor scalability and adaptability, and low transparency for auditing training labels.

Model-assisted labeling

Model-assisted labeling (“pre-labeling”) uses an existing model to propose initial labels that a human can approve or reject. This can be seen as another type of improved UI for manual labeling. For some tasks—particularly those where collecting an individual label takes a long time or a lot of button clicks (e.g., image segmentation)—this can save the time that it would have taken to make those clicks yourself. But the human is still viewing and verifying each example that receives a label, creating the labels one-by-one without associated functions for adaptability or governance, making it another marginal improvement of manual labeling.

Active learning

Active learning techniques aim to make human labelers more efficient by suggesting the order in which examples should be labeled. For example, if time and money limit how many examples you can label, then you may prefer to label data points that are relatively more dissimilar from one another (and therefore more likely to contribute novel information) or those about which your model is least confident.
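One common ordering strategy, least-confidence sampling, can be sketched as follows. This is a generic illustration, not a description of any particular product's implementation; the function name and the probabilities are hypothetical.

```python
def least_confident_first(probabilities):
    """Order example indices so the model's least confident predictions come first.

    `probabilities` holds one list of class probabilities per unlabeled example.
    Confidence is taken to be the probability of the most likely class.
    """
    confidence = [max(p) for p in probabilities]
    return sorted(range(len(probabilities)), key=lambda i: confidence[i])


# Hypothetical model outputs for four unlabeled examples (binary task).
probs = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.51, 0.49]]
queue = least_confident_first(probs)
print(queue)  # [3, 1, 2, 0]
```

A human annotator would then label examples in the order given by `queue`, starting with the ones the model is closest to a coin flip on.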

Reordering data points can boost manual labeling efficiency, particularly near the beginning of a project. As the model learns more, however, the problem of determining the most valuable points to label can be nearly as hard as the original problem, making gains over random sampling fairly marginal.

Intuitively speaking: if you knew exactly where the decision boundary between classes was, you wouldn’t need to train a classifier. This approach has the same issues around indirect knowledge transfer, adaptability, and governance as other manual approaches. It still operates on the level of individual labels collected one at a time without associated labeling rationale. In general, active learning is a classic, very standard, and very sensible way to make manual labeling more efficient—but it’s not a step change.

When to use programmatic labeling

  • When training labels need to be collected quickly.
  • When lots of training data is required for best results (most models improve with more data, but especially deep learning models).
  • When labeling requires domain expertise that is difficult to crowdsource.
  • When privacy concerns prevent data from being shared externally.
  • When the task experiences frequent shifts in schema or product needs (and therefore requires frequent relabeling).
  • When model governance or transparency is a priority. If you detect a bias or error mode, programmatic labeling enables you to quickly identify what labeling sources are responsible and rapidly address the issue.

Use cases for programmatic labeling

Programmatic labeling can be applied to most supervised learning problems. The Snorkel team has applied it to text data (long and short), conversations, time series, PDFs, images, videos, and more. The “labeling function” abstraction is flexible enough that the same workflow and framework applies in all cases—just swap out how you view your data, what types of labeling functions you use, and whichever model architecture is most appropriate for your data type.

Some of the use cases include:

  • Text and document classification
  • Information extraction from unstructured text, PDF, HTML and more
  • Rich document processing
  • Structured data classification
  • Conversational AI and utterance classification
  • Entity linking
  • Image and cross-modal classification
  • Time series analysis
  • Video classification
