Snorkel Research Project
What is Programmatic Labeling?
Programmatic labeling is an approach to labeling that breaks through the primary bottleneck limiting AI today: creating high-quality training sets in a way that is scalable, adaptable, and governable.
The primary difference between manual labeling and programmatic labeling is the type of input that the user provides. With manual labeling, user input comes in the form of individual labels, created one by one. With programmatic labeling, users instead create labeling functions, which capture labeling rationales and can be applied to vast amounts of unlabeled data and aggregated to auto-label large training sets. This approach leads to a number of benefits over manual labeling:
Scalability: Once you have written a labeling function, no additional human effort is required to label your data—be it thousands or millions of data points—resulting in training datasets that are orders of magnitude larger and/or faster to create than those produced via manual labeling.
Adaptability: When requirements change, data drifts or new error modes are detected, training sets need to be relabeled. With a manual labeling process, this means manually reviewing each affected data point a second, third, or tenth time, multiplying the cost of both time and money to develop and maintain a high-quality model. When you produce your labels programmatically, on the other hand, recreating all your training labels is as simple as adding or modifying a small, targeted number of labeling functions and re-executing them, which can now occur at computer speed, not human speed.
Governability: When labeling by hand, users leave no record of their thought process behind the labels they provide, making it difficult to audit what their labeling decisions were—both in general and on individual examples. This presents a challenge for compliance, safety, and quality control. With programmatic labeling, every training label can be traced back to specific inspectable functions. If bias or other undesirable behavior is detected in your model, you can trace that back to its source (one or more labeling functions) and improve or remove them, then regenerate your training set programmatically in minutes.
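To make the idea concrete, here is a minimal sketch in plain Python of applying several labeling functions to unlabeled text, aggregating their votes, and keeping a record of which functions produced each label. It does not use Snorkel Flow or any Snorkel API. The function names, the spam/ham task, the example messages, and the simple majority vote are all assumptions chosen for illustration; production systems typically use more sophisticated aggregation.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam task.
ABSTAIN = -1
HAM, SPAM = 0, 1

def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_link, lf_short_message]

def label_with_provenance(text, lfs=LABELING_FUNCTIONS):
    """Apply every labeling function, majority-vote the non-abstaining
    votes, and return (label, votes_cast) so the label can be audited."""
    votes = {lf.__name__: lf(text) for lf in lfs}
    cast = {name: vote for name, vote in votes.items() if vote != ABSTAIN}
    if not cast:
        return ABSTAIN, cast  # no function had an opinion
    ranked = Counter(cast.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN, cast  # tie between classes: leave unlabeled
    return ranked[0][0], cast

unlabeled = [
    "Claim your FREE prize at http://example.com now",
    "see you at lunch",
    "The quarterly report is attached for your review today",
]

# Each entry is (text, label, {labeling_function_name: vote}).
training_set = [(text, *label_with_provenance(text)) for text in unlabeled]

for text, label, provenance in training_set:
    print(label, provenance, text)
```

Because every label carries the names of the functions that voted for it, changing or removing one function and re-running the loop regenerates the whole training set, which is the adaptability and governability argument in miniature.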
What is a Labeling Function?
A labeling function is an arbitrary function that takes in a data point and either proposes a label or abstains. Nothing is assumed about the logic inside the function, which makes it a very flexible interface for incorporating domain knowledge from lots of different formats and sources.
Labeling functions can be quite diverse, ranging from simple heuristics such as looking for a specific keyword or phrase in a text field to more complex functions that wrap other models, perform a database lookup, or utilize embeddings.
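As a rough illustration of that diversity, the sketch below shows three labeling functions in plain Python: a keyword heuristic, a regular-expression pattern, and a lookup against a reference set. All names, label values, and the complaint-detection task are hypothetical, and the in-memory set merely stands in for a real database lookup.

```python
import re

# Hypothetical label values for an illustrative complaint-detection task.
ABSTAIN = -1
NOT_COMPLAINT, COMPLAINT = 0, 1

def lf_keyword_refund(text):
    """Keyword heuristic: mentioning a refund suggests a complaint."""
    return COMPLAINT if "refund" in text.lower() else ABSTAIN

THANKS_PATTERN = re.compile(r"\b(thanks|thank you)\b", re.IGNORECASE)

def lf_regex_thanks(text):
    """Pattern heuristic: expressions of thanks suggest a non-complaint."""
    return NOT_COMPLAINT if THANKS_PATTERN.search(text) else ABSTAIN

# Stand-in for a database table of products with known defects (assumed data).
KNOWN_DEFECTIVE_PRODUCTS = {"model-x100", "model-z9"}

def lf_lookup_defective_product(text):
    """Lookup heuristic: mentioning a known-defective product suggests a complaint."""
    tokens = set(text.lower().split())
    return COMPLAINT if tokens & KNOWN_DEFECTIVE_PRODUCTS else ABSTAIN

print(lf_keyword_refund("I want a REFUND"))
print(lf_regex_thanks("Thank you so much!"))
print(lf_lookup_defective_product("my model-x100 stopped working"))
```

Each function shares the same interface (a data point in, a label or abstain out), so a function that wraps another model or compares embeddings could be added alongside these without changing anything else.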
Labeling functions do not need to be comprehensive or cover your entire dataset. For many applications, just having a few labeling functions per class is sufficient to create a training set with the information your model needs to perform well on your task as it generalizes beyond the labeling functions.
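One way to see how much of a dataset a set of labeling functions reaches is to measure coverage: the fraction of data points on which a function does not abstain. The sketch below is an assumption-laden illustration in plain Python with two hypothetical sentiment functions and four made-up reviews; it is not a Snorkel API.

```python
# Hypothetical label values for an illustrative sentiment task.
ABSTAIN = -1
NEGATIVE, POSITIVE = 0, 1

def lf_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN

LFS = [lf_great, lf_terrible]

reviews = [
    "Great battery life",
    "Terrible customer service",
    "Arrived on Tuesday",
    "The screen is great",
]

# One row per data point, one column per labeling function.
label_matrix = [[lf(text) for lf in LFS] for text in reviews]

def coverage_per_lf(matrix, lfs):
    """Fraction of data points each labeling function votes on."""
    n = len(matrix)
    return {
        lf.__name__: sum(row[j] != ABSTAIN for row in matrix) / n
        for j, lf in enumerate(lfs)
    }

def overall_coverage(matrix):
    """Fraction of data points that receive at least one vote."""
    return sum(any(v != ABSTAIN for v in row) for row in matrix) / len(matrix)

print(coverage_per_lf(label_matrix, LFS))
print(overall_coverage(label_matrix))
```

Here one review receives no vote at all. That is acceptable in this workflow: the model trained on the labeled portion is expected to generalize to examples the functions never touched.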
Types of Labeling Functions
Programmatic Labeling vs. Manual Labeling
Most training data created today is manually labeled, whether the work is done internally or carried out by crowdsourced services. However, organizations face the following key challenges with manually labeled training data:
- Manual labeling is painfully slow. Even with an unlimited labor force or budget, it can take person-months/years to deliver necessary training data and train models with production-grade quality.
- Annotation eats up a significant portion of the AI development budget. Manual labeling is inherently expensive as it scales linearly at best and is often error-prone. Data science teams rarely have adequate annotation budgets.
- Complex datasets need subject matter expertise. Most training data requires highly trained subject matter experts (SMEs) to label, such as doctors, legal analysts, and network technicians, who often also need to be well-versed in a specific organization’s goals and datasets. However, SMEs are in limited supply, and having them label each data point manually is expensive.
- Adapting applications often requires relabeling from scratch. Most organizations have to deal with constant change in input data, upstream systems and processes, and downstream goals and business objectives, which renders existing training data obsolete. As a result, enterprises have to relabel training data constantly.
- AI applications are hard to govern. Most organizations need to be able to audit how their data is being labeled and, consequently, what their AI systems are learning from. Even when outsourcing labeling is an option, performing essential audits on hand-labeled data is a near impossibility.
Programmatic labeling implemented with Snorkel Flow enables enterprises to:
- Reduce the time and cost of labeling through automation.
- Translate subject matter expertise into training data more efficiently with labeling functions.
- Adapt applications to data drift or new business needs with a few push-button actions or code changes.
- Make training data easier to audit and manage, and build governable AI applications.
Programmatic Labeling Using Snorkel Flow
For over half a decade, the Snorkel AI team has applied the concept of programmatic labeling to a wide array of real-world problems and learned much over that time. We built Snorkel Flow, a data-centric AI platform based on dozens of deployments of this technology, with the goal of making it (and other state-of-the-art ML techniques) intuitive, performant, and accessible.
A few areas where Snorkel Flow facilitates full workflows based on programmatic labeling:
- Explore your data at varying granularities (e.g., individually or as search results, embedding clusters, etc.)
- Write no-code Labeling Functions (LFs) using templates in a GUI or custom code LFs in an integrated notebook environment
- Auto-generate LFs based on small labeled data samples
- Use programmatic active learning to write new LFs for unlabeled or low-confidence data point clusters
- Receive prescriptive feedback and recommendations to improve existing LFs
- Execute LFs at massive scale over unlabeled data to auto-generate weak labels
- Auto-apply best-in-class label aggregation strategies intelligently selected from a suite of available algorithms based on your dataset’s properties
- Train out-of-the-box, industry-standard models over the resulting training sets with one click in the platform, or incorporate custom models via the Python SDK
- Perform AutoML searches over hyperparameters and advanced training options
- Engage in guided and iterative error analysis across both model and data to improve model performance
- Deploy final models as part of larger applications using your chosen production serving infrastructure
- Monitor model performance overall and on specific dataset slices of interest
- Adapt easily to new requirements or data distribution shifts by adjusting labeling functions and regenerating a training set in minutes.
Approaches for Dealing with Limited Training Data Other Than Programmatic Labeling
Outsourcing / Crowdsourcing
When Should Programmatic Labeling Be Used?
- When training labels need to be collected quickly.
- When lots of training data is required for best results (most models improve with more data, but especially deep learning models).
- When labeling requires domain expertise that is difficult to crowdsource.
- When privacy concerns prevent data from being shared externally.
- When the task experiences frequent shifts in schema or product needs (and therefore requires frequent relabeling).
- When model governance or transparency is a priority. If you detect a bias or error mode, programmatic labeling enables you to quickly identify what labeling sources are responsible and rapidly address the issue.
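The governance point in the last item can be sketched as follows: given a small hand-labeled development set, score each labeling function on the examples where it votes, and the function responsible for an error mode tends to stand out. Everything here is hypothetical (the loan-approval task, the field names, the thresholds, and the four development examples), and the code is plain Python rather than any Snorkel API.

```python
# Hypothetical label values for an illustrative loan-approval task.
ABSTAIN = -1
REJECT, APPROVE = 0, 1

def lf_high_income(applicant):
    return APPROVE if applicant["income"] >= 80_000 else ABSTAIN

def lf_zip_code(applicant):
    # A deliberately questionable heuristic, included so the audit has
    # something to find.
    return REJECT if applicant["zip"].startswith("9") else ABSTAIN

LFS = [lf_high_income, lf_zip_code]

# Small hand-labeled development set: (data point, ground-truth label).
dev_set = [
    ({"income": 95_000, "zip": "94110"}, APPROVE),
    ({"income": 40_000, "zip": "90210"}, APPROVE),
    ({"income": 85_000, "zip": "10001"}, APPROVE),
    ({"income": 30_000, "zip": "97201"}, REJECT),
]

def lf_accuracy(lf, labeled_examples):
    """Accuracy of one labeling function on the examples where it votes.
    Returns None if the function abstains everywhere."""
    cast = [(lf(x), y) for x, y in labeled_examples if lf(x) != ABSTAIN]
    if not cast:
        return None
    return sum(vote == truth for vote, truth in cast) / len(cast)

report = {lf.__name__: lf_accuracy(lf, dev_set) for lf in LFS}
for name, accuracy in report.items():
    print(name, accuracy)
```

In this toy data the zip-code function is wrong on most of the examples it labels, so it is the natural candidate to fix or remove before regenerating the training set.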
Use Cases of Programmatic Labeling
Programmatic labeling can be applied to most supervised learning problems. The Snorkel team has applied it to text data (long and short), conversations, time series, PDFs, images, videos, and more. The “labeling function” abstraction is flexible enough that the same workflow and framework applies in all cases—just swap out how you view your data, what types of labeling functions you use, and whichever model architecture is most appropriate for your data type.
Some of the use cases include:
- Text and document classification
- Information extraction from unstructured text, PDF, HTML and more
- Rich document processing
- Structured data classification
- Conversational AI and utterance classification
- Entity linking
- Image and cross-modal classification
- Time series analysis
- Video classification
Where to Learn More about Programmatic Labeling
Accelerate AI with Programmatic Labeling
It’s clear that programmatic labeling can be instrumental not only in automating the labeling process while keeping a human in the loop, but also in accelerating AI development.
But where do you start? Rather than dealing with annotation guides and contracts for crowdsourced labeling, see Snorkel Flow in action.
With Snorkel Flow, Fortune 1000 organizations such as Chubb, BNY Mellon, Genentech, and more have built accurate and adaptable AI applications fast by putting the power of programmatic labeling to use.