Data labeling: a practical guide (2024)
Data labeling remains a core requirement for any organization looking to use machine learning to solve tangible business problems, especially with the increased development and adoption of LLMs. How you develop and use your proprietary data is the key to unlocking its value and to delivering accurate, reliable, and trustworthy AI and ML applications.
Use this handbook to gain a thorough understanding of data labeling fundamentals as they apply to both predictive and generative AI—and to find the right approach for your project.
Data labeling in the age of large language models (LLMs)
The recent explosion of large language models (LLMs) and generative AI (genAI) has made data labeling more necessary than ever.
LLMs embody a wealth of useful information in their pre-trained weights, but they typically fall short of a full solution out of the box. Gaps in their knowledge, or an inability to transfer what they know to the task at hand, can hinder their performance.
This is where data labeling fits in.
Instruction tuning: an essential new step in developing AI models
Instruction tuning (fine-tuning on high-quality responses to instructions) has emerged as an important step in developing performant large language models for genAI tasks. In practice, this means curating additional labeled data and using it to fine-tune powerful, publicly available LLMs.
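To make this concrete, an instruction-tuning dataset is essentially a curated set of instruction/response pairs written or reviewed by human labelers. A minimal sketch of what such records might look like (the field names and examples are illustrative, not any particular vendor's schema):

```python
# Illustrative instruction-tuning records (hypothetical field names and content).
# Each pair maps an instruction, plus optional input, to a high-quality,
# human-written response.
instruction_tuning_examples = [
    {
        "instruction": "Summarize the customer complaint in one sentence.",
        "input": "I was charged twice for my March invoice and support has not replied in a week.",
        "response": "The customer was double-billed in March and has not heard back from support.",
    },
    {
        "instruction": "Classify the sentiment of the review as positive, negative, or neutral.",
        "input": "The checkout flow is confusing, but delivery was fast.",
        "response": "neutral",
    },
]

# During instruction tuning, each record is rendered into a prompt/target string
# pair and used to update the weights of the pre-trained base model.
```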
ML researchers and practitioners at AI startups and large enterprises have relied on curating additional labeled data to achieve better performance on specific tasks. OpenAI has reported using human labelers to collect data for fine-tuning its GPT models, and has reportedly hired hundreds, if not thousands, of additional contractors since ChatGPT was released. Meta reported using 10 million human-annotated examples to train Llama 3.
This points to a future where enterprise development of LLMs centers on data development, and data labeling remains a core requirement.
What is data labeling?
Labeled data teaches models how to understand inputs and enables them to make useful predictions. That makes data labeling a foundational requirement for any supervised machine learning application—which describes the vast majority of ML projects.
At a basic level, data labeling prepares data sets to teach models what inputs correspond to which outputs. This process takes raw documents, files, or tabular records and adds one or more tags or labels to each.
This approach applies across all data modalities. For example:
- Written content moderation: Community administrators can use content that has been labeled as unacceptable to develop a model to flag—or block—potentially offensive posts.
- Audio classification: A record company that acquires a new catalog can build a model to assign each song to a genre based on the labels and sonic qualities in their existing catalog.
- Visual object detection: A computer vision model may learn from captions or labels attached to pictures to predict whether photos contain things like dogs, cats, bridges, automobiles, or bicycles.
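To ground the content-moderation example above, here is a minimal, hypothetical sketch of what a labeled dataset looks like before training (the records and labels are invented for illustration):

```python
# A toy labeled dataset for written content moderation.
# Each raw post is paired with the label a supervised model will learn to predict.
labeled_posts = [
    {"text": "Great discussion, thanks for sharing the sources!", "label": "acceptable"},
    {"text": "You are all idiots and should be banned.", "label": "unacceptable"},
    {"text": "Buy cheap followers now at this link!!!", "label": "unacceptable"},
]

# Training a classifier on pairs like these teaches it to flag or block
# similar posts it has never seen before.
```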
Focusing primarily on developing data sets falls into the category of data-centric AI, which stands in contrast to model-centric AI. In model-centric AI, data scientists assume that data sets are static and aim to achieve better results by optimizing model architectures and parameters. Data-centric AI assumes that approaches like AutoML will identify appropriate model architectures, and that the best way to improve performance is through developing clean and robust training data.
Some machine learning algorithms, such as clustering and self-supervised learning, do not require data labels, but their direct business applications are limited. Use cases for supervised machine learning models, on the other hand, cover many business needs.
All of this makes data labeling a vital part of the machine learning pipeline.
Key approaches to data labeling
When enterprises decide to embark on a new AI project, one of the first decisions project leaders must make is how to approach their data labeling process. While nuances exist, their choices generally fall into the following categories:
- Internal manual labeling
- External manual labeling
- Semi-supervised labeling
- Automated labeling (model distillation)
- Programmatic labeling
Finding the right data labeling approach for your project requires first understanding the most common approaches to the task as well as each of their advantages and disadvantages. With a solid background, you can easily select the approach best suited to your project and organization.
In the following section, we will examine each of these approaches and help you choose.
Internal manual labeling
Internal manual labeling by employees or experts is the most fundamental technique for developing training data for machine learning models.
At a basic level, it involves manually examining each data point and using subject-matter expertise to label it. Data scientists and machine learning engineers generally consider this approach the gold standard for data quality.
In some cases, this is the only approach that makes sense. For example, a team may ask physicians to hand-label radiology images when developing a computer vision model.
However, the manual approach presents challenges. Labelers must possess the expertise required to label the examples appropriately. Additionally, organizations need significant financial resources to scale this approach and develop sufficient training sets for modern models.
External manual labeling
In external manual labeling (also known as “crowdsourcing”), data teams break labeling tasks into chunks and assign them to individuals—generally contractors or temp workers. The workers then apply labels according to the rules defined for the task.
The ability to add more workers on demand and dismiss them at job completion allows data teams to scale their data-generation capacity as needed while reducing or eliminating the need to hire more employees. It also reduces costs by shifting data labeling work away from specialized, high-value employees.
While a good option under some conditions, outsourcing your data labeling to third-party contractors can create considerable challenges, such as:
- Linear cost scaling.
- Privacy concerns.
- Lack of subject matter expertise.
- Increased management overhead.
- Poor data quality.
Barriers to high-quality data labels range from spammers to simple human error. Crowdsourcing platforms mitigate these challenges with control measures, but these do not catch every issue.
Data science teams may also struggle to find crowdsourcing vendors with the subject matter expertise required for their labeling task. This can result in poor data quality or significant overhead created by training and managing crowd labelers.
Linear cost scaling can also present a roadblock to artificial intelligence projects. After initial startup costs, each additional label costs a fixed amount. For tasks that require an enormous amount of labeled data, data science teams will want sub-linear cost scaling.
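As a rough, illustrative calculation (the per-label price here is hypothetical): at $0.10 per label, 100,000 labels cost about $10,000 and 1,000,000 labels cost about $100,000. The millionth label costs just as much as the tenth, which is what linear cost scaling means in practice.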
Semi-supervised labeling
Semi-supervised labeling leverages the structure of unlabeled data to complement the labeled data.
In one semi-supervised approach, data science teams collect a small number of initial labeled data points. With these, they train two or more small models using different model architectures. Then, they ask these models to predict the labels for the unlabeled data and apply the predicted labels when two or more models agree.
In another approach, data scientists use a graph over the data points, or distances within the feature space, to propagate labels from labeled points to nearby unlabeled points.
Subsequently, the data science team trains a final model over the entire dataset, allowing them to boost model performance with a smaller amount of labeled data.
These approaches make assumptions about smoothness, low-dimensional structure, or distance metrics, and may be a poor match for some machine learning problems.
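A minimal sketch of the agreement-based variant described above, using scikit-learn; the data here is random and purely illustrative, and a real project would tune the models and agreement criteria:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A small labeled seed set and a much larger pool of unlabeled examples (toy data).
X_labeled = rng.random((100, 20))
y_labeled = rng.integers(0, 2, 100)
X_unlabeled = rng.random((2000, 20))

# Train two models with different architectures on the labeled seed set.
model_a = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
model_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_labeled, y_labeled)

# Pseudo-label only the unlabeled points where both models agree.
pred_a = model_a.predict(X_unlabeled)
pred_b = model_b.predict(X_unlabeled)
agree = pred_a == pred_b

X_train = np.vstack([X_labeled, X_unlabeled[agree]])
y_train = np.concatenate([y_labeled, pred_a[agree]])

# Train the final model on the combined labeled + pseudo-labeled data.
final_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
```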
Automated data labeling (model distillation)
In automated data labeling, data scientists use existing foundation models to label data at scale with minimal human intervention. Also known as “model distillation,” this approach leverages the significant knowledge embodied in enormous models as a “teacher” to train a “student” model—which is typically much smaller.
Data scientists can apply this approach with large language models or multimodal models for image, video, or audio-classification tasks. Initial results can appear powerful (especially when bolstered by clever prompting techniques) but often fall short of production deployment needs.
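A minimal sketch of the distillation pattern for text classification, assuming access to a hosted LLM through the OpenAI Python client (the model name, prompt, examples, and label set are illustrative assumptions, not a recommended configuration):

```python
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

client = OpenAI()  # assumes an API key is configured in the environment

# Unlabeled examples and a candidate label set (both illustrative).
unlabeled_texts = [
    "How do I reset my online banking password?",
    "My card was charged twice for the same purchase.",
    "What are your branch opening hours on weekends?",
]
candidate_labels = ["account_access", "billing_dispute", "general_info"]

def teacher_label(text: str) -> str:
    """Ask the 'teacher' LLM to choose one label for a single example."""
    prompt = (
        f"Classify the banking request into exactly one of {candidate_labels}. "
        f"Answer with the label only.\nRequest: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# 1) Let the teacher label the corpus; 2) train a small "student" model on its labels.
teacher_labels = [teacher_label(text) for text in unlabeled_texts]
features = TfidfVectorizer().fit_transform(unlabeled_texts)
student_model = LogisticRegression(max_iter=1000).fit(features, teacher_labels)
```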
Foundation models, trained on a wide array of unstructured data, embody general knowledge. This limits their ability to accurately return labels for specialized domains. In one Snorkel case study, Google’s PaLM achieved an F1 of 50 on classifying statements typed into a banking chatbot. Several rounds of focused prompt engineering lifted that F1 to 69. While a significant improvement, it would not have been accurate enough for an enterprise to deploy.
This approach works well for early experiments and for applications that can tolerate lower accuracy. Automated data labeling (model distillation) can also serve as a useful starting point for programmatic labeling.
Programmatic labeling
Programmatic labeling (also known as “weak supervision”) stands apart from other data labeling approaches because it can incorporate all of them. This approach combines sources of supervision—ranging from human-supplied labels to noisy high-level heuristics to LLMs—to create large probabilistic data sets orders of magnitude faster than manual labeling.
By combining signals and learning where they agree and disagree, the weak supervision algorithm learns when, where, and how much to trust each one. This results in a set of confidence-weighted labels a data science team can use to train a final machine learning model.
The Snorkel Flow AI data development platform pairs weak supervision with labeling functions. Data teams collaborate with subject matter experts to build labeling functions, which are rules that can apply to large portions of the data set. For example, you may assume that all emails that mention prescription drugs or wire transfers are spam. This allows subject matter experts to apply their intuition to dozens, hundreds, or thousands of records at once.
While labeling functions can take the form of simple rules, they can also gather input from large language models or external knowledge sources to achieve things that regex searches cannot.
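Snorkel Flow itself is a commercial platform, but the core pattern can be sketched with the open-source snorkel library. The spam heuristics below mirror the email example above and are illustrative only:

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

@labeling_function()
def mentions_prescription_drugs(x):
    # Heuristic from the example above: prescription-drug mentions suggest spam.
    return SPAM if "prescription" in x.text.lower() else ABSTAIN

@labeling_function()
def mentions_wire_transfer(x):
    return SPAM if "wire transfer" in x.text.lower() else ABSTAIN

@labeling_function()
def from_known_contact(x):
    # A signal pointing the other way: mail from known contacts is rarely spam.
    return NOT_SPAM if x.sender_known else ABSTAIN

# A tiny illustrative dataset; a real project would apply these to the full corpus.
df = pd.DataFrame({
    "text": [
        "Cheap prescription meds, order today!",
        "Lunch tomorrow?",
        "Urgent: please authorize this wire transfer.",
    ],
    "sender_known": [False, True, False],
})

# Apply every labeling function to every email, then let the label model learn
# how much to trust each source and emit confidence-weighted training labels.
lfs = [mentions_prescription_drugs, mentions_wire_transfer, from_known_contact]
L_train = PandasLFApplier(lfs=lfs).apply(df=df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=123)
probabilistic_labels = label_model.predict_proba(L_train)
```

The resulting confidence-weighted labels can then be used to train any downstream model, exactly as hand-labeled data would be.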
Snorkel Flow users iterate through a data development loop in which they add, edit, and remove labeling functions according to how they impact the performance of the model.
This approach also lets users dynamically adapt labeling schemas. When project leaders discover they need more categories than planned, they can adapt their labeling functions without starting from scratch.
This can result in teams building deployable models in days that could take months through manual labeling alone.
Sources of supervision could include any of the following:
- Simple, rule-based heuristics
- Legacy models
- Crowdworker labels
- Subject matter expert manual labels
- External predictive services
- Prompts fed to large language models
Selecting the appropriate technique for your team
When choosing an appropriate data labeling process, data science teams should evaluate their options along a few key considerations:
- The project’s budget.
- Whether the data must remain private.
- The scalability of the approach.
- The adaptability of the approach.
- Whether the approach leverages domain expertise.
| | Internal manual labeling | External manual labeling | Semi-supervised labeling | Automated labeling | Programmatic labeling |
| --- | --- | --- | --- | --- | --- |
| Leverages domain expertise | 🔷 | | | | 🔷 |
| Budget friendly | | 🔷 | 🔷 | 🔷 | 🔷 |
| Data privacy | 🔷 | | 🔷 | 🔷 | 🔷 |
| Scalability | | 🔷 | 🔷 | 🔷 | 🔷 |
| Adaptability | | | | 🔷 | 🔷 |
Leverages domain expertise
Complexity often determines which approaches make sense for which machine learning project.
Problems that call for domain expertise can put crowdsourcing out of reach. Data teams may struggle to develop labeling instructions that induce correct responses from crowd workers, and the nature of the data may make it challenging to leverage automated or semi-automated techniques.
Domain-expert-friendly techniques:
- Internal manual labeling
- Programmatic labeling
Budget-friendly
Budget can prove a significant hurdle for labeling projects. Data labeling costs directly impact the return on investment yielded by the final application, and can often halt a project before it starts.
Projects that promise to make or save a company a large amount of money can take advantage of more expensive means to label data, such as crowdsourcing and in-house manual labeling with subject matter experts. However, this is often not the case, which may drive teams to adopt more cost-efficient approaches.
Budget-friendly techniques:
- Semi-supervised labeling
- Programmatic labeling
- Crowdsourcing (depending on dataset size)
- Automated data labeling
Data privacy
In privacy-sensitive organizations, such as healthcare and financial institutions, teams are often unable to distribute data to third-party groups for labeling.
Oftentimes, organizations even restrict which internal teams may interact with the data. In these cases, teams must adopt in-house data labeling techniques such as internal manual labeling and programmatic labeling.
Privacy-friendly techniques:
- Internal manual labeling
- Semi-supervised labeling
- Programmatic labeling
- Automated data labeling*
*Whether or not automated data labeling/model distillation keeps data private will depend on the deployment architecture of the foundation model. A locally-deployed, open source foundation model will be reliably private. Using an external API may not be.
Scalability
LLMs and other data-hungry models make it increasingly necessary for data teams to prepare large amounts of labeled data in order to achieve production-ready results.
If the use case requires high precision or recall, or the data domain is highly specific, the model will require a greater volume of labeled data to reach performance goals. Machine learning teams working on these projects look to data labeling approaches that can scale appropriately to their use case.
Scalability-friendly techniques:
- External manual labeling
- Semi-supervised labeling
- Programmatic labeling
- Automated data labeling
Adaptability
Few machine learning models remain static. Teams update them over time as the data distribution shifts or business objectives change. This encourages teams to adopt adaptable data labeling approaches that can be easily integrated into existing pipelines and modified over time.
For example, when building models for fast-moving spaces (e.g. news analytics), the labeling schema may change several times per year. In these situations, ML teams need adaptable approaches to label and re-label data. With non-adaptable labeling approaches, the team has to revisit each data point by hand. This leads to efficiency losses that may outweigh the gains from ML.
With a more adaptable approach like programmatic labeling, data teams can adjust the small set of labeling functions (e.g. adding new ones to capture a newly added class) and regenerate an updated set of training labels in seconds.
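Continuing the open-source snorkel sketch from earlier (and reusing its imports, labeling functions, and dataframe), adding a class typically amounts to writing one more labeling function and re-fitting the label model. The PHISHING class and heuristic here are invented for illustration:

```python
# Suppose the schema gains a third class, PHISHING, after the project has started.
ABSTAIN, NOT_SPAM, SPAM, PHISHING = -1, 0, 1, 2

@labeling_function()
def asks_for_credentials(x):
    # New heuristic targeting the newly added class.
    return PHISHING if "verify your password" in x.text.lower() else ABSTAIN

# Re-apply the expanded set of labeling functions and re-fit the label model.
# Existing labeling functions are reused; no records are re-labeled by hand.
lfs = lfs + [asks_for_credentials]
L_train = PandasLFApplier(lfs=lfs).apply(df=df)
label_model = LabelModel(cardinality=3, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=123)
updated_labels = label_model.predict_proba(L_train)
```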
Highly domain-specific use cases (e.g. labeling radiology images) will likely maintain a more static labeling schema, and benefit less from modular approaches.
Adaptability-friendly techniques:
- Programmatic labeling
- Automated data labeling
Data labeling: the foundation of any AI project
Nearly every AI application starts with data labeling, and the choices teams make in how to label their data can have a drastic impact on the project’s return on investment.
Teams may consider several approaches—from internal manual labeling to programmatic labeling—but their ultimate choice will depend on the complexity of the task, budget considerations, scale, and data privacy. Regardless of the use case or ML maturity of the organization, an effective data labeling strategy is paramount for the success of any ML project.
Learn More
Follow Snorkel AI on LinkedIn, Twitter, and YouTube to be the first to see new posts and videos!