AI data development: a guide for data science projects
What is AI data development? AI data development includes any action taken to convert raw information into a format useful to AI.
This definition covers everything from the point of acquiring data to the point at which a data scientist exports a model for deployment. Many of these actions overlap with or bolster other data management techniques (such as tabularization and feature engineering, which may be useful for setting up databases and enabling downstream data analysis), but AI data development techniques focus on preparing data to train useful, deployable models.
For most of machine learning’s history, data scientists treated data as a static resource; it came pre-labeled by business or data collection processes. Data scientists mostly accepted the data as it was and iterated on model architectures and hyperparameters to build the best-performing application they could. This left teams constrained in what models they could build and how.
Data scientists have always engaged in some amount of data development. They normalize values, drop rows with missing data, and convert categorical columns into multiple boolean columns. The modern view of AI data development, which falls under the category of “data-centric AI,” reframes how data scientists work with data to include labeling and re-labeling it to improve model performance—often while keeping the base model or model architecture static.
In short, data development treats data like software—as a resource to iteratively change and improve to fit the project’s needs.
In this explainer, we will describe what AI data development is and why we think data science teams at every enterprise should embrace a wider definition of it.
The classical state of data
Enterprise data science teams have classically worked with data created in one of the following ways:
- Data automatically labeled by business processes. When a customer cancels their account, the database automatically labels that event.
- Data labeled as a byproduct of the collection process. For example, data scientists could designate any column from a U.S. Census Bureau table as a target value.
- Data hand-labeled by subject matter experts (SMEs). Reserved for high-value applications, businesses sometimes pay to have experts meticulously label examples.
Regardless of the labels’ source, classical labeling approaches limited how data scientists could use them. Building a model around something that’s not naturally tagged—for example, categorizing paragraphs in a contract—required a lot of work. Data teams would have to develop a labeling schema and guidelines for that schema, then recruit and manage human labelers to apply each label.
Projects like these happened infrequently in the enterprise. This kind of labeling typically requires highly skilled (and expensive) annotators who don’t enjoy the labeling process.
The classical state of AI data development
AI data development has existed longer than the title of data scientist. Whatever you call them, the technicians who extract insights from data have always done some or all of the following:
- Removing rows with missing data. Most data sets include some rows with missing data due to factors ranging from entry error to data corruption.
- In-filling missing values. Data scientists can use a variety of deterministic or probabilistic methods to fill cells that have been purged or left empty.
- Managing outliers. High-leverage outliers can significantly reduce model performance. Data scientists often identify and remove them.
- Generating binary columns. Categorical or textual columns may include useful information that must be made numerical before a model can use it. Often this means one-hot encoding them into binary columns.
- Feature selection. Often, columns in a dataset mirror or closely correlate with each other. Removing redundant columns tends to improve performance.
- Deduplication. Data sets often include duplicate records. Leaving them in a model’s training set can distort performance.
- Class balancing. Many data science problems seek to identify low-frequency events. Rebalancing a data set to make the low-frequency events appear more frequent during training encourages models to pay more attention to them.
All of these approaches focus on structured data, but the core principles remain valid today. Data requires iteration and development to get the best possible performance out of a wide variety of models, whether that be a simple linear regression or something as complicated as fine-tuning Llama 3.1 405B.
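For concreteness, here is a minimal sketch of several of these steps in pandas; the customer-churn table, file name, and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical customer-churn table; the file and column names are illustrative only.
df = pd.read_csv("customers.csv")

# Deduplication: remove exact duplicate records.
df = df.drop_duplicates()

# Remove rows missing the target; in-fill a numeric column with its median.
df = df.dropna(subset=["churned"])
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Manage outliers: clip monthly spend to its 1st-99th percentile range.
low, high = df["monthly_spend"].quantile([0.01, 0.99])
df["monthly_spend"] = df["monthly_spend"].clip(low, high)

# Generate binary columns: one-hot encode a categorical column.
df = pd.get_dummies(df, columns=["plan_type"])

# Class balancing: upsample the rare positive class to match the negatives.
positives = df[df["churned"] == 1]
negatives = df[df["churned"] == 0]
balanced = pd.concat(
    [negatives, positives.sample(n=len(negatives), replace=True, random_state=0)]
)
```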
AI data development in the modern age
Enterprise data scientists in the classical state of AI data development created tools and techniques to improve models’ ability to learn from business data. But these data scientists had minimal input into what the original data looked like. Some forward-looking companies consulted data science teams on what data they should collect and how, but they were the exception.
For many years, this worked. Businesses had a wealth of high-value challenges they could solve with existing data sources. Now, data scientists increasingly face tasks that require looking beyond their company’s structured data.
As the challenges facing data scientists have increased in sophistication, so have the tools available to them; data scientists may now help shape labeling schemas and the way those labels get applied.
AI data development with crowd labelers
The internet has long enabled crowd-labeling through forums and social media. Amazon Web Services formalized crowd-sourcing capabilities when it introduced its Mechanical Turk service in 2005. Since then, crowd-labeling enterprises, tools, and infrastructure have increased in sophistication and availability.
Now, data science teams can craft the schemas they need to solve challenges important to their business and work with companies dedicated to crowd labeling. These firms help data science teams develop their schemas and labeling guidelines, distribute tasks among workers, overlay multiple workers’ chosen labels to reduce error rates, and iterate on their approach as difficulties arise.
However, the scope of this approach remains limited for the following reasons:
- Lack of specific expertise. Crowd-labeling platforms recruit labelers with knowledge of specific topics, but the highest-value models often require labels applied by highly specialized professionals. A crowd-labeling platform may provide workers with generalized medical knowledge, but few will have the training required to spot warning signs for lung cancer in electronic medical records.
- Privacy. Businesses in highly regulated industries often cannot legally share their data with anyone. Other firms also have reasons to keep their data in-house—for example, their terms of service may promise to keep customer data private.
- Rigidity. If a business needs to add or remove a label from its schema, crowd labelers must start again from zero.
AI data development with subject matter experts
To develop novel data for a novel AI system, data scientists can simply ask experts to label the data. This approach has yielded some powerful and useful models—particularly in the medical field—but it often proves to be a non-starter.
Organizations have a limited number of experts capable of labeling records for high-value applications. Those experts typically work on tasks with obvious and immediate value. Data scientists who attempt to recruit experts to hand-label data can struggle to convince management that the model will be worth the experts’ time.
Additionally, while expert labeling solves the privacy and expertise challenges presented by crowd labelers, it does not solve the rigidity problem. Prior to working with Snorkel, one of Snorkel’s banking customers assigned in-house experts to classify contracts into eight categories. They cumulatively spent more than a month on the task before project leadership realized the model called for more than twenty classes—rendering the initial labeling sprint worthless.
AI data development with programmatic labeling
Programmatic labeling on the Snorkel Flow AI data development platform equips data science teams with scalable tools to amplify SMEs’ labeling impact.
Data scientists work with SMEs to create labeling functions that scale to cover the entire data set. In some cases, these functions apply simple rules, such as assigning the “employment” label to any contract that includes “employment” in the title. The framework, however, extends to cover powerful and complex tools, such as checking external ontologies or querying large language models (LLMs).
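Snorkel Flow provides its own interfaces for this; purely as an illustrative sketch, here is what the same idea looks like with the open-source Snorkel library’s labeling-function pattern. The label values and the contract fields (title, text) are assumptions.

```python
from snorkel.labeling import labeling_function

# Integer class labels for a hypothetical contract-classification task.
ABSTAIN, EMPLOYMENT, LEASE = -1, 0, 1

@labeling_function()
def lf_employment_in_title(x):
    # Simple rule: "employment" in the title suggests an employment contract.
    return EMPLOYMENT if "employment" in x.title.lower() else ABSTAIN

@labeling_function()
def lf_lease_keywords(x):
    # Keyword heuristic: landlord/tenant language suggests a lease agreement.
    return LEASE if "landlord" in x.text.lower() or "tenant" in x.text.lower() else ABSTAIN
```

More sophisticated labeling functions follow the same pattern, swapping the keyword check for an ontology lookup or an LLM call.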
In the abstract, this can look like a rule-based system, but programmatic labeling yields much more robust and flexible results.
After composing a handful of labeling functions, the data scientist trains a quick, intermediate model to investigate the labels they’ve generated. They use analysis tools to see where the model performs poorly on the validation set and why, then develop new labeling functions to correct those errors.
As data scientists develop labeling functions, some will conflict, often deliberately. For example, imagine a spam-detection model that incorrectly flags legitimate emails from banks as suspect. The data scientist examines the error and concludes that it originates from a previous labeling function, which declared all emails that referred to wire transfers as spam.
To correct this issue, the data scientist creates another labeling function: all emails from verified bank email addresses should be labeled as “not spam.”
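Sticking with that spam example, here is a hedged sketch (again using the open-source Snorkel library rather than Snorkel Flow itself) of two deliberately conflicting labeling functions and the label model that reconciles their votes; the email fields, toy data, and verified-domain list are all assumptions.

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1
VERIFIED_BANK_DOMAINS = {"bank.example.com"}  # assumed allow-list of bank domains

@labeling_function()
def lf_wire_transfer(x):
    # Earlier, overly broad rule: any mention of wire transfers looks like spam.
    return SPAM if "wire transfer" in x.body.lower() else ABSTAIN

@labeling_function()
def lf_verified_bank_sender(x):
    # Corrective rule: mail from a verified bank address is not spam.
    return NOT_SPAM if x.sender.split("@")[-1] in VERIFIED_BANK_DOMAINS else ABSTAIN

# Toy unlabeled emails; the rules conflict on bank emails that mention wire transfers.
df_train = pd.DataFrame({
    "sender": ["alerts@bank.example.com", "win@prizes.example",
               "news@shop.example", "billing@bank.example.com"],
    "body": ["Your wire transfer has posted.", "Claim your free wire transfer now!",
             "Weekly deals inside.", "Statement ready; no wire transfer activity."],
})

# Apply the labeling functions, then let the label model weigh their conflicting votes.
L_train = PandasLFApplier(lfs=[lf_wire_transfer, lf_verified_bank_sender]).apply(df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=0)
train_labels = label_model.predict(L_train)
```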
This cycle of examining shortcomings and instituting corrections continues until the quick model nears or exceeds the previously established deployment threshold. Then, the data scientist chooses an appropriate final model architecture, trains the model, and exports it for deployment.
AI data development and GenAI
LLMs can be a great help in programmatically labeling data. While far from perfect, off-the-shelf LLMs and cleverly crafted prompts can often build a solid baseline for categorical labeling functions. On Snorkel Flow, users frequently start a new project by building a labeling function that prompts an LLM for an initial set of labels that capture the application’s “low-hanging fruit.” Then, they work with experts to build additional labeling functions to correct where the LLM goes astray.
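A rough sketch of that pattern appears below, assuming a hypothetical query_llm helper in place of whatever LLM client a team actually uses; the prompt wording and label parsing are illustrative only.

```python
from snorkel.labeling import labeling_function

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around the team's LLM provider; returns the model's text reply."""
    raise NotImplementedError("plug in your LLM client here")

@labeling_function()
def lf_llm_first_pass(x):
    # Ask an off-the-shelf LLM for a first-pass vote and map its answer to a class.
    reply = query_llm(
        "Answer with exactly one word, SPAM or NOT_SPAM.\n\n"
        f"Is this email spam?\n\n{x.body}"
    )
    answer = reply.strip().upper()
    if "NOT_SPAM" in answer:
        return NOT_SPAM
    if "SPAM" in answer:
        return SPAM
    return ABSTAIN
```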
Data science teams charged with building specialized Generative AI (GenAI) models face a different set of challenges. Their toolbox includes many of the same operations required for traditional models, such as deduplication, removing outliers, and balancing classes. It also includes additional processes, such as data augmentation.
Regardless of the specific process, the nature of the data required for GenAI complicates AI data development exercises.
Data filtering
Curating the documents used to train generative models presents a meaningful challenge. A model trained on posts from internet cesspools will behave very differently from one trained on modern textbooks. Drawing the line between good and bad means making tricky choices, as does enforcing that line.
Ensuring your training set has the right documents isn’t enough. Data scientists want the right balance of documents within the training set, and they want each document to appear only once; duplicated documents can distort GenAI models.
For a simple example, consider “Green Eggs and Ham.” The book has a place in the pre-training corpus of a large language model; its patterns and structure will help the model understand how to create content that sounds like children’s stories. Now imagine some accident caused duplicate copies of “Green Eggs and Ham” to make up 5% of the LLM’s pre-training corpus.
The resulting model might tend to generate unnecessarily repetitive responses and insist that both eggs and ham are green.
Avoiding this problem requires scalable solutions. Humans can easily identify two duplicate images or documents, but the size of a GenAI pre-training corpus (BloombergGPT used more than 700 billion pre-training tokens) puts this beyond the capability of human labelers—even a large crowd of them.
Instead, data scientists use vectorizations, distance metrics, and clever mathematical shortcuts to spot documents that are too similar to coexist.
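One common sketch of this idea vectorizes documents with TF-IDF and flags pairs whose cosine similarity crosses a threshold; at pre-training scale, approximate techniques such as MinHash or locality-sensitive hashing stand in for the exact pairwise comparison shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "I do not like green eggs and ham.",
    "I do not like green eggs and ham!",   # near-duplicate of the first document
    "A balance sheet summarizes assets and liabilities.",
]

# Vectorize the documents and compute pairwise cosine similarities.
tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# Flag pairs above a similarity threshold as candidate duplicates to drop.
THRESHOLD = 0.9
duplicates = [
    (i, j)
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
    if sims[i, j] > THRESHOLD
]
print(duplicates)  # [(0, 1)]
```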
Curating data for fine-tuning
Generative models require a second phase of training called fine-tuning. In this phase, the model learns how to follow instructions. Data scientists can use additional rounds of fine-tuning to reorient foundation models around their organizations’ targeted industry or tasks. Regardless, the quality of fine-tuning data meaningfully impacts the performance of the final model.
In a demonstration of this, Snorkel researchers created an updated version of the open-source RedPajama model. The original model used Meta’s LLaMA architecture and a close approximation of the LLaMA pre-training corpus. The researchers did not have access to Meta’s fine-tuning corpus. In its place, they used two human-generated open-source data sets with about 20,000 prompt/response pairs between them.
Snorkel’s researchers curated those 20,000 pairs down to the best 10,000 and rebalanced the data set to increase the share of prompts from previously under-represented classes. Then, they fine-tuned their version of the model. In a double-blind study, human testers preferred the new version in every category.
The Snorkel researchers could have used crowd labelers to classify prompts according to categories and responses as high or low quality. Instead, they used programmatic approaches and developed assistive models to label the corpus. Using Snorkel Flow, the process took the pair of researchers about a day. They spent more time verifying the impact of their curation efforts than they did performing them.
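As a stylized illustration of that kind of programmatic curation (not the researchers’ actual pipeline), a team might score every pair with an assistive quality model and keep only the top-scoring half; the quality_score helper and file name below are placeholders.

```python
import pandas as pd

def quality_score(prompt: str, response: str) -> float:
    """Hypothetical assistive model that scores a pair's quality between 0 and 1."""
    raise NotImplementedError("plug in a trained quality classifier here")

# Assumed file of prompt/response pairs, one JSON object per line.
pairs = pd.read_json("instruction_pairs.jsonl", lines=True)
pairs["score"] = [quality_score(p, r) for p, r in zip(pairs["prompt"], pairs["response"])]

# Keep only the best-scoring half of the corpus for fine-tuning.
curated = pairs.nlargest(len(pairs) // 2, "score")
```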
Class balancing for GenAI
Class balancing data for GenAI poses unique challenges. Not only do you want to include more examples of classes that are more important to your final use case (as you would with a traditional model), but you also want to de-emphasize classes on which the foundation model already performs well.
This assumes that data scientists have classes for their training documents, which they rarely do; by the nature of generative tasks, the training data generally comes unstructured.
In the RedPajama case study, our researchers built a model to classify prompt and response pairs into six categories according to those present in the Dolly 2.0 dataset. This approach effectively allowed the team to spread the Dolly labels onto records from the OpenAssistant dataset.
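Once every pair carries a predicted category, the rebalancing step itself can be simple. This pandas sketch assumes a category column produced by such a classifier, invented category counts, and an arbitrary per-class cap.

```python
import pandas as pd

# Assumed: prompt/response pairs already tagged by an assistive classifier.
pairs = pd.DataFrame({
    "prompt":   ["..."] * 900,
    "response": ["..."] * 900,
    "category": ["open_qa"] * 600 + ["summarization"] * 250 + ["information_extraction"] * 50,
})

# Cap over-represented categories so under-represented ones make up a larger share.
PER_CLASS_CAP = 200
balanced = (
    pairs.groupby("category", group_keys=False)
         .apply(lambda g: g.sample(n=min(len(g), PER_CLASS_CAP), random_state=0))
)
print(balanced["category"].value_counts())
```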
Data augmentation
While data scientists sometimes use data augmentation for tabular data, the practice takes on increased importance when working with unstructured data, which can be difficult to source.
For a simple example, imagine a model aimed at identifying the faces of members of an elected body. After initial data collection and cleaning, the project’s engineers may only have half a dozen images of some officials—which is not enough to train a model. However, by stretching, skewing, cropping, flipping, and rotating the original images, data scientists can reach the training data threshold necessary to achieve reliable predictions.
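A minimal sketch of that kind of image augmentation using torchvision transforms follows; the library choice, file name, and parameters are illustrative rather than prescriptive.

```python
from PIL import Image
from torchvision import transforms

# Randomized crops, flips, rotations, and color shifts turn one photo into many training variants.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

original = Image.open("official_portrait.jpg").convert("RGB")
variants = [augment(original) for _ in range(20)]  # twenty augmented views of one image
```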
Data augmentation can take several forms, but here are two examples:
- Synthetic text data. Snorkel worked with a client who wanted to build a generative model for use in end-user applications. However, the client promised in its terms of service not to use end-user data in model training. Snorkel engineers helped generate synthetic texts that mirrored real end-user data, enabling the client to train the model.
- Recontextualizing data. Engineers at Snorkel helped a client build re-contextualized images of products. This extended beyond the image augmentation described above to include images of the products in real-world situations—such as a synthetic image of a rug sold by the client dangling from a washing machine.
These data augmentation techniques help make the final model more robust.
AI data development: the most efficient way to improve models
As enterprises and data scientists move on to new, higher-value challenges and into the realm of foundation models and GenAI, AI data development—and more aggressive AI data development—has become the most efficient way to improve model performance. While hyperparameter tuning and nudges to training processes remain important for cutting-edge research, AutoML approaches and foundation model fine-tuning APIs have largely obviated them for practical application development.
In those realms, the only practical way to boost model performance is through higher-quality data at a model-appropriate scale.