In 2007, Google researchers published a paper on a class of statistical language models they dubbed “large language models”, which they reported achieved a new state of the art in machine translation. They used a very standard model and a smoothing scheme so simple they named it “Stupid Backoff”1. The key differentiator? They trained it on 100x the amount of data.
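Stupid Backoff is easy to state: score a word by its relative frequency given the longest matching context, and if the full n-gram was never observed, back off to a shorter context with a fixed discount (0.4 in the paper). A minimal sketch in Python, illustrative only and not the paper's distributed implementation:

```python
from collections import Counter

def train_ngrams(tokens, max_n=3):
    # Count all n-grams up to order max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, alpha=0.4):
    # Score S(word | context): relative frequency if the full n-gram
    # was seen, otherwise back off to a shorter context, discounted by alpha.
    ngram = context + (word,)
    if counts[ngram] > 0 and counts[context] > 0:
        return counts[ngram] / counts[context]
    if context:
        return alpha * stupid_backoff(counts, context[1:], word, alpha)
    # Base case: unigram relative frequency.
    total = sum(c for g, c in counts.items() if len(g) == 1)
    return counts[(word,)] / total
```

Note the scores are not normalized probabilities; the paper's point was that at web scale this crude scheme matched far more sophisticated smoothing methods.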

Better data almost always has a greater impact than fancier models or algorithms in AI—and yet, data development has always been undersupported by AI formalisms and technology. For every model development step in the modern journey of building AI applications, there is a critical but often underappreciated data development step, where the data that actually informs the model is selected, labeled, cleaned, shaped, and curated. These data-centric operations can greatly affect end outcomes, as recent results with today’s large language or “foundation” models show again2. And yet, they are often treated as “janitorial” work and handled in manual, ad hoc, painfully inefficient ways. In practice, data-centric operations have become one of the largest blockers to AI value realization.

The Snorkel team’s mission over the last eight-plus years has been to make AI data development first-class and programmatic so that it can look more like an afternoon of software development than months of outsourced manual annotation. With Snorkel Flow, we’ve transformed labeling training sets from an ad hoc manual process into a programmatic one, accelerating time to value by 10-100x+ and leading to better model accuracies for our enterprise customers, including five of the top 10 US banks, government agencies, and more.

Today, I’m excited to announce our next step in this journey with Snorkel’s Foundation Model Data Platform, which supports the broader set of data-centric operations involved in developing modern foundation models (FMs)–from sampling, filtering, and curating the datasets for domain-specific pre-training to authoring and cleaning the instruction-tuning datasets for generative AI alignment–making them first-class and programmatic. Our goal is to enable every enterprise to build AI that works for their unique data and use cases and to turn this into a powerful AI moat for their business.

Achieving real enterprise outcomes with GPT-You, not GPT-X

Modern large language or “foundation” models have become incredibly powerful over the last several years due to a combination of data and compute scaling and “deep” model architectures. The leap forward has been nothing short of astonishing—even for those of us who have been building AI systems for many years.

However, as many are beginning to realize at this point in the hype cycle, these models do not solve all problems out of the box and often need significant customization3. These models are not magic; to start, they have only been trained to sound statistically plausible given a prompt—not to be accurate, unbiased, and truthful on specific mission-critical tasks across unique datasets and domains.

This is especially true in enterprise settings where data and use cases are often very different from the web data these models were trained on and where high, robust accuracy is required for production deployment (e.g., no human copilot will be present, and critical outcomes are affected). Modern foundation models can be considered impressive generalists, like a star undergraduate just out of college; most enterprise use cases need highly accurate specialists, trained and onboarded for specific scenarios.

Today, the key to accomplishing this is data: it’s what turned GPT-3 into ChatGPT and will turn a base foundation model into a “GPT-You” that is customized and optimized for your unique data and use cases4. As models have gotten larger, directly modifying them has become increasingly impractical; e.g., how are you going to tweak the architecture of a 500B+ parameter model to address an error mode? Instead: it’s all about developing the data these models train on for specific use cases. For every traditional “model-centric” step (e.g., pre-training, instruction-tuning, fine-tuning), there is a critical corresponding data-centric step, where the data that actually informs the model is selected, shaped, curated, labeled, and, more broadly, developed.

Data-centric and training operations for developing and aligning large language and foundation models

These data-centric development steps are some of the most critical—and increasingly, only—interfaces to developing AI that works on real production use cases. However, they are rarely supported in first-class, systematic ways. They are often relegated to manual, ad hoc, or outsourced processes that are slow, error-prone, and infeasible for many organizations.

Moreover, as model architectures, algorithms, and public datasets rapidly commoditize, enterprises’ private data and knowledge becomes one of AI’s most important sources of differentiation. Developing this data for AI usage is often overlooked, but it is one of the most powerful ways to build an AI moat.

Making data-centric AI first-class and programmatic with Snorkel’s Foundation Model Data Platform

Today, we introduced our Foundation Model Data Platform, which supports the critical data-centric operations at each step of the AI development journey–from filtering and sampling the right mixture of data for pre-training to labeling data for fine-tuning–as first-class, programmatic operations.

Snorkel’s Foundation Model Data Platform

Snorkel’s Foundation Model Data Platform consists of three core solutions:

  • Snorkel Foundry: For programmatic curation and management of datasets for domain-specific pre-training of foundation models. Snorkel Foundry applies a programmatic approach to all of the critical but under-supported data-centric operations involved in curating a pre-training dataset and managing, monitoring, and adapting it over time: identifying and sampling the right mixture of data sources and types; filtering out low-quality and less relevant data points; cleaning and deduplicating; weakly supervising auxiliary tasks; and more.
  • Snorkel GenFlow: For programmatic curation, annotation, and management of instruction datasets for generative AI use cases (e.g., summarization, chat, Q&A, etc.). Snorkel GenFlow applies a programmatic approach to developing and managing the datasets required by instruction-tuning methods like RLHF, including routing, authoring, and reviewing workflows for dataset creation; programmatic sampling and filtering for optimal tuning of instruction dataset distribution and quality; and more.
  • Snorkel Flow: For programmatic labeling of predictive AI use cases (e.g., classification, tagging, extraction). With Snorkel Flow, enterprises can use their expert knowledge, organizational resources, and FM/LLMs as supervision sources to programmatically label large amounts of high-quality training data to train models or fine-tune FM/LLMs. Snorkel Flow is used by customers including five of the top 10 US banks, healthcare providers like Memorial Sloan Kettering, and other Fortune 500 companies and government agencies to label data and train or fine-tune models 10-100x+ faster, using our unique programmatic approach to data labeling and development.
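The core idea behind programmatic labeling can be sketched in a few lines: domain heuristics are written as labeling functions that vote or abstain on each example, and their noisy votes are aggregated into training labels (in practice a learned label model does the aggregation; a simple majority vote stands in here). A minimal sketch with hypothetical heuristics for a support-ticket classification task—this is illustrative only, not the Snorkel Flow API:

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical labeling functions: each encodes one heuristic as code
# and abstains when it has no opinion.
def lf_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_broken(text):
    return POSITIVE if "broken" in text.lower() else ABSTAIN

def label(texts, lfs):
    # Aggregate non-abstaining LF votes per example by majority vote,
    # standing in for a learned label model.
    labels = []
    for text in texts:
        votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else ABSTAIN)
    return labels
```

Because the heuristics are code, relabeling an entire corpus after a schema or policy change is a matter of editing a function and re-running, rather than redoing manual annotation.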

With the Foundation Model Data Platform, our goal is to make data-centric AI development less like manual, ad hoc work and more like software development so that every organization can develop and maintain a “GPT-You” that works on their enterprise-specific data and use cases.

Building for real AI value with a data-first approach

While a small number of loud voices proclaim apocalyptic futures, most practitioners today are stuck on the challenges of getting AI models to achieve production accuracy, avoid hallucinations and biases, and work reliably in real, high-value settings. The key to solving these challenges comes down to what AI has always come down to: getting the data right.

Our view is that the future of AI will not look like a “one model to rule them all” autocracy but rather an ecosystem of custom models as rich and diverse as the datasets, tasks, and settings they are specialized for. This is a necessity if we want to move from our current peak in the hype cycle to real production value. It’s also a historic opportunity for every enterprise to build a durable AI moat using their own data and knowledge.

“Data is the big disrupter… and differentiator. I believe that programmatic will become an essential tool if we want to make a difference to our business at scale. Before Snorkel, we needed humans to go through 40 million products and try to label things. Now, with programmatic, we can say, ‘OK, let’s re-label our catalog because there are items that we already have in this new style that people are talking about’. So, that is a true value unlock from a business point of view.”

Tulia Plumettaz, Director of Machine Learning, Wayfair – Diginomica

“Snorkel AI’s new foundation model platform has the potential to significantly enhance how Azure customers build, fine-tune, and apply large language models across their business. This could fundamentally shift the current paradigm, making AI more accessible and customizable for every enterprise, regardless of size or industry.”

John Montgomery, Corporate Vice President, Program Management, AI Platform at Microsoft5

We are excited to be building our Foundation Model Data Platform in collaboration with early customers such as Wayfair and cloud partners such as Microsoft to enable every enterprise to build its own AI success story using the power of its unique data, knowledge, and objectives.

If you are interested in accelerating the data backbone of your AI strategy with Snorkel’s Foundation Model Data Platform, please connect with our team here.


(1) Brants et al. 2007, “Large Language Models in Machine Translation” 

(2) Gadre et al. 2023, “DataComp: In search of the next generation of multimodal datasets”; Hoffmann et al. 2022, “Training Compute-Optimal Large Language Models”; Taori et al. 2023, “Alpaca: A Strong, Replicable Instruction-Following Model”; Geng et al. 2023, “Koala: A Dialogue Model for Academic Research”; Chiang et al. 2023, “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality”

(3) Kocoń et al. 2023, “ChatGPT: Jack of all trades, master of none”; Pikuliak 2023, “ChatGPT Survey: Performance on NLP datasets”

(4) Ouyang et al. 2022, “Training large language models to follow instructions with human feedback”

(5) Snorkel AI News