The founding team of Snorkel AI has spent over half a decade, first at the Stanford AI Lab and now at Snorkel AI, researching data-centric techniques to overcome the biggest bottleneck in AI: the lack of labeled training data. In this video, Snorkel AI co-founder Paroma Varma gives an overview of data-centric AI and the key principles of data-centric AI development.
What is data-centric AI?
Machine learning by definition is about data and always has been, but only recently, with the development of powerful push-button models, have data science teams shifted their focus to the data. This practice, known as data-centric AI, is all about iterating and collaborating on the data used to build AI systems, and doing so programmatically.
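To make "programmatically" concrete, here is a minimal sketch of programmatic labeling in plain Python. It is not Snorkel's API: the labeling functions, their keywords, and the toy spam/ham task are all hypothetical, and the votes are combined with a simple majority vote rather than a learned label model.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for a toy spam task; the keywords are
# illustrative assumptions, not rules taken from the article.
def lf_contains_prize(text):
    return "spam" if "prize" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_prize, lf_contains_meeting, lf_many_exclamations]

def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine votes by simple majority; ties and all-abstain yield ABSTAIN."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # top two labels are tied, so do not guess
    return ranked[0][0]

unlabeled = [
    "You won a prize!!! Claim now",
    "Agenda for Monday's meeting",
    "Lunch?",
]
labels = [majority_vote(t) for t in unlabeled]
print(labels)
```

Because the labels come from code, improving the training set means editing or adding a function and re-running it over all the data, instead of relabeling examples one by one.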
But what is the reason for the industry’s increasing focus on data-centricity? A good way to answer that is to start by contrasting data-centric AI with what has been the focus of machine-learning development for many years: model-centric AI.
Traditionally, data science and machine learning teams have focused on model development by iterating on things like feature engineering, algorithm design, and bespoke model architecture. They treat the data as a static artifact and the bulk of the team’s focus is on the model itself.
But as models have become more sophisticated and push-button, AI teams are quickly realizing that iterating on the data is as crucial, if not more so, for rapidly developing and deploying high-accuracy models.
Today, machine learning models have simultaneously grown more complex and opaque, and they require much higher volumes of training data. In fact, data has become a practical interface used to collaborate with subject matter experts and turn their knowledge into software. Ultimately, data-centric AI unlocks a higher degree of model accuracy than was possible using model-centric approaches alone.
Data-centric AI vs model-centric AI
The tectonic shift to a data-centric approach is as much a shift in focus of the machine-learning community and culture as a technological or methodological shift—“data-centric” in this sense means you are now spending time on labeling, managing, slicing, augmenting, and curating the data efficiently, with the model itself relatively more fixed.
It is also important to stress that this is not an either/or binary between data-centric and model-centric approaches. Successful AI requires both well-conceived models and good data.
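As one illustration of the "slicing" activity named above, the following sketch holds the model fixed and reports accuracy per data slice, so that effort goes to the subset of data where the model is weakest. The records, slice names, and word-count thresholds are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical evaluation records: (text, gold label, prediction of a fixed model).
records = [
    ("short note", "ham", "ham"),
    ("win cash", "spam", "ham"),
    ("a much longer message about the quarterly planning meeting", "ham", "ham"),
    ("a much longer message promising a free prize to every reader", "spam", "spam"),
]

# Slices are named predicates over examples; these two are illustrative.
SLICES = {
    "short_text": lambda text: len(text.split()) <= 3,
    "long_text": lambda text: len(text.split()) > 3,
}

def accuracy_by_slice(records, slices=SLICES):
    """Return accuracy for every slice that contains at least one record."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text, gold, pred in records:
        for name, in_slice in slices.items():
            if in_slice(text):
                totals[name] += 1
                hits[name] += int(gold == pred)
    return {name: hits[name] / totals[name] for name in totals}

report = accuracy_by_slice(records)
print(report)
```

A low score on one slice suggests where to label, augment, or curate more data before touching the model.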
Key principles of data-centric AI
Benefits of data-centric AI
Use cases of data-centric AI
More recently, the industry has exhibited a major shift toward much more powerful, automated, but also data-hungry machine-learning models. Rather than, say, thousands of free parameters that need to be learned from your data, there are sometimes hundreds of millions. So, despite their power and utility, these models need a great deal more labeled training data to reach their peak level of performance.
A data-centric AI approach has been applied successfully to many types of machine learning applications, including classification of and extraction from text, PDFs, HTML, images, time series, and more.
Data-centric AI as implemented by Snorkel AI is particularly useful for situations where:
- Training data is complex and requires subject matter experts to label.
- Production data or business objectives change frequently, requiring models to adapt via retraining.
- Training data is private and outsourcing the labeling task is not an option.
Snorkel Flow: A data-centric AI application development platform
Other Resources
Key research on data-centric AI
- Data Programming: Creating Large Training Sets, Quickly, Alex Ratner 2016 https://arxiv.org/pdf/1605.07723.pdf
Further reading on data-centric AI
Dive deeper into data-centric AI
Watch the complete recording of The Future of Data-centric AI, a full-day virtual event that brought together experts on data-centric AI from academia, research, and industry to explore the shift from a model-centric practice to a data-centric approach to building AI. Watch the on-demand event to discover solutions and ideas to make AI practical, both now and in the future.