Snorkel AI hosted the 2023 installment of its Future of Data-Centric AI virtual conference in June. The two-day event brought together researchers, practitioners, and industry leaders to discuss the latest trends and advances in data-centric AI, and we recorded each session as a video.

We have now made all of the videos from the event publicly available on YouTube. You can find all of them collected in a playlist here, or scroll down for an embedded version of each, along with a description.

We hope that you will enjoy watching the videos and learning more about the impact of LLMs on the world.

Closing Keynote: LLMOps: Making LLM Applications Production-Grade

Large language models are fluent text generators, but they struggle to generate factual, correct content. How can we convert these capabilities into reliable, production-grade applications? In this talk, Matei Zaharia, Co-Founder and Chief Technologist at Databricks, covers several techniques for doing this, based on his work and experience at Stanford and Databricks. On the research side, he and his team have been developing programming frameworks such as Demonstrate-Search-Predict (DSP) that reliably connect an LLM to factual information and automatically improve the app’s performance over time. On the industry side, Databricks has been building a stack of simple yet powerful tools for “LLMOps” into the MLflow open-source framework.
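
As a rough illustration of the retrieve-then-generate pattern that frameworks like DSP build on, here is a minimal sketch; it is not DSP’s actual API, and `call_llm` is a hypothetical stand-in for any LLM client.

```python
# A minimal retrieve-then-generate sketch (NOT DSP's actual API).
# `call_llm` is a hypothetical stand-in for any LLM completion call.

CORPUS = [
    "MLflow is an open-source platform for managing the ML lifecycle.",
    "Databricks was founded by the original creators of Apache Spark.",
    "Demonstrate-Search-Predict composes retrieval and LLM calls into pipelines.",
]

def search(query: str, k: int = 2) -> list[str]:
    # Toy keyword-overlap retriever; a real system would use BM25 or a
    # dense vector index.
    def overlap(passage: str) -> int:
        return len(set(query.lower().split()) & set(passage.lower().split()))
    return sorted(CORPUS, key=overlap, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer(question: str) -> str:
    # Ground the answer in retrieved passages instead of relying on the
    # model's parametric memory alone.
    context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(search(question)))
    prompt = (
        "Using only the passages below, answer the question and cite the "
        f"passage numbers you used.\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```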

Data-Driven Government: A Fireside Chat with the Former U.S. Chief Data Scientist

In this fireside chat, Snorkel AI CEO and co-founder Alex Ratner and DJ Patil, the former U.S. Chief Data Scientist, dive into data science’s history, impact, and challenges in the United States government. Discover the strategies used to drive data-driven decisions within the complex governmental landscape, and gain valuable perspectives on the future of AI/ML, the ethical considerations in data science, and the transformative potential of leveraging data to better society.

Lessons From a Year With Snorkel: Data-Centric Workflows with SMEs at Georgetown University

When the Georgetown University Center for Security and Emerging Technology began experimenting with Snorkel, it had two high-level goals. The center aimed to address recurring bottlenecks in its ML projects and to improve collaborative workflows between data scientists and subject-matter experts. In this presentation, the center’s NLP Engineer James Dunham shares takeaways from the half-dozen project teams that used Snorkel in the past year. He identifies friction points in adoption, summarizes feedback from SMEs, and discusses which challenges Snorkel has helped his team address, and which ones remain.

Panel Discussion: The Linux Moment of AI: Open Sourced AI Stack

In this panel, seasoned experts Julien Simon (Hugging Face), Ed Shee (Seldon), and Travis Addair (Predibase) delve into how open-source models and tools can revolutionize AI. Julien sheds light on projects like BigScience and explores how open-source projects can lead to a more adaptable AI stack, empowering developers to create use-case-specific solutions. With his vast experience in deploying and monitoring AI systems, Ed discusses how open source aids these processes, along with the challenges and potential solutions when scaling these systems. Meanwhile, Travis shares insights from his work, demonstrating how open-source innovation fosters faster, easier, and more collaborative development. As we witness the evolution of applications like ChatGPT, our panelists discuss the open-source community’s crucial role in steering future developments and ensuring the ethical and responsible use of such technologies.

Leveraging Foundation Models and LLMs for Enterprise-Grade NLP

In recent years, large language models (LLMs) have shown tremendous potential in solving natural language processing (NLP) problems. However, deploying LLMs in the enterprise comes with its own set of challenges, especially when it comes to adapting the models to customer-specific data and incorporating domain knowledge. In this talk, Kristina Liapchin, Lead Product Manager for Snorkel, explores how Snorkel AI can help address these challenges and enable businesses to leverage LLMs to extract insights from text data. She walks through how Snorkel Flow can enable businesses to drive value from LLMs today, making the most of enterprise-grade NLP.

Comcast SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale

End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization, since most companies lack vast human and computational resources. In this presentation, Raphael Tang, Lead Research Scientist at Comcast Applied AI, explores training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, he and his team use a third-party ASR system as a weak supervision source, supplemented with Snorkel labeling functions derived from implicit user feedback. To accelerate inference, they route production-time queries across a pool of CUDA graphs captured for varying input lengths, choosing the pool so its length distribution best matches the traffic’s. Compared to a third-party ASR system, they achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Their system, called SpeechNet, currently serves 12 million queries per day on Comcast’s voice-enabled smart televisions.
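
The routing idea can be sketched in a few lines; the bucket sizes below are assumed, not Comcast’s, and plain callables stand in for captured CUDA graphs.

```python
import bisect

# Simplified sketch of length-bucketed query routing (assumed details,
# not Comcast's implementation): capture one CUDA graph per padded
# input length, then send each query to the smallest bucket that fits.

BUCKET_LENGTHS = [200, 400, 800, 1600]  # hypothetical frame counts

# graphs[n] would be a CUDA graph captured for inputs padded to n
# frames; plain callables stand in here.
graphs = {n: (lambda n=n: f"replay graph padded to {n}") for n in BUCKET_LENGTHS}

def route(num_frames: int) -> str:
    # Smallest pre-captured graph whose padded length fits the query;
    # fall back to the largest bucket for oversized inputs.
    i = bisect.bisect_left(BUCKET_LENGTHS, num_frames)
    n = BUCKET_LENGTHS[min(i, len(BUCKET_LENGTHS) - 1)]
    return graphs[n]()

print(route(350))  # -> replay graph padded to 400
```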

Leveraging Data-Centric AI for Document Intelligence and PDF Extraction

Extracting entities from semi-structured documents is often a challenging task, requiring complex and time-consuming manual processes. In this session, Snorkel AI ML Engineer Ashwini Ramamoorthy explores how data-centric AI can be leveraged to simplify and streamline this process. She starts by discussing the challenges associated with extracting entities from PDFs and other semi-structured documents, then explores how those challenges can be overcome using Snorkel’s data-centric approach. Finally, she dives into how foundation models can be utilized to further accelerate the development of these extraction models.

Applying Weak Supervision and Foundation Models for Computer Vision

In this session, Snorkel’s own ML Research Scientist Ravi Teja Mullapudi explores the latest advancements in computer vision that enable data-centric image-classification model development. He showcases how visual prompts and fast, parameter-efficient models built on top of foundation models provide immediate feedback for rapid iteration on data quality and model performance, resulting in significant time savings and performance improvements. Moreover, he delves into the importance of adapting model representations via large-scale fine-tuning on weakly labeled data to address the limitations of fast but small models trained on fixed features.

Finally, he discusses the necessary scaling and model adaptations needed to transition from image-level classification to object-level detection and segmentation. Overall, Ravi aims to provide insights into how computer vision data and models can be effectively improved in tandem and adjusted for downstream applications.
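
The fast-iteration loop he describes can be sketched as a lightweight probe on frozen foundation-model embeddings; the sketch below uses random embeddings as stand-ins for features from a model such as CLIP.

```python
# A minimal sketch of fast data-centric iteration: train a lightweight
# probe on frozen foundation-model embeddings so every labeling change
# retrains in seconds. The embeddings here are random stand-ins; in
# practice they would come from a model such as CLIP.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))   # frozen image embeddings
y = rng.integers(0, 2, size=1000)  # current (possibly weak) labels

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")

# After inspecting errors and fixing labels, refit in seconds and
# compare, rather than fine-tuning the full backbone each time.
```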

Poster Competition: JoinBoost: Tree Training with Just SQL

Data and machine learning (ML) are crucial for enterprise operations. Enterprises store data in databases for management and use ML to gain business insights. However, there is a mismatch between the way ML expects data to be organized (a single table) and the way data is organized in databases (a join graph of multiple tables). Current specialized ML libraries (e.g., LightGBM, XGBoost) necessitate data denormalization, export, and import, as they operate as separate programs incompatible with databases. This workflow not only increases operational complexity but also suffers from limited scalability, slower performance, and security risks. But what if there were a way to achieve competitive tree-training performance with just SQL?

Columbia PhD student Zachary Huang presents JoinBoost, a lightweight Python library that transforms tree training algorithms over normalized databases into pure SQL queries. Compatible with any DBMS and data stack, JoinBoost is a simplified, all-in-one data stack solution that avoids data denormalization, export, and import. JoinBoost delivers exceptional performance and scalability tailored to the capabilities of the underlying DBMS. Huang and his colleagues’ experiments reveal that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) when compared to LightGBM, and scales well beyond LightGBM in terms of features, DB size, and join graph complexity.
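
To make the idea concrete, here is a toy sketch, not JoinBoost’s code, of how a regression-tree split can be scored entirely inside the database with plain SQL aggregates; the `sales` table and its columns are invented.

```python
# Toy illustration (not JoinBoost's implementation) of the core idea:
# a regression-tree split can be scored with plain SQL aggregates, so
# training data never leaves the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (qty REAL, price REAL);
    INSERT INTO sales VALUES (1, 10), (2, 12), (8, 30), (9, 33);
""")

def split_sse(threshold: float) -> float:
    # Sum of squared errors of the two children of the split
    # "qty <= threshold", using the identity SSE = sum(y^2) - sum(y)^2/n
    # computed per side with a single GROUP BY.
    (sse,) = conn.execute("""
        SELECT SUM(s2 - s * s / n) FROM (
            SELECT COUNT(*) AS n, SUM(price) AS s, SUM(price * price) AS s2
            FROM sales
            GROUP BY qty <= :t
        )
    """, {"t": threshold}).fetchone()
    return sse

# Score candidate split points entirely inside the DBMS; the threshold
# with the lowest SSE wins.
for t in (1.0, 2.0, 8.0):
    print(t, split_sse(t))
```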

Poster Competition: Data-IQ: Characterize and Audit Your Training Data with Two Lines of Code!

High model performance, on average, can hide that models may systematically underperform on subgroups of the data. To tackle this, Nabeel Seedat, PhD student at the University of Cambridge, proposes Data-IQ: a framework to systematically stratify examples into subgroups with respect to their outcomes, allowing users to audit their tabular, image, or text data with just two lines of extra code!

Seedat and his colleagues do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). They show that Data-IQ’s characterization of examples is more robust to variation across similarly performant (yet different) models than baseline methods. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting, etc.), this property ensures consistency of data characterization while allowing flexible model selection. Taking this a step further, Seedat demonstrates that the subgroups enable new approaches to both feature acquisition and dataset selection. Furthermore, he highlights how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
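
Conceptually, the stratification reduces to a few lines; the sketch below uses illustrative names and thresholds rather than Data-IQ’s actual API, tracking each example’s confidence in the true label and its aleatoric uncertainty across training checkpoints.

```python
# Conceptual sketch of Data-IQ-style stratification (names and
# thresholds are illustrative, not the library's actual API).
import numpy as np

# probs[t, i] = model's probability of the TRUE label for example i at
# checkpoint t; random stand-ins here instead of real training traces.
rng = np.random.default_rng(0)
probs = rng.uniform(size=(10, 500))

confidence = probs.mean(axis=0)                 # average over checkpoints
aleatoric = (probs * (1 - probs)).mean(axis=0)  # mean p(1-p): data uncertainty

groups = np.where(
    aleatoric > 0.2, "Ambiguous",               # high data uncertainty
    np.where(confidence > 0.75, "Easy", "Hard"),
)
print({g: int((groups == g).sum()) for g in ("Easy", "Ambiguous", "Hard")})
```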

Combining Domain Knowledge with Data to Track and Predict Heavy-Equipment Service Events

In this talk, Davide Gerbaudo, Senior Data Scientist at Caterpillar, illustrates how a century-old company like Caterpillar combines its domain knowledge with data to develop modern analytics that provide value to the enterprise, its dealership network, and its customers. In particular, he describes how domain expertise and data are used to classify and predict repair events for heavy equipment.

Data-Driven AI for Threat Detection

Network security has been a complex area in which to apply traditional machine learning. The number of possible threats is vast, but at the same time, the number of labeled attack samples is very small. Moreover, by the time enough sample data has been collected for a particular type of threat, the threat vector has changed.

While collecting samples of true positives is difficult, security analysts usually have good mental heuristics for how threats behave. They manually “execute” these heuristics to identify threats among the massive volume of network data. Typically, the heuristics are applied after unsupervised techniques identify anomalies and outliers in the data. While this works well in practice, the approach is computationally expensive, owing to the very nature of the unsupervised algorithms, and its accuracy in the field is unpredictable.

Weak supervision provides an alternative approach to utilizing the heuristics to identify the threats. It allows us to push the heuristics to the raw data to help us build more efficient models with predictable accuracy. In this talk, Arista Distinguished Data Scientist Debabrata Dash discusses one prototype for using weak supervision in the cyber-security domain, with exciting results.
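
To make this concrete, here is a minimal sketch of encoding analyst heuristics as labeling functions with the open-source Snorkel library; the network-flow features (`distinct_ports`, `dest_port`, `internal`) and thresholds are hypothetical.

```python
# Minimal sketch: analyst heuristics as Snorkel labeling functions over
# raw network records. Feature names and thresholds are hypothetical.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, BENIGN, THREAT = -1, 0, 1

@labeling_function()
def lf_port_scan(x):
    # Heuristic: touching many distinct ports looks like a scan.
    return THREAT if x.distinct_ports > 100 else ABSTAIN

@labeling_function()
def lf_known_service(x):
    # Heuristic: internal traffic to well-known services is usually benign.
    return BENIGN if x.dest_port in (80, 443) and x.internal else ABSTAIN

df = pd.DataFrame({
    "distinct_ports": [3, 250, 5],
    "dest_port": [443, 4444, 80],
    "internal": [True, False, True],
})

# Apply the heuristics directly to the raw data, then denoise their
# votes with Snorkel's label model.
L = PandasLFApplier([lf_port_scan, lf_known_service]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, seed=123)
df["threat_prob"] = label_model.predict_proba(L)[:, THREAT]
print(df)
```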

Tackling Advanced Classification Using Snorkel Flow

In this talk, Snorkel’s Staff Product Designer Angela Fox and Director of Product/Founding Engineer Vincent Chen discuss the key challenges and approaches for productionizing classification models in the age of foundation models. To start, they highlight common but underrated challenges related to label-schema definition, high cardinality, and multi-label problem formulations. They dive into specific user experiences in Snorkel Flow to overcome these challenges, including ways to leverage foundation models, targeted error analysis, and supervision from subject-matter experts. Finally, they zoom out with a few case studies to describe how enterprise teams leverage data-centric workflows to build high-quality production models and tackle previously untenable problems in Snorkel Flow.

Fireside Chat: Journey of Data: Transforming the Enterprise with Data-Centric Workflows

Join Nurtekin Savas, Head of Enterprise Data Science at Capital One, and Snorkel CEO and Co-Founder Alex Ratner in an insightful exploration of data’s journey across an enterprise. Nurtekin unravels how data navigates and transforms within the complex enterprise stack, from its creation to the insights it ultimately yields. He spotlights the power of data-centric workflows and their crucial role in driving business decisions, improving operational efficiency, and fueling AI innovation.

The Future is Neurosymbolic

Yoav Shoham, Co-Founder of AI21 Labs, gave a wide-ranging talk on the future of large language models and generative AI. He discussed the challenges and limitations of AI language models, particularly large models like GPT-3 and GPT-4, and emphasized the importance of addressing the downsides of AI models, including fake facts and fake reasoning.

DataComp: In Search of the Next Generation of Multimodal Datasets

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, Ludwig Schmidt, Assistant Professor of Computer Science at the University of Washington, introduces DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets.

Professor Schmidt and his colleagues provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in their benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running their standardized CLIP training code and testing on 38 downstream test sets. The DataComp benchmark consists of multiple scales, which facilitate the study of scaling trends and make the benchmark accessible to researchers with varying resources.

Their baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. Schmidt and his team introduce a new dataset, DataComp-1B, and show that CLIP models trained on it outperform OpenAI’s CLIP model by 3.7 percentage points on ImageNet while using the same compute budget. Compared to LAION-5B, their data improvement corresponds to a 9x reduction in compute cost.
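
To give a flavor of what a submission involves, here is a toy filtering baseline in the spirit of the benchmark; the records, field names, and threshold below are hypothetical, and competitive entries typically combine several signals.

```python
# Toy sketch of a DataComp-style filtering baseline: keep candidate
# image-text pairs whose precomputed CLIP similarity clears a
# threshold. Records, field names, and threshold are hypothetical.
candidates = [
    {"url": "a.jpg", "caption": "a red bicycle", "clip_score": 0.31},
    {"url": "b.jpg", "caption": "click here now", "clip_score": 0.12},
    {"url": "c.jpg", "caption": "mountain lake at dawn", "clip_score": 0.27},
]

def keep(example, threshold=0.25):
    # Filter on image-text alignment; real submissions often add
    # language, image-size, and deduplication filters on top.
    return example["clip_score"] >= threshold

subset = [ex for ex in candidates if keep(ex)]
print([ex["url"] for ex in subset])  # the curated training subset
```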

Fireside Chat: Building RedPajama

Foundation models such as GPT-4 have driven rapid improvement in AI. However, the most powerful models are closed commercial models or only partially open. RedPajama is a project to create a set of leading, fully open-source models. In this session, Ce Zhang, Together CTO, and Braden Hancock, Snorkel AI Co-Founder and Head of Technology, discuss the data collection and training processes that went into building the RedPajama models.

Transforming the Customer Experience with AI: Wayfair’s Data-Centric Way

Wayfair’s Archana Sapkota, ML Manager, and Vinny DeGenova, Associate Director of ML, walk through the problems they solve at Wayfair using machine learning, which impacts all aspects of a customer’s journey. They provide insights on how they use ML to understand their customers as well as the products in their catalog. They also discuss some of the challenges they face in their space and how they are using ML best practices, state-of-the-art foundation models, and data-centric approaches to solve these problems.

One way they help their customers find products is by cleaning and enriching their catalog. Wayfair does this by automating image tagging using a data-centric approach. Archana and Vinny provide insights on how they have accomplished this and share their findings.

Finally, they touch on an important aspect of their approach: the collaboration between subject matter experts (SMEs) and data scientists (DS). By working closely together, the two groups are able to quickly iterate on model development and testing, ultimately leading to a faster time-to-market for the models Wayfair develops.

Generating Synthetic Tabular Data that is Differentially Private

Generative models can produce synthetic datasets that preserve the statistical qualities of the training data without identifying any particular record. Most generative models to date, however, do not offer mathematical guarantees of privacy that can be used to facilitate information sharing or publishing. Without such guarantees, each adversarial attack on these models and the synthetic data they generate must be thwarted reactively, and we can never anticipate attacks that might become feasible in the future. This is exactly the problem that differential privacy (DP) solves by bounding the probability that a compromising event occurs.

By introducing calibrated noise into an algorithm, DP defends against all future privacy attacks with high probability. In this session, Lipika Ramaswamy, Senior Applied Scientist at Gretel AI, explores approaches to applying differential privacy, including one that measures low-dimensional distributions in a dataset and then learns a graphical-model representation from them. She ends with a preview of Gretel’s new generative model that applies this method to create high-quality synthetic tabular data that is differentially private.
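
The core measurement step can be sketched concretely; the following is a minimal illustration (not Gretel’s implementation) of adding Laplace noise, calibrated to the count query’s sensitivity of 1 and a privacy budget epsilon, to a one-way histogram before sampling synthetic values from it.

```python
# Minimal sketch of the marginal-measurement step: add calibrated
# Laplace noise to a one-way histogram so the released counts satisfy
# epsilon-differential privacy. (A full pipeline would measure many
# low-dimensional marginals and fit a graphical model to them.)
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=10_000)  # private column (simulated)
counts, edges = np.histogram(ages, bins=8)

epsilon = 1.0
sensitivity = 1.0  # adding/removing one record changes one count by 1
noisy = counts + rng.laplace(scale=sensitivity / epsilon, size=counts.shape)
noisy = np.clip(noisy, 0, None)  # post-processing, so still epsilon-DP

# Sample synthetic bin indices from the noisy marginal.
probs = noisy / noisy.sum()
synthetic_bins = rng.choice(len(probs), size=10_000, p=probs)
print(np.round(probs, 3))
```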

Accelerating Information Extraction with Data-Centric Iteration

During this session, John Semerdjian, Snorkel’s Tech Lead Manager for Applied ML, and Vincent Chen, Founding Engineer and Director of Product, discuss practical workflows for building enterprise information extraction applications. They start with an end-to-end deep dive into “sequence tagging” tasks in Snorkel Flow, where they highlight how teams of data scientists and subject-matter experts can rapidly build powerful, zero-to-one models. In doing so, they cover the key annotation, error analysis, and model-guided iteration capabilities that have helped Snorkel’s customers unblock models that power high-value use cases in production. Finally, they discuss exciting opportunities for even further acceleration of these workflows in an FM-first world.

Fireside Chat: Alex Ratner and Gideon Mann on Building BloombergGPT

Gideon Mann, Head of Machine Learning Product and Research, CTO Office at Bloomberg, joins Alex Ratner for a conversation about how Mann and his team built a domain-specific LLM, BloombergGPT.

Fireside Chat: The Role of Data in Building Stable Diffusion and Generative AI

Discover the transformative power of data in developing Stable Diffusion and Generative AI, as the Founder and CEO of Stability AI, Emad Mostaque, chats with Alex Ratner and shares insights into the pivotal role data plays in creating these groundbreaking technologies. Explore the journey of leveraging data-driven approaches to drive innovation, unlock new possibilities, and shape the future of AI.

The Opportunity of Data-Centric AI in Insurance

Alejandro Zarate Santovena, Lecturer at Columbia University and Managing Director at Marsh, discussed the growing importance of AI adoption in the insurance industry, driven by the increasing demand for quick insights and the acceleration of data-driven decision-making. He also delved into AI use cases in insurance, such as analytics, claims processing, underwriting, fraud detection, and improving customer experiences.

Learn how to get more value from your PDF documents!

Transforming unstructured data such as text and documents into structured data is crucial for enterprise AI development. On December 17, we’ll hold a webinar that explains how to capture SME domain knowledge and use it to automate and scale PDF classification and information extraction tasks.

Sign up here!