
Alex Ratner

Co-Founder & CEO, Snorkel AI
Faculty, University of Washington

Alex Ratner is the co-founder and CEO at Snorkel AI, and an affiliate assistant professor of computer science at the University of Washington. Prior to Snorkel AI and UW, he completed his Ph.D. in computer science advised by Christopher Ré at Stanford, where he started and led the Snorkel open source project. His research focused on data-centric AI, applying data management and statistical learning techniques to AI data development and curation.

The latest from Alex

On the Tradeoff of Intra-/Inter-class Diversity for Supervised Pre-training
Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that...
Research Paper
Oct 20, 2023
J. Zhang et al.
MaskSearch: Querying Image Masks at Scale
Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps) and enable a variety of applications (e.g., determine if a model is learning spurious correlations or if an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, existing systems do not support such queries efficiently. In this paper, we formalize the problem and propose a system, MaskSearch, that focuses on accelerating queries over databases of image masks. MaskSearch leverages a novel indexing technique and an efficient filter-verification query execution framework....
Research Paper
Oct 20, 2023
D. He et al.
Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings...
Research Paper
Oct 20, 2023
C.-Y. Hsieh et al.
DataComp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists...
Research Paper
Oct 20, 2023
S. Y. Gadre et al.
Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision
The paper proposes a statistical label model called FABLE that incorporates instance features to improve the accuracy of inferred truth in Programmatic Weak Supervision (PWS). FABLE is built on a mixture of Bayesian label models, where the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features.
Research Paper
Aug 02, 2023
J. Zhang et al.
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, these approaches generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform...
Research Paper
Jun 28, 2023
Y. Yu et al.
Binary Classification with Positive Labeling Sources
This paper demonstrates that WEAPO, a Weak Supervision method for binary classification tasks with only positive labeling sources, is effective and efficient—achieving the highest performance of the tested Weak Supervision approaches in terms of label quality and final classifier accuracy on 10 benchmark datasets.
Research Paper
Mar 15, 2023
J. Zhang et al.
Understanding Programmatic Weak Supervision via Source-aware Influence Function
This paper proposes a source-aware variant of the influence function that measures the influence of individual components in the Programmatic Weak Supervision pipeline. It can be used for multiple purposes, such as understanding incorrect predictions, identifying mislabeling by sources, and improving the end model's generalization performance.
Research Paper
Mar 15, 2023
J. Zhang et al.
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
This paper presents Nemo, an interactive system that improves the overall productivity of Weak Supervision learning pipelines by an average of 20%, compared to the prevailing WS approach.
Research Paper
Mar 15, 2023
C. Hsieh et al.