Alex Ratner

On the Tradeoff of Intra-/Inter-class Diversity for Supervised Pre-training

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study of their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes from a balance between intra- and inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that...

Research Paper

Oct 20, 2023 • J. Zhang, et al.

MaskSearch: Querying Image Masks at Scale

Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps) and enable a variety of applications (e.g., determine if a model is learning spurious correlations or if an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, existing systems do not support such queries efficiently. In this paper, we formalize the problem and propose a system, MaskSearch, that focuses on accelerating queries over databases of image masks. MaskSearch leverages a novel indexing technique and an efficient filter-verification query execution framework....

Research Paper

Oct 20, 2023 • D. He, et al.

Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings...

Research Paper

Oct 20, 2023 • C.-Y. Hsieh, et al.

DataComp: In search of the next generation of multimodal datasets

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists...

Research Paper

Oct 20, 2023 • S.Y. Gadre, et al.

Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision

The paper proposes a statistical label model called FABLE that incorporates instance features to improve the accuracy of inferred truth in Programmatic Weak Supervision (PWS). FABLE is built on a mixture of Bayesian label models, where the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features.

Research Paper

Aug 02, 2023 • J. Zhang, et al.

Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, it generally relies on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform...

Research Paper

Jun 28, 2023 • Y. Yu, et al.

Binary Classification with Positive Labeling Sources

This paper demonstrates that WEAPO, a Weak Supervision method for binary classification tasks with only positive labeling sources, is effective and efficient—achieving the highest performance of the tested Weak Supervision approaches in terms of label quality and final classifier accuracy on 10 benchmark datasets.

Research Paper

Mar 15, 2023 • J. Zhang, et al.

Understanding Programmatic Weak Supervision via Source-aware Influence Function

This paper proposes a source-aware variation of the influence function, which measures the influence of individual components in the Programmatic Weak Supervision pipeline and can be used for multiple purposes, such as understanding incorrect predictions, identifying mislabeling of sources, and improving the end model's generalization performance.

Research Paper

Mar 15, 2023 • J. Zhang, et al.

Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming

This paper presents Nemo, an interactive system that improves the overall productivity of Weak Supervision learning pipelines by an average of 20%, compared to the prevailing WS approach.

Research Paper

Mar 15, 2023 • C. Hsieh, et al.
