Braden Hancock

Systems and methods for programmatic labeling of training data for machine learning models via clustering and language model prompting

Embodiments introduce an approach to semi-automatically generate labels for data based on implementation of a clustering or language model prompting technique and can be used to implement a form of programmatic labeling to accelerate the development of classifiers and other forms of models. The disclosed methodology is particularly helpful in generating labels or annotations for unstructured data. In some embodiments, the disclosed approach may be used with data in the form of text, images, or other form of unstructured data.

Research Paper

Sep 23, 2024 • RN Smith, et al.

The Llama 3 Herd of Models

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B...

Research Paper

Sep 18, 2024 • A. Dubey, et al.

Language Models in the Loop: Incorporating Prompting into Weak Supervision

We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than apply the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data....

Research Paper

Aug 22, 2024 • R. Smith et al.

DMLR: Data-centric Machine Learning Research-Past, Present and Future

Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.

Research Paper

Nov 21, 2023 • L. Oala, et al.

Blog

Better not bigger: How to get GPT-3 quality at 0.1% the cost

We created Data-centric Foundation Model Development to bridge the gaps between foundation models and enterprise AI. New Snorkel Flow capabilities (Foundation Model Fine-tuning, Warm Start, and Prompt Builder) give data science and machine learning teams the tools they need to effectively put foundation models (FMs) to use for performance-critical enterprise use cases. The need is clear: despite undeniable excitement about…

Nov 17, 2022 • Stephen Bach, Jason Fries, Braden Hancock

Blog

ICLR 2022 recap from Snorkel AI

We are honored to be part of the International Conference on Learning Representations (ICLR) 2022, where Snorkel AI founders and researchers will be presenting five papers on data-centric AI topics. The field of artificial intelligence moves fast! This is a world we are intimately familiar with at Snorkel AI, having spun out of academia in 2019. For over half a…

Apr 20, 2022 • Braden Hancock

Blog

Making Automated Data Labeling a Reality in Modern AI

Moving from Manual to Programmatic Labeling. Labeling training data by hand is exhausting. It’s tedious, slow, and expensive: the de facto bottleneck most AI/ML teams face today. Eager to alleviate this pain point of AI development, machine learning practitioners have long sought ways to automate this labor-intensive labeling process (i.e., “automated data labeling”), and have reached for classic approaches…

Feb 04, 2022 • Braden Hancock

Blog

How to Use Snorkel to Build AI Applications

The how, what, and why of Snorkel’s programmatic data labeling approach and the state-of-the-art Snorkel Flow platform. The year was 2015. For the first time, machine learning (ML) had outperformed humans in the annual ImageNet challenge.

Jul 09, 2021 • Braden Hancock

Blog

3 Impractical Assumptions About AI to Avoid

Impractical ML assumptions are made every day in research, which limit its adoption. In the real world, these assumptions do not hold up. Learn more about how to avoid making these assumptions about AI application development.

May 04, 2021 • Braden Hancock
