Alex Ratner

Blog

Data-centric development of an enterprise AI agent with Snorkel

See how we can use these two new products, Snorkel Evaluate and Expert Data-as-a-Service, to evaluate and develop a specialized agentic AI system for an enterprise use case.

May 29, 2025 • Alex Ratner

Blog

Building the data development platform for specialized AI

Announcing two new products on our AI Data Development Platform that together create a complete solution for enterprises to specialize AI systems with expert data at scale.

May 29, 2025 • Alex Ratner

On the Tradeoff of Intra-/Inter-class Diversity for Supervised Pre-training

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that...

Research Paper

Sep 18, 2024 • J. Zhang et al.

Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle and LLMs’ intrinsic attention bias: LLMs exhibit a U-shaped attention bias where the tokens at the beginning and at the end of the input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through...

Research Paper

Sep 18, 2024 • C. Hsieh et al.

Blog

Walking safely before building flying saucer seatbelts: introducing Enterprise Alignment

Snorkel takes a step on the path to enterprise superalignment with new data development workflows for enterprise alignment

May 20, 2024 • Alex Ratner, Tom Walshe, Chris Glaze, Fred Sala, Paroma Varma, Hoang Tran

Blog

Crossing the demo-to-production chasm with Snorkel Custom

We’re excited to announce Snorkel Custom to help enterprises cross the chasm from flashy chatbot demos to real production AI value.

Apr 11, 2024 • Alex Ratner

Blog

Enterprises must shift their focus from models to data in AI development

Snorkel AI CEO Alex Ratner explains his view on the importance of AI in data development and illustrates his position with two case studies.

Feb 09, 2024 • Alex Ratner

Characterizing the Impacts of Semi-supervised Learning for Weak Supervision

Labeling training data is a critical and expensive step in producing high-accuracy ML models, whether training from scratch or fine-tuning. To make labeling more efficient, two major approaches are programmatic weak supervision (WS) and semi-supervised learning (SSL). More recent works have either explicitly or implicitly used techniques at their intersection, but in various complex and ad hoc ways. In this work, we define a simple, modular design space to study the use of SSL techniques for WS more systematically. Surprisingly, we find that fairly simple methods from our design space match the performance of more complex state-of-the-art methods, averaging...

Research Paper

Jan 16, 2024 • Jeffrey Li, Jieyu Zhang, Ludwig Schmidt & Alexander Ratner

Tool documentation enables zero-shot tool-usage with large language models

Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool’s usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation—descriptions for the individual tool usage—over demonstrations....

Research Paper

Oct 20, 2023 • C.-Y. Hsieh et al.
