Fred Sala

Shrinking the generation-verification gap with weak verifiers

Verifiers can enhance language model (LM) performance by scoring and ranking a set of generated responses, but high-quality verifiers today are either unscalable (like human judges) or of limited practical use (such as formal proof tools like Lean). While LM-based judges and reward models serve as general-purpose verifiers, they still fall short of the performance levels achieved by oracle verifiers, which are perfectly accurate. To bridge this gap, the Weaver framework is introduced as a method for constructing a strong verifier by combining multiple weaker, imperfect ones. Weaver shows that weighted ensembles of verifiers, which traditionally depend on labeled data,...

Research Paper

Shrinking the generation-verification gap with weak verifiers

Verifiers can enhance language model (LM) performance by scoring and ranking a set of generated responses, but high-quality verifiers today are either unscalable (like human judges) or of limited practical use (such as formal proof tools like Lean). While LM-based judges and reward models serve as general-purpose verifiers, they still fall short of the performance levels achieved by oracle verifiers,…

Jul 30, 2025 •

Frederic Sala, et all.

Learn more about Shrinking the generation-verification gap with weak verifiers

Blog

Building the benchmark: inside our agentic insurance underwriting dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

Jul 10, 2025 •

Chris Glaze, Fred Sala

Learn more about Building the benchmark: inside our agentic insurance underwriting dataset

Weak-to-strong generalization through the data-centric lens

The weak-to-strong generalization phenomenon is the driver for important machine learning applications including highly data-efficient learning and, most recently, performing superalignment. While decades of research have resulted in numerous algorithms that produce strong empirical performance, understanding what aspects of data enable weak-to-strong generalization has been understudied. We propose a simple data-centric mechanism that characterizes weak-to-strong generalization: the overlap density. Intuitively, generalization tracks the number of points that contain overlaps, i.e., both easy patterns (learnable by a weak model) and challenging patterns (only learnable by a stronger model), as with such points, weak predictions can be used to learn challenging patterns...

Research Paper

Weak-to-strong generalization through the data-centric lens

The weak-to-strong generalization phenomenon is the driver for important machine learning applications including highly data-efficient learning and, most recently, performing superalignment. While decades of research have resulted in numerous algorithms that produce strong empirical performance, understanding what aspects of data enable weak-to-strong generalization has been understudied. We propose a simple data-centric mechanism that characterizes weak-to-strong generalization: the overlap density. Intuitively,…

Mar 01, 2025 •

Changho Shin, John Cooper, Frederic Sala Department of Computer Science University of Wisconsin-Madison

Learn more about Weak-to-strong generalization through the data-centric lens

Theoretical Physics Benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics

We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We...

Research Paper

Theoretical Physics Benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics

We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data…

Feb 01, 2025 •

Daniel J.H. Chung, Zhiqi Gao, Yurii Kvasiuk, Tianyi Li, Moritz Munchmeyer, Maja Rudolph, Frederic Sala, and Sai Chaitanya Tadepalli

Learn more about Theoretical Physics Benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics

Zero-Shot Robustification of Zero-Shot Models with Foundation Models

Zero-shot inference is a powerful paradigm that enables the use of large pretrained models for downstream classification tasks without further training. However, these models are vulnerable to inherited biases that can impact their performance. The traditional solution is fine-tuning, but this undermines the key advantage of pretrained models, which is their ability to be used out-of-the-box. We propose ROBOSHOT, a method that improves the robustness of pretrained model embeddings in a fully zero-shot fashion. First, we use zero-shot language models (LMs) to obtain useful insights from task descriptions. These insights are embedded and used to remove harmful and boost useful...

Research Paper

Zero-Shot Robustification of Zero-Shot Models with Foundation Models

Zero-shot inference is a powerful paradigm that enables the use of large pretrained models for downstream classification tasks without further training. However, these models are vulnerable to inherited biases that can impact their performance. The traditional solution is fine-tuning, but this undermines the key advantage of pretrained models, which is their ability to be used out-of-the-box. We propose ROBOSHOT, a…

Sep 18, 2024 •

D. Adila, et al.

Learn more about Zero-Shot Robustification of Zero-Shot Models with Foundation Models

The ALCHEmist: automated labeling 500x cheaper than LLM data annotators

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to...

Research Paper

The ALCHEmist: automated labeling 500x cheaper than LLM data annotators

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather…

Sep 18, 2024 •

TH. Huang, et al.

Learn more about The ALCHEmist: automated labeling 500x cheaper than LLM data annotators

Product Manifold Representations for Learning on Biological Pathways

Machine learning models that embed graphs in non-Euclidean spaces have shown substantial benefits in a variety of contexts, but their application has not been studied extensively in the biological domain, particularly with respect to biological pathway graphs. Such graphs exhibit a variety of complex network structures, presenting challenges to existing embedding approaches. Learning high-quality embeddings for biological pathway graphs is important for researchers looking to understand the underpinnings of disease and train high-quality predictive models on these networks. In this work, we investigate the effects of embedding pathway graphs in nonEuclidean mixed-curvature spaces and compare against traditional Euclidean graph representation...

Research Paper

Product Manifold Representations for Learning on Biological Pathways

Machine learning models that embed graphs in non-Euclidean spaces have shown substantial benefits in a variety of contexts, but their application has not been studied extensively in the biological domain, particularly with respect to biological pathway graphs. Such graphs exhibit a variety of complex network structures, presenting challenges to existing embedding approaches. Learning high-quality embeddings for biological pathway graphs is…

Sep 18, 2024 •

D. McNeela, et al.

Learn more about Product Manifold Representations for Learning on Biological Pathways

Pretrained Hybrids with MAD Skills

While Transformers underpin modern large language models (LMs), there is a growing list of alternative architectures with new capabilities, promises, and tradeoffs. This makes choosing the right LM architecture challenging. Recently-proposed hybrid architectures seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We propose Manticore, 1 a framework that addresses these challenges. Manticore automates the design of hybrid architectures while reusing pretrained models to create pretrained hybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by...

Research Paper

Pretrained Hybrids with MAD Skills

While Transformers underpin modern large language models (LMs), there is a growing list of alternative architectures with new capabilities, promises, and tradeoffs. This makes choosing the right LM architecture challenging. Recently-proposed hybrid architectures seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must…

Sep 18, 2024 •

N. Roberts, et al.

Learn more about Pretrained Hybrids with MAD Skills

Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Auto-labeling is an important family of techniques that produce labeled training sets with minimum manual labeling. A prominent variant, threshold-based auto-labeling (TBAL), works by finding a threshold on a model’s confidence scores above which it can accurately label unlabeled data points. However, many models are known to produce overconfident scores, leading to poor TBAL performance. While a natural idea is to apply off-the-shelf calibration methods to alleviate the overconfidence issue, such methods still fall short. Rather than experimenting with ad-hoc choices of confidence functions, we propose a framework for studying the optimal TBAL confidence function. We develop a tractable version...

Research Paper

Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Auto-labeling is an important family of techniques that produce labeled training sets with minimum manual labeling. A prominent variant, threshold-based auto-labeling (TBAL), works by finding a threshold on a model’s confidence scores above which it can accurately label unlabeled data points. However, many models are known to produce overconfident scores, leading to poor TBAL performance. While a natural idea is…

Sep 18, 2024 •

H. Vishwakarma, et al.

Learn more about Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Fred Sala

The latest from Fred

For models that need to be right. Not just good enough.

How do you want to work with Snorkel?