Fred Sala

Chief Scientist, Snorkel AI
Assistant Professor @ University of Wisconsin-Madison

Frederic Sala is Chief Scientist at Snorkel AI and an assistant professor in the Computer Sciences Department at the University of Wisconsin-Madison. His research studies the fundamentals of data-driven systems and machine learning, with a focus on data-centric AI, foundation models, and automated machine learning. He and his group received the 2024 DARPA Young Faculty Award, a best student paper runner-up award at UAI ’22, the outstanding Ph.D. dissertation award from the UCLA Department of Electrical Engineering, and the NSF Graduate Research Fellowship.

The latest from Fred

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...
Research Paper
Accepted to MLSys 2026
RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by...
Research Paper
Accepted to ICLR Brazil 2026
A chat with the Terminal-Bench team
Blog
Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.

Nov 19, 2025
Beyond accuracy: Dissecting mathematical reasoning for LLMs under reinforcement learning
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of why and how RL enhances performance is still lacking. To bridge this gap, we introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions: (1) plan following and execution, (2) knowledge integration, and (3) chain of subproblems. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit human-crafted, step-by-step plans can surprisingly degrade performance...
Research Paper
Nov 17, 2025
Jiayu Wang, Yifei Ming, Zixuan Ke, Caiming Xiong, Shafiq Joty, Aws Albarghouthi, Frederic Sala
Automating benchmark design
The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space...
Research Paper
Reference-specific unlearning metrics can hide the truth: A reality check
Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially as existing metrics often rely on specific reference outputs. The widely used forget quality metric from the TOFU benchmark compares likelihoods over paraphrased answers but is highly sensitive to the choice of the reference answers, potentially obscuring whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence: how closely an unlearned model aligns functionally with the retain-only model. To this end, we propose Functional Alignment for Distributional Equivalence (FADE), a novel distribution-level metric that compares two distributions of textual...
Research Paper
Sep 23, 2025
Sungjun Cho, Dasol Hwang, Frederic Sala, Sangheum Hwang, Kyunghyun Cho, Sungmin Cha
From many voices to one: Statistically principled aggregation of LLM judges
LLM-as-a-judge, often with multiple judges, is now the standard for scalable model evaluation, yet judge biases and correlations can amplify errors. We cast aggregation as inference in a latent-factor Markov random field that jointly models a latent true-quality variable, inter-judge correlations, and confounders (e.g., generation length). We address two key technical challenges, identifiability and learning a higher-rank latent structure, via CARE, a two-stage estimator that uses sparse+low-rank structure recovery and tensor decomposition to separate quality from spurious factors. This enables us to better understand the quality and behavior of judges, leading to improved evaluation capabilities. Empirically, it reduces aggregation error by up to 25.15% and seamlessly incorporates...
Research Paper
Sep 23, 2025
Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala
Shrinking the generation-verification gap with weak verifiers
Verifiers can enhance language model (LM) performance by scoring and ranking a set of generated responses, but high-quality verifiers today are either unscalable (like human judges) or of limited practical use (such as formal proof tools like Lean). While LM-based judges and reward models serve as general-purpose verifiers, they still fall short of the performance levels achieved by oracle verifiers, which are perfectly accurate. To bridge this gap, the Weaver framework is introduced as a method for constructing a strong verifier by combining multiple weaker, imperfect ones. Weaver shows that weighted ensembles of verifiers, which traditionally depend on labeled data,...
Research Paper
Jul 30, 2025
Frederic Sala, et al.
Building the benchmark: inside our agentic insurance underwriting dataset
Blog
In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

Jul 10, 2025