We develop methods, benchmarks, and training systems that turn expert data into frontier AI

Browse research library

from the lab

Featured research

NEW

Benchmark

Senior SWE-bench built by Snorkel with Princeton University and UW-Madison

Benchmark

Open Benchmarks Grants

Continual Learning Bench: Evaluating agents that adapt and improve over time

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Agents' Last Exam: Challenge and measure AI agents on economically valuable and real-world tasks

Benchmark

Open Benchmarks Grants

Agentic Coding benchmark: Evaluating AI Models on complex, real-world coding tasks

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Senior SWE-bench
Built with Princeton & UW-Madison

OSWorld 2.0
Co-authored with XLANG Lab

Agents' Last Exam
Co-authored with UC Berkeley RDI

BigLaw Bench: Research
Co-released with Harvey

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Shrinking the Generation–Verification Gap with Weak Verifiers
NeurIPS 2025

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
UAI 2022, Best Paper Runner-Up

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Watch the latest episode

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Towards Curiosity-Driven Learning of Physical Dynamics

Throughout our lives, we as humans acquire an intuitive understanding of our physical environments, a capacity that supports our imagination and planning abilities. Driven by our own curiosity, we learn about object motion and properties via self-curated targeted experiments, that teach us what we do not know. Recently, neural network models have been proposed that learn forward object dynamics from observations like humans. Unlike humans, these models do not actively interact with surrounding objects but learn from human-curated datasets as passive observers. In this work-in-progress, we propose a closed-loop system that teaches itself about forward object dynamics without any human...

Research Paper

Apr 26, 2020 • MJ. Lingelbach, et al.

Weakly Supervised Sequence Tagging from Noisy Rules

We propose a framework for training sequence tagging models with weak supervision consisting of multiple heuristic rules of unknown accuracy. In addition to supporting rules that vote on tags in the output sequence, we introduce a new type of weak supervision, called linking rules, that vote on how sequence elements should be grouped into spans with the same tag. These rules are an alternative to candidate span generators that require significantly more human effort. To estimate the accuracies of the rules and combine their conflicting outputs into training data, we introduce a new type of generative model, linked hidden Markov...

Research Paper

Apr 03, 2020 • E. Safranchik, et al.

Weakly Supervised Classification of Aortic Valve Malformations Using Unlabeled Cardiac MRI Sequences

This work formalizes a deep learning baseline for aortic valve classification and outlines a general strategy for using weak supervision to train machine learning models using unlabeled medical images at scale.

Research Paper

Dec 20, 2019 • J. Fries, et al., 2019

Utilizing Weak Supervision to Infer Complex Objects in Autonomous Driving Data

While the detection and classification of simple objects encountered during autonomous driving sessions has been widely researched, the detection of complex objects and situations based on the combinations of objects in a scene remains relatively overlooked. This is especially difficult due to the cost of gathering labels for each complex scenario of interest before training a specialized model. To address this bottleneck of training data, we explore the applicability of weak supervision, or relying on higher level, noisier forms of supervision to label training data. Specifically, we use data programming, a paradigm that can learn the accuracy and dependency structure...

Research Paper

Dec 19, 2019 • Z. Weng, et al., 2019

Training Complex Models with Multi-Task Weak Supervision

Proposes a framework for integrating and modeling weak supervision sources by viewing them as labeling different related sub-tasks of a problem, a setting the authors call multi-task weak supervision.

Research Paper

Dec 18, 2019 • A. Ratner, et al., 2019

The Role of Massively Multi-Task and Weak Supervision in Software 2.0

Outlines a vision for a Software 2.0 lifecycle centered on the idea that labeling training data can be the primary interface to Software 2.0 systems.

Research Paper

Dec 17, 2019 • A. Ratner, et al., 2019

Snuba: Automating Weak Supervision to Label Training Data

As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Users therefore turn to weak supervision, relying on imperfect sources of labels like pattern matching and user-defined heuristics. Unfortunately, users have to design these sources for each task. This process can be time consuming and expensive: domain experts often perform repetitive steps like guessing optimal numerical thresholds and developing informative text patterns. To address these challenges, we present Snuba, a system to automatically generate heuristics using a small labeled dataset to assign training labels to a large,...

Research Paper

Dec 16, 2019 • P. Varma and C. Ré, 2019

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

This is a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision to bring development time and cost down by an order of magnitude. It also introduces Snorkel DryBell, a new weak supervision management system for this setting.

Research Paper

Dec 15, 2019 • S. Bach, et al., 2019

Slice-Based Learning: A Programming Model for Residual Learning

In real-world machine learning applications, data subsets correspond to especially critical outcomes: vulnerable cyclist detections are safety-critical in an autonomous driving task, and "question" sentences might be important to a dialogue agent's language understanding for product purposes. While machine learning models can achieve quality performance on coarse-grained metrics like F1-score and overall accuracy, they may underperform on these critical subsets; we define these as slices, the key abstraction in our approach. To address slice-level performance, practitioners often train separate "expert" models on slice subsets or use multi-task hard parameter sharing. We propose Slice-based Learning, a new programming model in which the...

Research Paper

Dec 14, 2019 • V. Chen, et al., 2019