We develop methods, benchmarks, and training systems that turn expert data into frontier AI

from the lab

Featured research

NEW

Benchmark

Senior SWE-bench built by Snorkel with Princeton University and UW-Madison

Benchmark

Open Benchmarks Grants

Continual Learning Bench: Evaluating agents that adapt and improve over time

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Agents' Last Exam: Challenge and measure AI agents on economically valuable and real-world tasks

Benchmark

Open Benchmarks Grants

Agentic Coding benchmark: Evaluating AI Models on complex, real-world coding tasks

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Senior SWE-bench
Built with Princeton & UW-Madison

OSWorld 2.0
Co-authored with XLANG Lab

Agents' Last Exam
Co-authored with UC Berkeley RDI

BigLaw Bench: Research
Co-released with Harvey

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Shrinking the Generation–Verification Gap with Weak Verifiers
NeurIPS 2025

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
UAI 2022, Best Paper Runner-Up

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Previously Director of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Scene Graph Prediction With Limited Labels

As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Users therefore turn to weak supervision, relying on imperfect sources of labels like pattern matching and user-defined heuristics. Unfortunately, users have to design these sources for each task. This process can be time consuming and expensive: domain experts often perform repetitive steps like guessing optimal numerical thresholds and developing informative text patterns. To address these challenges, we present Snuba, a system to automatically generate heuristics using a small labeled dataset to assign training labels to a large,...

Research Paper

Dec 13, 2019 • V. Chen et al., 2019

Osprey: Weak Supervision of Imbalanced Extraction Problems Without Code

Proposing Osprey, a weak-supervision system suited for highly imbalanced data, built on top of the Snorkel framework.

Research Paper

Dec 12, 2019 • E. Bringer et al., 2019

Multi-Resolution Weak Supervision for Sequential Data

Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collections of elements in a sequence. A key challenge in weak supervision is estimating the unknown accuracies and correlations of these sources without using labeled data. Multi-resolution sources exacerbate this challenge due to complex correlations and sample complexity that scales in the length of the sequence. We propose Dugong, the first framework...

Research Paper

Dec 11, 2019 • P. Varma et al., 2019

Medical Device Surveillance With Electronic Health Records

Showcasing state-of-the-art deep learning methods that identify patient outcomes from clinical notes without requiring hand-labeled training data.

Research Paper

Dec 10, 2019 • A. Callahan et al., 2019

Learning Dependency Structures for Weak Supervision Models

Labeling training data is a key bottleneck in the modern machine learning pipeline. Recent weak supervision approaches combine labels from multiple noisy sources by estimating their accuracies without access to ground truth labels; however, estimating the dependencies among these sources is a critical challenge. We focus on a robust PCA-based algorithm for learning these dependency structures, establish improved theoretical recovery rates, and outperform existing methods on various real-world tasks. Under certain conditions, we show that the amount of unlabeled data needed can scale sublinearly or even logarithmically with the number of sources m, improving over previous efforts that ignore the...
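As a rough illustration of the setting this abstract describes, the sketch below combines votes from several noisy sources with a plain majority vote. This is a simple baseline, not the paper's PCA-based dependency-learning algorithm: it treats every source as equally accurate and independent. The `ABSTAIN` convention, the function name, and the example votes are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: a source that casts no vote returns -1


def majority_vote(votes):
    """Return the most common non-abstain label, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    # A tie between the two most frequent labels gives no clear winner.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Each row holds the votes of three hypothetical sources for one example.
label_matrix = [
    [1, 1, ABSTAIN],
    [0, 1, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, ABSTAIN],
]
aggregated = [majority_vote(row) for row in label_matrix]
print(aggregated)
```

Majority vote ignores how accurate or correlated the sources are, which is the gap that accuracy and dependency estimation is meant to close.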

Research Paper

Dec 09, 2019 • P. Varma et al., 2019

Interactive Programmatic Labeling for Weak Supervision

Demonstrating in synthetic and real-world experiments how two simple labeling function acquisition strategies outperform a random baseline.

Research Paper

Dec 08, 2019 • B. Cohen-Wang et al., 2019

Bootstrapping Conversational Agents with Weak Supervision

This paper presents a framework called search, label, and propagate (SLP) for bootstrapping intents from existing chat logs using weak supervision.

Research Paper

Dec 07, 2019 • N. Mallinar et al., 2019

A Machine-Compiled Database of Genome-Wide Association Studies

Describing GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms.

Research Paper

Dec 06, 2019 • V. Kuleshov et al., 2019

A Clinical Text Classification Paradigm Using Weak Supervision…

This work develops a rule-based NLP algorithm that automatically generates labels for the training data, then uses pre-trained word embeddings as deep representation features for training machine learning models.
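The two-stage idea in this summary can be sketched in a few lines: rules assign labels to documents, and a classifier is then trained on embedding features so it can score documents the rules never matched. Everything below is a hypothetical stand-in rather than the paper's pipeline: the keyword rules are invented, the 2-d vectors are toy values in place of real pre-trained embeddings, and a nearest-centroid classifier replaces whatever models the authors trained.

```python
# Toy 2-d word vectors standing in for pre-trained embeddings (hypothetical values).
EMBEDDINGS = {
    "fracture": (1.0, 0.1),
    "broken": (0.9, 0.2),
    "pain": (0.6, 0.4),
    "normal": (0.1, 1.0),
    "healthy": (0.2, 0.9),
    "routine": (0.3, 0.8),
}

POSITIVE_KEYWORDS = {"fracture"}  # hypothetical rule: keyword implies label 1
NEGATIVE_KEYWORDS = {"normal"}    # hypothetical rule: keyword implies label 0


def rule_label(tokens):
    """Rule-based labeler: returns 1, 0, or None when no rule fires."""
    if POSITIVE_KEYWORDS & set(tokens):
        return 1
    if NEGATIVE_KEYWORDS & set(tokens):
        return 0
    return None


def embed(tokens):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def train_centroids(docs):
    """Compute one mean feature vector per rule-assigned label."""
    sums, counts = {}, {}
    for tokens in docs:
        label = rule_label(tokens)
        if label is None:
            continue  # documents no rule matches are skipped during training
        x = embed(tokens)
        prev = sums.get(label, (0.0, 0.0))
        sums[label] = (prev[0] + x[0], prev[1] + x[1])
        counts[label] = counts.get(label, 0) + 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab]) for lab, s in sums.items()}


def predict(centroids, tokens):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    x = embed(tokens)
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


docs = [
    ["fracture", "pain"],
    ["broken", "fracture"],
    ["normal", "routine"],
    ["healthy", "normal"],
]
centroids = train_centroids(docs)
# No rule fires on this document, yet the trained classifier can still score it.
prediction = predict(centroids, ["broken", "pain"])
print(prediction)
```

The point of the second stage is generalization: the document `["broken", "pain"]` contains no rule keyword, but its embedding sits near the documents the rules labeled positive.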

Research Paper

Dec 05, 2019 • Y. Wang et al., 2019
