We develop methods, benchmarks, and training systems that turn expert data into frontier AI

from the lab

Featured research

Benchmark

Senior SWE-bench built by Snorkel with Princeton University and UW-Madison

Benchmark

Open Benchmarks Grants

Continual Learning Bench: Evaluating agents that adapt and improve over time

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Agents' Last Exam: Challenge and measure AI agents on economically valuable and real-world tasks

Benchmark

Open Benchmarks Grants

Agentic Coding benchmark: Evaluating AI Models on complex, real-world coding tasks

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Senior SWE-bench
Built with Princeton & UW-Madison

OSWorld 2.0
Co-authored with XLANG Lab

Agents' Last Exam
Co-authored with UC Berkeley RDI

BigLaw Bench: Research
Co-released with Harvey

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Shrinking the Generation–Verification Gap with Weak Verifiers
NeurIPS 2025

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
UAI 2022, Best Paper Runner-Up

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

DEEP RESEARCH Expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Training Classifiers with Natural Language Explanations

Training accurate classifiers requires many labels, but each label provides only limited information (one bit for binary classification). In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier. On three relation extraction tasks, we find that users are able to train classifiers with comparable F1 scores from 5–100× faster by providing explanations instead of just labels. Furthermore, given...

Research Paper

Dec 20, 2018 • B. Hancock et al., 2018

Software 2.0 and Snorkel: Beyond Hand-Labeled Data

This paper describes Snorkel, a system that enables users to help shape, create, and manage training data for Software 2.0 stacks.

Research Paper

Dec 19, 2018 • C. Ré, 2018 (invited)

Snorkel MeTaL: Weak Supervision for Multi-Task Learning

Presenting Snorkel MeTaL, an end-to-end system for multi-task learning.

Research Paper

Dec 18, 2018 • A. Ratner et al., 2018

Fonduer: Knowledge Base Construction From Richly Formatted Data

Introducing Fonduer, a machine-learning-based KBC system for richly formatted data.

Research Paper

Dec 17, 2018 • S. Wu et al., 2018

Deep Text Mining of Instagram Data Without Strong Supervision

This paper showcases methods for unsupervised mining of fashion attributes from Instagram text, which can enable a new kind of user recommendation in the fashion domain.

Research Paper

Dec 16, 2018 • K. Hammar et al., 2018

Snorkel: Fast Training Set Generation for Information Extraction

Introducing Snorkel, a new system for quickly creating, managing, and modeling training datasets.

Research Paper

Dec 20, 2017 • A. Ratner et al., 2017

Learning to Compose Domain-Specific Transformations for Data Augmentation

Automating data augmentation by learning a generative sequence model over user-specified transformation functions.

Research Paper

Dec 19, 2017 • A. Ratner et al., 2017

Learning the Structure of Generative Models Without Labeled Data

Proposing a structure estimation method that is 100x faster than a maximum likelihood approach for training data.

Research Paper

Dec 18, 2017 • S. Bach et al., 2017

Inferring Generative Model Structure With Static Analysis

Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects the quality of the training labels, but is difficult to learn without any ground truth labels. We instead rely on weak supervision sources having some structure by virtue of being encoded programmatically. We present Coral, a paradigm that infers generative model structure by statically analyzing the code for these heuristics, thus significantly reducing the amount of data required to learn structure. We...

Research Paper

Dec 17, 2017 • P. Varma et al., 2017
