We develop methods, benchmarks, and training systems that turn expert data into frontier AI
Featured research
Vision and impact
We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.
Benchmarking & Evaluation
Build benchmarks that define and advance the AI frontier
Scaling Subject Matter Expertise
Define how subject matter experts encode their knowledge into data
RL, Training, & Data Valuation
Drive dataset development based on feedback from RL and model training
Community and open science
Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants
Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Reading Group
Deep research expertise
Technical advisors and distinguished affiliates
Browse research blogs and academic papers
Demonstrating in synthetic and real-world experiments how two simple labeling function acquisition strategies outperform a random baseline.
This paper presents a framework called search, label, and propagate (SLP) for bootstrapping intents from existing chat logs using weak supervision.
Describing GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms.
This work develops a rule-based NLP algorithm that automatically generates labels for the training data, and then uses pre-trained word embeddings as deep representation features for training machine learning models.
Introducing BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision.
This paper describes Snorkel, a system that enables users to help shape, create, and manage training data for Software 2.0 stacks.
Presenting Snorkel MeTaL, an end-to-end system for multi-task learning.
Introducing Fonduer, a machine-learning-based KBC system for richly formatted data.
This paper showcases methods for unsupervised mining of fashion attributes from Instagram text, which can enable a new kind of user recommendation in the fashion domain.










