We develop methods, benchmarks, and training systems that turn expert data into frontier AI
Featured research
Vision and impact
We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.
Benchmarking & Evaluation
Build benchmarks that define and advance the AI frontier
Scaling Subject Matter Expertise
Define how subject matter experts encode their knowledge into data
RL, Training, & Data Valuation
Drive dataset development based on feedback from RL and model training
Community and open science
Open benchmarks, conversations, and research for real-world AI performance.


Open Benchmarks Grants
Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.


Benchtalks


Reading Group
Deep research expertise
Technical advisors and distinguished affiliates
Browse research blogs and academic papers
Presenting Snorkel MeTaL, an end-to-end system for multi-task learning.
Introducing Fonduer, a machine-learning-based KBC system for richly formatted data.
This paper showcases methods for unsupervised mining of fashion attributes from Instagram text, which can enable a new kind of user recommendation in the fashion domain.
Introducing Snorkel, a new system for quickly creating, managing, and modeling training datasets.
Automating data augmentation by learning a generative sequence model over user-specified transformation functions.
Proposing a structure estimation method for generative models of training data that is 100x faster than a maximum likelihood approach.
Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects the quality of the training labels, but is difficult to learn without any ground truth labels. We instead rely on weak supervision…
Introducing SwellShark, a framework for building biomedical named entity recognition (NER) systems quickly.
A challenge in training discriminative models like neural networks is obtaining enough labeled training data. Recent approaches use generative models to combine weak supervision sources, like user-defined heuristics or knowledge bases, to label training data. Prior work has explored learning accuracies for these sources even without ground truth labels, but they assume that a single accuracy parameter is sufficient to…