Our research lab

From cutting-edge research to enterprise and frontier impact

Our research team advances the science of data-centric AI in partnership with leading enterprises and frontier labs. We translate these breakthroughs into production, powering the next generation of AI systems across industries, research, and government.

Deep research roots

Snorkel was born out of the Stanford AI Lab in 2019. In collaboration with leading research institutions, Snorkel-affiliated researchers have published more than 170 peer-reviewed research papers on weak supervision, AI data development techniques, foundation models, and more, earning special recognition at venues such as NeurIPS, ICML, and ICLR. Our researchers are closely affiliated with academic institutions including Stanford University, the University of Washington, Brown University, and the University of Wisconsin-Madison.


Featured benchmarks

Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.

These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.

Leaderboards

Challenging benchmarks for models and agents

Snorkel benchmarks are built with human expertise to test models on realistic tasks ranging from coding and financial analysis to healthcare and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks germane to the insurance industry.

See the latest scores on our Agentic coding leaderboard.
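
To make "multi-turn agentic tasks" concrete, here is a minimal Python sketch of how such a task and its scoring might be represented. Every name and field below is an illustrative assumption, not our actual benchmark schema.

```python
# Illustrative only: a minimal sketch of a multi-turn agentic benchmark
# task and its scoring. All names and fields are hypothetical, not
# Snorkel's actual benchmark schema.
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgenticTask:
    task_id: str
    domain: str                 # e.g. "insurance-underwriting"
    instructions: str           # the scenario given to the agent
    available_tools: list[str]  # tools the agent may call
    max_turns: int = 20         # budget for the multi-turn episode


@dataclass
class TurnRecord:
    role: str      # "agent" or "environment"
    content: str   # message, tool call, or tool result


# An expert-authored check inspects the whole transcript, not just the
# final answer, so intermediate tool use can be graded too.
Check = Callable[[AgenticTask, list[TurnRecord]], bool]


def score_episode(task: AgenticTask,
                  transcript: list[TurnRecord],
                  checks: list[Check]) -> float:
    """Return the fraction of expert-authored checks the episode passes."""
    if not checks:
        return 0.0
    return sum(check(task, transcript) for check in checks) / len(checks)
```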

Rubrics

Aligning human expertise and automated evaluation

We investigate how to scalably develop rubrics that comprehensively cover the desired agentic capabilities and can be assessed reliably by both human experts and AI judges.

Learn more about our findings.
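
As a rough illustration of the pattern, the sketch below scores a response against a rubric one criterion at a time, with the judge supplied as a callable. The names and signatures are assumptions for exposition, not our production tooling.

```python
# Illustrative only: rubric-based evaluation with a pluggable judge.
# Names and signatures are assumptions, not Snorkel's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str          # e.g. "cites the correct policy clause"
    description: str   # precise, checkable wording reduces judge disagreement


def evaluate(response: str,
             rubric: list[Criterion],
             judge: Callable[[str, Criterion], bool]) -> dict[str, bool]:
    """Score a response against each rubric criterion with the given judge."""
    return {c.name: judge(response, c) for c in rubric}
```

Because the judge is just a callable, the same rubric can be scored by human experts and by an AI judge, and per-criterion disagreement between the two result sets directly measures how reliably the rubric is assessed.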

RL Environments

Environments give agents a fully realized simulation

As tool calling and more open-ended application requirements outgrow simple test frameworks, agent validation demands techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.

Read more in our blog post.
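
The sketch below shows the generic shape of a containerized evaluation environment, assuming a local Docker CLI: each episode runs in a fresh container so state cannot leak between tasks. The class and method names are hypothetical and are not Terminal-Bench's actual interface; see tbench.ai for the real task format and harness.

```python
# Illustrative only: the generic shape of a containerized evaluation
# environment. Class and method names are hypothetical; assumes a
# local Docker CLI is installed and running.
import subprocess
import uuid


class ContainerEnv:
    """Run each episode in a fresh container so failures cannot leak
    state between tasks and results stay reproducible."""

    def __init__(self, image: str):
        self.image = image
        self.name = f"eval-{uuid.uuid4().hex[:8]}"

    def start(self) -> None:
        # Keep a throwaway container alive for the episode.
        subprocess.run(["docker", "run", "-d", "--name", self.name,
                        self.image, "sleep", "infinity"], check=True)

    def exec(self, command: str) -> str:
        """Execute an agent-issued shell command inside the container."""
        result = subprocess.run(["docker", "exec", self.name,
                                 "sh", "-c", command],
                                capture_output=True, text=True)
        return result.stdout + result.stderr

    def stop(self) -> None:
        # Tear the container down so the next episode starts clean.
        subprocess.run(["docker", "rm", "-f", self.name], check=True)
```

Tearing the container down after every episode is what makes results reproducible: a crashed agent or a half-applied filesystem change cannot contaminate the next run.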

Browse blog posts and 100+ peer-reviewed academic papers

Research Paper
NEW

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Blog

Benchmarks should shape the frontier, not just measure it

Research Paper

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

Blog

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

Blog

Coding agents don’t need to be perfect, they need to recover

Blog

Closing the Evaluation Gap in Agentic AI

Research Paper

Benchmarking Agents in Insurance Underwriting Environments

Research Paper

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Blog

SlopCodeBench: Measuring Code Erosion as Agents Iterate

Blog

Introducing the Snorkel Agentic Coding Benchmark

Backed by a $3M commitment, the Open Benchmarks Grants program, in partnership with Hugging Face, Prime Intellect, Together AI, Factory HQ, and Harbor, funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Applications are accepted on a rolling basis, starting March 1st.

Coming Fall 2026

Frontier Data Summit is on the horizon

Get updates on grants, deadlines, and the Frontier Data Summit.