From cutting-edge research to enterprise and frontier impact
Deep research roots
Snorkel was born out of the Stanford AI Lab in 2019. In collaboration with leading research institutions, Snorkel-affiliated researchers have published more than 170 peer-reviewed research papers on weak supervision, AI data development techniques, foundation models, and more, earning special recognition at venues such as NeurIPS, ICML, and ICLR. Our researchers are closely affiliated with academic institutions including Stanford University, University of Washington, Brown University, and the University of Wisconsin-Madison.
Featured benchmarks
Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.
These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.
Agentic Coding
Finance Reasoning
Leaderboards
Challenging benchmarks for models and agents
Snorkel benchmarks are built with human expertise to test models on realistic tasks spanning coding, financial analysis, healthcare, and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks relevant to the insurance industry.
Rubrics
Aligning human expertise and automated evaluation
We investigate how to scalably develop rubrics that comprehensively cover the desired agentic capabilities and can be reliably assessed by both human experts and AI judges.
RL Environments
Environments give agents a fully realized simulation
As tool calling and more open-ended application requirements outgrow simple test frameworks, agent validation requires techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.
Browse blog posts and 100+ peer-reviewed academic papers
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Benchmarks should shape the frontier, not just measure it
RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory
Building FinQA: An Open RL Environment for Financial Reasoning Agents
How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks
Coding agents don’t need to be perfect, they need to recover
Closing the Evaluation Gap in Agentic AI
Benchmarking Agents in Insurance Underwriting Environments
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
SlopCodeBench: Measuring Code Erosion as Agents Iterate
Introducing the Snorkel Agentic Coding Benchmark

Backed by a $3M commitment, the Open Benchmarks Grants program, in partnership with Hugging Face, Prime Intellect, Together AI, Factory HQ, and Harbor, funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.
Applications are accepted on a rolling basis, starting March 1.