From cutting-edge research to enterprise and frontier impact
Deep research roots
Snorkel was born out of the Stanford AI Lab in 2019. In collaboration with leading research institutions, Snorkel-affiliated researchers have published more than 170 peer-reviewed research papers on weak supervision, AI data development techniques, foundation models, and more, earning special recognition at venues such as NeurIPS, ICML, and ICLR. Our researchers are closely affiliated with academic institutions including Stanford University, University of Washington, Brown University, and the University of Wisconsin-Madison.
Featured benchmarks
Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.
These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.
Agentic Coding
Finance Reasoning
Leaderboards
Challenging benchmarks for models and agents
Snorkel benchmarks are built with human expertise to test models on realistic tasks spanning coding, financial analysis, healthcare, and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks relevant to the insurance industry.
Rubrics
Aligning human expertise and automated evaluation
We investigate how to scalably develop rubrics that comprehensively cover the desired agentic capabilities and can be reliably assessed by both human experts and AI judges.
RL Environments
Environments give agents a fully realized simulation
As tool calling and more open-ended application requirements outgrow simple test frameworks, agent validation requires techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.
Browse blog posts and 100+ peer-reviewed academic papers
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Benchmarks should shape the frontier, not just measure it
RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory
Building FinQA: An Open RL Environment for Financial Reasoning Agents
How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks
Coding agents don’t need to be perfect, they need to recover
Closing the Evaluation Gap in Agentic AI
Benchmarking Agents in Insurance Underwriting Environments
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
SlopCodeBench: Measuring Code Erosion as Agents Iterate
Introducing the Snorkel Agentic Coding Benchmark

Backed by a $3M commitment, the Open Benchmarks Grants program, in partnership with Hugging Face, Prime Intellect, Together AI, Factory HQ, and Harbor, funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.
Applications are accepted on a rolling basis, starting March 1.