Introducing Open Benchmarks Grants, a $3M commitment to open benchmarks. Apply now
From cutting-edge research to enterprise and frontier impact
Deep research roots
Snorkel was born out of the Stanford AI Lab in 2019. In collaboration with leading research institutions, Snorkel-affiliated researchers have published more than 170 peer-reviewed papers on weak supervision, AI data development techniques, foundation models, and more, earning special recognition at venues such as NeurIPS, ICML, and ICLR. Our researchers are closely affiliated with academic institutions including Stanford University, the University of Washington, Brown University, and the University of Wisconsin-Madison.




Featured benchmarks
Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.
These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.
Agentic Coding
SnorkelUnderwrite
Finance Reasoning
Leaderboards
Challenging benchmarks for models and agents
Snorkel benchmarks are built with human expertise to test models on realistic tasks ranging from coding and financial analysis to healthcare and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks germane to the insurance industry.
Rubrics
Aligning human expertise and automated evaluation
We investigate how to develop rubrics at scale that comprehensively cover the desired agentic capabilities and can be reliably assessed by both human experts and AI judges.
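To illustrate the pattern (this is a minimal sketch, not Snorkel's actual implementation), the code below scores a model response against a small rubric with two interchangeable judges, one standing in for a human expert and one for an AI judge, and reports how often they agree. All names here (Criterion, score_response, agreement_rate, the example criteria) are hypothetical.

```python
# Minimal sketch of rubric-based evaluation with human and AI judges.
# Illustrative only; names and criteria are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Criterion:
    name: str          # e.g. "grounded", "complete"
    description: str   # instruction shown to the human expert or AI judge

# A judge maps (response, criterion) -> 1.0 (pass) or 0.0 (fail).
Judge = Callable[[str, Criterion], float]

def score_response(response: str, rubric: List[Criterion], judge: Judge) -> Dict[str, float]:
    """Apply one judge to every criterion in the rubric."""
    return {c.name: judge(response, c) for c in rubric}

def agreement_rate(human: Dict[str, float], ai: Dict[str, float]) -> float:
    """Fraction of criteria where the AI judge matches the human expert."""
    return sum(human[k] == ai[k] for k in human) / len(human)

if __name__ == "__main__":
    rubric = [
        Criterion("grounded", "Answer only uses facts from the provided documents."),
        Criterion("complete", "Answer addresses every part of the question."),
    ]
    # Stub judges: in practice the AI judge would be an LLM call and the
    # human scores would come from expert annotation.
    human_judge: Judge = lambda response, criterion: 1.0
    ai_judge: Judge = lambda response, criterion: 1.0 if criterion.name == "grounded" else 0.0

    response = "The policy covers water damage per section 4.2."
    h = score_response(response, rubric, human_judge)
    a = score_response(response, rubric, ai_judge)
    print(f"AI-human agreement: {agreement_rate(h, a):.0%}")
```

Measuring agreement per criterion, rather than on an overall score, is what makes it possible to tell which parts of a rubric an AI judge can be trusted with and which still need expert review.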
RL Environments
Environments give agents a fully realized simulation
As tool calling and more open-ended application requirements outgrow simple test frameworks, agents must be validated with techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.
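For illustration only, the sketch below shows the core loop of a containerized evaluation: run an agent-produced shell command in a throwaway Docker container and verify the outcome. It assumes Docker is installed locally and uses a generic Alpine image; it is not the Terminal-Bench harness or API, and the task and verifier are hypothetical.

```python
# Minimal sketch of a containerized agent-evaluation step (illustrative only).
import subprocess

def run_in_container(image: str, command: str, timeout: int = 60) -> subprocess.CompletedProcess:
    """Execute a shell command in a fresh, disposable container (requires Docker)."""
    return subprocess.run(
        ["docker", "run", "--rm", image, "sh", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )

if __name__ == "__main__":
    # Hypothetical task: the agent was asked to count the entries in /etc.
    result = run_in_container("alpine:3.19", "ls /etc | wc -l")
    # Task-specific verifier: clean exit and a numeric answer.
    passed = result.returncode == 0 and result.stdout.strip().isdigit()
    print("passed" if passed else "failed", result.stdout.strip())
```

Running each trial in a fresh container keeps runs isolated and reproducible, which is what allows open-ended, tool-using agents to be evaluated under realistic variability rather than against a fixed expected string.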
Browse blog posts and 100+ peer-reviewed academic papers
Terminal-Bench 2.0: Raising the bar for AI agent evaluation
The self-critique paradox: Why AI verification fails where it’s needed most
Part V: Future direction and emerging trends
Shrinking the generation-verification gap with weak verifiers
Data quality and rubrics: how to build trust in your models
Theoretical Physics Benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics
WONDERBREAD: a benchmark for evaluating multimodal foundation models on business process management tasks
The ALCHEmist: automated labeling 500x cheaper than LLM data annotators
Skill-It! A data-driven skills framework for understanding and training language models

Backed by a $3M commitment, the Open Benchmarks Grants program, in partnership with Hugging Face, Prime Intellect, Together AI, Factory HQ, and Harbor, funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.
Applications are accepted on a rolling basis starting March 1st.



