From cutting-edge research to enterprise and frontier impact
Deep research roots
Featured Benchmarks
These are just a few of our featured benchmarks — new ones are added regularly, so check back often to see the latest from our research team.
SnorkelUnderwrite
Finance Reasoning
SnorkelSequences
Leaderboards
Challenging benchmarks for models and agents
Snorkel benchmarks are built with human expertise to test models on realistic tasks spanning coding, financial analysis, healthcare, and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks germane to the insurance industry.
See the latest scores on our SnorkelUnderwrite leaderboard.
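To make the shape of a multi-turn agentic task concrete, here is a minimal Python sketch. The schema, field names, and exact-match scoring below are illustrative assumptions for explanation only, not the actual SnorkelUnderwrite format or grading method.

```python
# Hypothetical sketch of a multi-turn agentic benchmark record and a toy scorer.
# Field names and the scoring rule are assumptions, not the SnorkelUnderwrite schema.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str

@dataclass
class BenchmarkTask:
    task_id: str
    domain: str
    turns: list[Turn] = field(default_factory=list)
    expected_outcome: str = ""

def score_task(task: BenchmarkTask, final_answer: str) -> float:
    """Toy exact-match scorer; real agentic benchmarks use richer, rubric-based grading."""
    return 1.0 if task.expected_outcome.lower() in final_answer.lower() else 0.0

# Usage: a single-turn insurance task scored against an agent's final answer.
task = BenchmarkTask(
    task_id="underwrite-001",
    domain="insurance",
    turns=[Turn("user", "Assess the risk profile for this commercial property application.")],
    expected_outcome="refer to a senior underwriter",
)
print(score_task(task, "Given the flood-zone exposure, refer to a senior underwriter."))
```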
Rubrics
Aligning human expertise and automated evaluation
We investigate how to scalably develop rubrics that comprehensively cover the desired agentic capabilities and can be reliably assessed by both human experts and AI judges.
Learn more about our findings.
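As an illustration of rubric-based evaluation, the small Python sketch below scores each rubric criterion with a human expert and an automated judge, then measures how often the two agree. The criteria and scores are invented for illustration and are not drawn from our studies.

```python
# Hypothetical rubric-agreement sketch: each criterion is scored independently
# by a human expert and an automated judge; we report the simple agreement rate.
rubric = [
    "Cites the relevant policy clause",
    "Quantifies the financial exposure",
    "Recommends a concrete next action",
]

# Invented scores (1 = criterion satisfied, 0 = not satisfied).
human_scores = dict(zip(rubric, [1, 1, 0]))
judge_scores = dict(zip(rubric, [1, 0, 0]))

agreement = sum(human_scores[c] == judge_scores[c] for c in rubric) / len(rubric)
print(f"Human/judge agreement across {len(rubric)} criteria: {agreement:.0%}")
```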
RL Environments
Environments give agents a fully realized simulation
As tool calling and more open-ended application requirements outgrow simple test frameworks, agents must be validated with techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.
Read more in our blog post.
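As a rough illustration of the idea (not the actual Terminal-Bench harness), the Python sketch below runs an agent's proposed shell commands inside a throwaway Docker container and checks a verifiable outcome. The image, commands, and success check are assumptions made for the example.

```python
# Hedged sketch of containerized agent validation: execute the agent's shell
# commands in an isolated container, then verify the result. Requires Docker.
# Image, commands, and the pass/fail check are hypothetical.
import subprocess

def run_in_container(image: str, commands: list[str]) -> str:
    """Run a command sequence inside a fresh container and return its stdout."""
    script = " && ".join(commands)
    result = subprocess.run(
        ["docker", "run", "--rm", image, "bash", "-lc", script],
        capture_output=True, text=True, timeout=300,
    )
    return result.stdout

# Usage: check that the proposed commands actually produce the target file contents.
output = run_in_container(
    image="python:3.11-slim",
    commands=["echo 'hello' > /tmp/out.txt", "cat /tmp/out.txt"],
)
print("PASS" if "hello" in output else "FAIL")
```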