Snorkel Expert
Data-as-a-Service Leaderboard
We built these leaderboards to test frontier LLMs on a variety of expert-level, domain-specific agentic AI tasks. The leaderboards use Snorkel Expert Data-as-a-Service to create specialized, high-quality datasets, powered by Snorkel's global network of experts across thousands of domains, spanning academic and PhD-level topics, professional domains, and consumer and lifestyle areas.
Featured benchmarks
Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.
These are just a few of our featured benchmarks. New ones are added regularly, so check back often to see the latest from our research team.
SnorkelUnderwrite
An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
Finance Reasoning
A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions that require tool calling and planning.
SnorkelSequences
A procedurally generated and expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
SnorkelSpatial
A procedurally generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.
SnorkelWordle
A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.
SnorkelGraph
A procedurally generated and expert-verified benchmark for evaluating the mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.
SnorkelFinance
A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.
Performance per dollar
Discover which models deliver the best performance per dollar spent.
Choose a domain and a cost basis: output tokens (cost per million output tokens generated) or input tokens (cost per million input tokens processed).
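As a rough illustration of how per-million-token prices can be turned into a performance-per-dollar figure, here is a minimal Python sketch. It is not the leaderboard's actual formula, which this page does not state. The `ModelRun` record, its field names, and the "score divided by total cost" definition are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Hypothetical record of one model's benchmark run (illustrative fields only)."""
    name: str
    score: float               # benchmark score, e.g. accuracy in [0, 1]
    input_tokens: int          # total input tokens processed during the run
    output_tokens: int         # total output tokens generated during the run
    input_price_per_m: float   # USD per million input tokens
    output_price_per_m: float  # USD per million output tokens


def run_cost(run: ModelRun) -> float:
    """Total USD cost of the run, computed from per-million-token prices."""
    input_cost = run.input_tokens / 1_000_000 * run.input_price_per_m
    output_cost = run.output_tokens / 1_000_000 * run.output_price_per_m
    return input_cost + output_cost


def score_per_dollar(run: ModelRun) -> float:
    """Assumed metric: benchmark score divided by total run cost in USD."""
    cost = run_cost(run)
    if cost <= 0:
        raise ValueError("run cost must be positive")
    return run.score / cost


def rank_by_value(runs: list[ModelRun]) -> list[ModelRun]:
    """Sort runs from best to worst score per dollar."""
    return sorted(runs, key=score_per_dollar, reverse=True)
```

Under this assumed definition, a cheaper model with a slightly lower score can rank above a more expensive, higher-scoring one, which is the trade-off a performance-per-dollar view is meant to expose.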
Compare models
Select two models to compare their performance across all benchmark categories.
Snorkel Expert Data-as-a-Service
Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.