Snorkel at AI Engineer Europe
Join Snorkel at this three-day conference for AI engineers and builders focused on production-ready AI systems – and come meet the team.
The Art & Science of Benchmarking Agents
12:40–1:00 PM
Location: Moore
Our ability to measure AI has been outpaced by our ability to develop it, and this evaluation gap is one of the most important problems in AI. We need more enduring benchmarks to close it – and, in doing so, to open entirely new vectors of capability for the field.
In this talk, I'll share lessons from evaluating agents, drawing on experience working with nearly all of the global frontier labs and leading academics. We'll discuss both the science (the mechanics that make benchmarks rigorous and effective) and the art (the intangibles behind ambitious, enduring benchmarks) of building great benchmarks.
I'll close with early lessons from Open Benchmarks Grants – a $3M initiative in partnership with Hugging Face, Together AI, Prime Intellect, Factory, and others – and highlight some of the projects we're most excited to fund.

Task Fidelity Scaling Laws
Location: Shelley
Improving LLM agents isn’t just about bigger models or more data – task fidelity often matters more. Experiments show that fine-tuning on a small set of well-designed tasks can outperform training on many low-quality ones, because ambiguous specs teach models the wrong behaviors. The takeaway: fix your tasks before scaling your models.

Stop Making Models Bigger. Make Them Behave
Location: Wordsworth
Bigger models often reason better, but they don’t always behave better – especially with tools. This talk shows how a 4B model was fine-tuned to outperform a 235B model on financial analysis tasks by learning strong tool discipline with reinforcement learning, demonstrating that better behavior – not bigger models – can drive stronger real-world results.



