

Tom Walshe is a Staff Research Scientist at Snorkel AI. Before Snorkel, Tom worked in LegalTech and finance services, where he focussed on building end-to-end AI systems and researching data-centric AI. Prior to industry, Tom completed a PhD in Computer Science from the University of Oxford.
The latest from Tom Walshe


Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…


Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.


Ensure your LLMs align with your values and goals using LLM alignment techniques. Learn how to mitigate risks and optimize performance.


Snorkel takes a step on the path to enterprise superalignment with new data development workflows for enterprise alignment



