arXiv Preprint | 2026

Agents’ Last Exam

Yiyou Sun, Dawn Song, et al. (UC Berkeley RDI) with contributions from Snorkel AI's Amanda Dsouza and Vincent Sunn Chen

Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet
these gains have not translated into economically meaningful deployment across
many professional domains. We argue that this gap is largely an evaluation problem:
widely used benchmarks lack sustained performance measurement on real and
economically valuable workflows. This paper introduces Agents’ Last Exam
(ALE), a benchmark designed to evaluate AI agents on long-horizon, economically
valuable, real-world tasks with verifiable outcomes. Developed in collaboration
with 250+ industry experts, ALE covers non-physical industries defined with
reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is
organized around a task taxonomy with 55 subfields grouped into 13 industry
clusters covering 1K+ tasks. Current results show that the hardest tier remains
far from saturated: across mainstream harness and backbone configurations, the
average full pass rate is 2.6%. ALE is designed as a living benchmark: its task
pool grows continuously as new workflows and industries are onboarded. More
broadly, ALE is intended not merely as another leaderboard, but as an instrument
for closing the gap between benchmark success and GDP-relevant impact.
