

A large-scale benchmark testing whether AI agents can complete real, economically valuable professional work. Snorkel AI is a contributor and supporter through its Open Benchmarks Grants program.
| Figure | What it counts |
|---|---|
| 2.6% | Top-agent pass rate on the hardest tier |
| 55 | Non-physical industries covered |
| 1,500+ | Real, code-graded tasks |
| 250+ | Experts across 100+ institutions |
Agents’ Last Exam (ALE) is a large-scale AI agent benchmark that measures whether autonomous agents can complete real professional work, not textbook problems. Instead of scripting an agent step by step, ALE hands a frontier agent a real task on a real machine, lets it work to completion, and scores the artifacts it leaves behind against verifiable success criteria.
Each task is a genuine project that a domain expert has already shipped, converted into a code-graded, fully reproducible test. No human judges are involved, which keeps results objective and directly comparable across very different kinds of work. The benchmark is co-led by UC Berkeley RDI and built with 250+ experts across 100+ institutions.
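To make the idea of code-graded deliverables concrete, here is a minimal sketch of what such a grader could look like. It is not ALE's actual harness: the names `Criterion`, `GradeResult`, and `grade` are hypothetical, and treating a pass as "every criterion met" and a score as "the fraction of criteria met" is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Criterion:
    """One verifiable success criterion (hypothetical structure)."""
    name: str
    check: Callable[[Path], bool]  # receives the directory of artifacts


@dataclass
class GradeResult:
    passed: bool        # assumption: True only if every criterion holds
    score: float        # assumption: fraction of criteria satisfied
    failed: List[str]   # names of the criteria that did not hold


def grade(artifact_dir: Path, criteria: List[Criterion]) -> GradeResult:
    """Run every check against the artifacts the agent left behind."""
    failed = []
    for criterion in criteria:
        try:
            ok = bool(criterion.check(artifact_dir))
        except Exception:
            # A check that crashes (e.g. missing file) counts as a failure.
            ok = False
        if not ok:
            failed.append(criterion.name)
    met = len(criteria) - len(failed)
    score = met / len(criteria) if criteria else 0.0
    passed = bool(criteria) and not failed
    return GradeResult(passed=passed, score=score, failed=failed)
```

A caller would build one `Criterion` per deliverable check (for example, "the report file exists" or "the CSV header matches the spec") and pass the agent's output directory to `grade`. Because every check is ordinary code, the same artifacts always produce the same result.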
Forecasts increasingly claim that AI agents will outperform humans at almost all jobs by 2026 to 2027. Agents’ Last Exam was built to test that claim on real, labor-market-aligned work, and the results are a reality check.
The best-performing agent passes just 26% of tasks overall, and the average full pass rate on the hardest “Last-Exam” tier is only 2.6%. Most agent benchmarks are saturating quickly. ALE moves in the opposite direction by measuring the long-horizon, real-world work that today’s agents still cannot reliably finish.
That gap is exactly the point. The name “Agents’ Last Exam” reflects a simple idea: the day agents saturate ALE is the day they can genuinely power real industries. Worth noting: ALE is a technical instrument that measures task completion, not a direct labor-market forecast. It tracks capability, not economic outcomes.
Until then, it gives researchers a rigorous, reproducible way to see where frontier agents succeed and where they break.
| # | Harness | Model | Pass rate | Score | Total Runtime | Total input tokens | Total output tokens |
|---|---|---|---|---|---|---|---|
| 1 | Codex | gpt-5-5 | 25.0% | 43.0% | 80h 37m | 577.0M | 3.8M |
| 2 | Ale Claw | gpt-5-5 | 23.0% | 45.8% | 47h 20m | 334.5M | 2.4M |
| 3 | Openclaw | gpt-5-5 | 21.1% | 41.0% | 92h 51m | 471.1M | 3.3M |
| 4 | Cursor CLI | gpt-5-5 | 20.7% | 39.6% | 82h 13m | 154.2M | 1.7M |
| 5 | Openclaw | gpt-5-4 | 20.5% | 37.3% | 162h 16m | 545.5M | 8.7M |
| 6 | Cursor CLI | composer-2-5 | 20.4% | 38.5% | 249h 59m | 338.8M | 2.9M |
| 7 | Droid | gpt-5-5 | 19.1% | 38.6% | 88h 10m | 243.2M | 2.3M |
| 8 | Ale Claw | claude-opus-4-7 | 18.4% | 40.5% | 87h 54m | 1.4B | 5.7M |
| 9 | Claude Code | claude-opus-4-8 | 15.8% | 37.2% | 451h 15m | 452.0M | 3.8M |
| 10 | Gemini CLI | gemini-3-1-pro-preview | 15.8% | 32.0% | 272h 28m | 1.2B | 3.5M |
See the full leaderboard for the complete results.
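The leaderboard reports runtime and token counts as display strings such as `80h 37m` and `577.0M`, so comparing cost across entries takes a little parsing. The sketch below is one way to do that for three rows copied from the table; the helper names and the "input tokens per pass-rate point" metric are illustrative assumptions, not something ALE itself reports.

```python
import re

_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_runtime_hours(text: str) -> float:
    """Convert a display string like '80h 37m' into fractional hours."""
    match = re.fullmatch(r"\s*(\d+)h\s+(\d+)m\s*", text)
    if match is None:
        raise ValueError(f"unrecognised runtime: {text!r}")
    return int(match.group(1)) + int(match.group(2)) / 60


def parse_tokens(text: str) -> float:
    """Convert a display string like '577.0M' or '1.4B' into a token count."""
    text = text.strip()
    multiplier = _SUFFIX.get(text[-1].upper())
    if multiplier is None:
        return float(text)  # no suffix: already a plain number
    return float(text[:-1]) * multiplier


# (harness, model, pass rate in %, runtime, input tokens), copied from the table
rows = [
    ("Codex", "gpt-5-5", 25.0, "80h 37m", "577.0M"),
    ("Ale Claw", "gpt-5-5", 23.0, "47h 20m", "334.5M"),
    ("Droid", "gpt-5-5", 19.1, "88h 10m", "243.2M"),
]


def input_tokens_per_point(entries):
    """Hypothetical efficiency metric: input tokens per pass-rate point."""
    result = []
    for harness, _model, pass_rate, runtime, tokens_in in entries:
        hours = parse_runtime_hours(runtime)
        result.append((harness, hours, parse_tokens(tokens_in) / pass_rate))
    # Cheapest entries (fewest tokens per point) first.
    return sorted(result, key=lambda item: item[2])


if __name__ == "__main__":
    for harness, hours, per_point in input_tokens_per_point(rows):
        print(f"{harness}: {hours:.1f} h, {per_point / 1e6:.1f}M input tokens per point")
```

A view like this makes it easier to see that the top pass rate and the lowest cost do not necessarily belong to the same entry.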
ALE spans 55 non-physical sub-industries, grounded in the O*NET/SOC 2018 federal occupational taxonomy, covering most major categories of professional work performed on a computer.
| Field | Example domains |
|---|---|
| Business & Finance | Quant finance & trading, accounting & risk modeling, sales & marketing, HR management |
| Engineering & Robotics | Aerospace & mechanical engineering, civil & geospatial, robotics & autonomous systems |
| Life Sciences & Medicine | Genomics & sequence analysis, cell & imaging biology, clinical informatics, neuroimaging |
| Physical Sciences | Physics, chemistry & astrophysics, earth & atmospheric sciences, materials design |
| Computing & Math | Software engineering, data & analytics, AI engineering, cybersecurity, quantum computing |
| Media, Arts & Design | Motion & VFX, 3D animation, graphic & visual design, fashion & apparel |
| Legal, Education & Social | Legal research, educational technology, library & information science, social sciences |
ALE targets the Generalist Computer-Use Agent (GCUA): a system given full access to both a graphical interface (GUI) and a command line (CLI). The benchmark does not constrain how an agent solves a task. Whatever a human could do on a computer (click, type, script, browse, automate) the agent is free to do, and it is judged on the result rather than the method.
Tasks ship across three difficulty tiers (near-term, full-spectrum, and last-exam), plus an unlicensed track and ALE-CLI, a terminal-only subset for teams running command-line agents.
Current leaderboard (harness + flagship model): Codex (gpt-5-5) leads at roughly 26% overall, followed by Cursor (composer-2-5) and Claude Code (opus-4-8). Nobody is close to passing, and the exam is still open.
ALE collects industry production tasks executed with professional-grade tools and software, not simplified chat interactions. Strong tasks meet three criteria:
Every submission is defined by five components: task description, input, software, output, and evaluation.
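As a rough illustration of the five components, a submission could be modelled as a small record with one field per component. This is only a sketch: the class name `TaskSubmission`, the field types, and the example values are all hypothetical, and ALE's real submission format may look quite different.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class TaskSubmission:
    """Hypothetical record mirroring the five components named above."""
    task_description: str   # what the agent is asked to deliver
    inputs: List[str]       # files or data handed to the agent
    software: List[str]     # professional tools the task relies on
    outputs: List[str]      # artifacts the agent must produce
    evaluation: str         # how the outputs are graded in code


def missing_components(task: TaskSubmission) -> List[str]:
    """Return the names of any components left empty."""
    return [f.name for f in fields(task) if not getattr(task, f.name)]


# Invented example values, for illustration only.
example = TaskSubmission(
    task_description="Rebuild the quarterly risk report from raw trade data.",
    inputs=["trades.csv"],
    software=["python", "spreadsheet editor"],
    outputs=["risk_report.xlsx"],
    evaluation="",  # deliberately left empty to show the check
)

if __name__ == "__main__":
    print(missing_components(example))
```

Checking for empty components up front is a cheap way to catch an incomplete submission, such as one with no evaluation, before anyone tries to run it.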
Snorkel AI is proud to support Agents’ Last Exam both as a contributor, with Snorkel researchers among the co-authors, and through its Open Benchmarks Grants program, which funds open-source AI research and rigorous, reproducible evaluation. Real-world evaluation is the bottleneck for trustworthy AI agents, and ALE is exactly the kind of open benchmark that moves the field forward.
This builds on Snorkel’s broader work in data-centric AI research and agent evaluation, including contributions to benchmarks such as Terminal-Bench Science and JudgmentBench.