Agents' Last Exam
A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.
Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.
ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation: roughly every six months, a new public subset is published with fresh instances, private tasks rotate in, and retired public tasks rotate out, all to limit benchmark leakage.
Leaderboard
| Rank | Harness | Model | Pass Rate | Score | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|
| 1 | Codex | gpt-5-5 | 24% | 42.8% | 369h 50m | 1.6B | 7.2M |
| 2 | ALE Claw | gpt-5-5 | 23% | 45.8% | 47h 20m | 334.5M | 2.4M |
| 3 | Claude Code | claude-fable-5 | 22% | 40.5% | 197h 38m | 886.6M | 9.6M |
| 4 | OpenClaw | gpt-5-5 | 21.1% | 41% | 92h 51m | 471.1M | 3.3M |
| 5 | Cursor CLI | gpt-5-5 | 20.7% | 39.6% | 82h 13m | 154.2M | 1.7M |
| 6 | OpenClaw | gpt-5-4 | 20.5% | 37.3% | 162h 16m | 545.5M | 8.7M |
| 7 | Cursor CLI | composer-2-5 | 20.4% | 38.5% | 249h 59m | 338.8M | 2.9M |
| 8 | Droid | gpt-5-5 | 19.1% | 38.6% | 88h 10m | 243.2M | 2.3M |
| 9 | ALE Claw | claude-opus-4-7 | 18.4% | 40.5% | 87h 54m | 1.4B | 5.7M |
| 10 | Claude Code | claude-opus-4-8 | 15.8% | 37.2% | 451h 15m | 452.0M | 3.8M |
| 11 | Gemini CLI | gemini-3-1-pro-preview | 15.8% | 32% | 272h 28m | 1.2B | 3.5M |
| 12 | OpenClaw | claude-opus-4-7 | 15.1% | 34.6% | 143h 19m | 833.0M | 4.1M |
| 13 | OpenClaw | claude-opus-4-6 | 14.1% | 32.5% | 164h 33m | 441.2M | 4.2M |
| 14 | OpenClaw | gemini-3-1-pro-preview | 14.1% | 28.7% | 174h 18m | 3.6B | 4.0M |
| 15 | Claude Code | claude-opus-4-7 | 13.2% | 35.1% | 50h 38m | 456.4M | 3.7M |
| 16 | Droid | claude-opus-4-7 | 12.8% | 31% | 35h 54m | 356.5M | 2.8M |
| 17 | OpenClaw | deepseek-v4-pro | 12.4% | 27.6% | 233h 3m | 893.3M | 8.7M |
| 18 | OpenClaw | qwen3-7-max | 11.8% | 31.1% | 190h 45m | 1.4B | 17.6M |
| 19 | ALE Claw | gpt-5-4 | 11.8% | 28.2% | 65h 6m | 1.1B | 2.1M |
| 20 | OpenClaw | glm-5-1 | 11.5% | 28.2% | 321h 11m | 1.4B | 11.4M |
| 21 | OpenClaw | kimi-k2-6 | 9.2% | 21.7% | 292h 52m | 453.4M | 9.3M |
| 22 | OpenClaw | qwen3-6-plus | 8.6% | 24.3% | 258h 22m | 1.2B | 12.6M |
| 23 | OpenClaw | mimo-v2-5 | 8.6% | 23.6% | 194h 48m | 730.8M | 7.2M |
| 24 | Codex | gpt-5-4 | 7.2% | 12.8% | 49h 6m | 210.7M | 3.3M |
| 25 | Grok CLI | grok-4-3 | 6.6% | 20.1% | 62h 38m | 232.4M | 2.4M |
| 26 | OpenClaw | minimax-m2-7 | 5.9% | 14.2% | 190h 12m | 367.5M | 6.0M |
| 27 | Grok CLI | grok-3 | 4.6% | 12.6% | 32h 11m | 55.7M | 516.5K |
| 28 | OpenClaw | grok-4-3 | 4.3% | 15.5% | 176h 50m | 311.9M | 5.0M |
| 29 | Gemini CLI | gemini-3-5-flash | 0% | 0% | 8m 28s | 2.2M | 38.8K |
| 30 | OpenClaw CLI | qwen3-7-max | 0% | 0% | 1h 36m | 1.4M | 10.9K |
Sample tasks
A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.
business_finance
sec_10k_financial_parsing
Parse a SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.
business_finance
Clinical_Variant_Annotation
Methodology
Metrics
Pass Rate: the fraction of tasks the agent fully completed (strict success). Score: the average graded outcome across all tasks, including partial credit. Both are computed by deterministic graders against hidden references.
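The relationship between the two metrics can be illustrated with a minimal sketch. It assumes each task's grader emits a single outcome in [0, 1] and that strict success means full credit (an outcome of 1.0); neither detail is stated above, and the `TaskResult` type, the function names, and the task IDs are hypothetical, not part of the ALE framework.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str   # hypothetical task identifier
    score: float   # graded outcome in [0, 1]; partial credit allowed (assumed scale)


def pass_rate(results, threshold=1.0):
    """Fraction of tasks with full credit (assumed definition of strict success)."""
    if not results:
        return 0.0
    return sum(r.score >= threshold for r in results) / len(results)


def mean_score(results):
    """Average graded outcome across all tasks, including partial credit."""
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)


# Made-up results for illustration only; these are not leaderboard data.
results = [
    TaskResult("task_a", 1.0),
    TaskResult("task_b", 0.5),
    TaskResult("task_c", 0.0),
    TaskResult("task_d", 1.0),
]

print(f"Pass Rate: {pass_rate(results):.1%}")  # Pass Rate: 50.0%
print(f"Score: {mean_score(results):.1%}")     # Score: 62.5%
```

Under these assumptions Score can never fall below Pass Rate, since every fully passed task contributes full credit to the average. This matches the leaderboard, where Score is at or above Pass Rate in every row.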
Roughly every six months, a new public subset is released with fresh instances. To limit benchmark leakage, private tasks rotate into the public pool, retired public tasks rotate out, and held-out private tasks are used to score the official leaderboard.
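One way to picture a single rotation cycle is the sketch below. It is not ALE's actual procedure: the random selection, the pool sizes, the `rotate_pools` function, and the task IDs are all assumptions made for illustration, and the generation of fresh instances is not modeled.

```python
import random


def rotate_pools(public, private, n_retire, n_promote, seed=0):
    """Return (new_public, new_private, retired) after one hypothetical cycle.

    Retires n_retire public tasks and promotes n_promote private tasks into
    the public pool. Random selection is an assumption for this sketch.
    """
    rng = random.Random(seed)
    # Sort before sampling so the result is reproducible for a given seed.
    retired = set(rng.sample(sorted(public), n_retire))
    promoted = set(rng.sample(sorted(private), n_promote))
    new_public = (public - retired) | promoted
    new_private = private - promoted  # remaining tasks stay held out
    return new_public, new_private, retired


# Hypothetical task IDs, not real ALE slugs.
public = {"pub_1", "pub_2", "pub_3"}
private = {"priv_1", "priv_2", "priv_3", "priv_4"}

new_public, new_private, retired = rotate_pools(public, private, n_retire=1, n_promote=2)
print(len(new_public), len(new_private), len(retired))  # 4 2 1
```

In this sketch a retired task never returns to either pool, so a task that has been public is never reused for held-out scoring.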
Reference Harnesses
Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE Claw reference harness.
Resources
Acknowledgments
Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.
The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).

