Agents' Last Exam
A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.






Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.
ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation: roughly every 6 months, a new public subset is published with fresh instances, private tasks rotate in, and retired public tasks rotate out, all to limit benchmark leakage.
Leaderboard
| Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code | claude-opus-4-8 | Max | 27% | 45.2% | $3,985 | 191h 6m | 1.7B | 22.1M |
| 2 | Codex | gpt-5-5 | XHigh | 26.3% | 47.7% | $494 | 54h 54m | 530.2M | 4.2M |
| 3 | Codex | gpt-5-5 | High | 24.3% | 45.3% | $389 | 61h 48m | 412.8M | 3.3M |
| 4 | ALE Claw | gpt-5-5 | High | 23% | 45.8% | $311 | 35h 41m | 334.5M | 2.4M |
| 5 | Claude Code | claude-opus-4-8 | XHigh | 22.4% | 42.3% | $2,815 | 159h 36m | 1.2B | 16.5M |
| 6 | Claude Code | claude-fable-5 | Adaptive | 22% | 40.5% | $2,315 | 113h 14m | 880.0M | 9.4M |
| 7 | OpenClaw | gpt-5-5 | High | 21.1% | 41% | $449 | 83h 1m | 471.1M | 3.3M |
| 8 | Cursor CLI | gpt-5-5 | Medium | 20.7% | 39.6% | $174 | 58h 52m | 148.5M | 1.7M |
| 9 | OpenClaw | gpt-5-4 | High | 20.5% | 37.3% | $276 | 128h 22m | 490.5M | 7.4M |
| 10 | Claude Code | claude-opus-4-8 | High | 20.4% | 38.7% | $1,626 | 86h 0m | 1.8B | 6.4M |
| 11 | Cursor CLI | composer-2-5 | Adaptive | 20.4% | 38.5% | $177 | 90h 33m | 338.8M | 2.9M |
| 12 | Claude Code | seed-2.1-pro | High | 19.5% | 41.4% | $936 | 167h 54m | 2.1B | 8.4M |
| 13 | Droid | gpt-5-5 | High | 19.1% | 38.6% | $244 | 64h 3m | 235.8M | 2.2M |
| 14 | ALE Claw | claude-opus-4-7 | High | 18.4% | 40.5% | $1,144 | 78h 9m | 1.4B | 5.7M |
| 15 | Gemini CLI | gemini-3-1-pro-preview | High | 15.8% | 32% | $2,018 | 96h 32m | 1.2B | 3.5M |
| 16 | OpenClaw | claude-opus-4-7 | High | 15.1% | 34.6% | $1,712 | 124h 49m | 818.4M | 4.0M |
| 17 | OpenClaw | gemini-3-1-pro-preview | High | 14.1% | 28.7% | $2,969 | 147h 12m | 3.4B | 3.7M |
| 18 | Claude Code | claude-opus-4-7 | High | 13.2% | 35.1% | $1,793 | 42h 57m | 456.4M | 3.7M |
| 19 | OpenClaw | seed-2.1-pro | High | 13.1% | 32.2% | $440 | 196h 42m | 1.6B | 20.4M |
| 20 | Droid | claude-opus-4-7 | High | 12.8% | 31% | $1,356 | 28h 16m | 352.1M | 2.7M |
| 21 | OpenClaw | deepseek-v4-pro | High | 12.4% | 27.6% | $275 | 131h 38m | 516.3M | 4.8M |
| 22 | OpenClaw | qwen3-7-max | High | 11.8% | 31.1% | $664 | 104h 37m | 1.4B | 17.6M |
| 23 | ALE Claw | gpt-5-4 | High | 11.8% | 28.2% | $335 | 55h 32m | 1.1B | 2.1M |
| 24 | OpenClaw | glm-5-1 | High | 11.5% | 28.2% | $475 | 172h 24m | 818.7M | 6.4M |
| 25 | OpenClaw | kimi-k2-6 | High | 9.2% | 21.7% | $124 | 183h 58m | 296.3M | 6.1M |
| 26 | OpenClaw | qwen3-6-plus | High | 8.6% | 24.3% | $255 | 139h 43m | 743.1M | 7.2M |
| 27 | OpenClaw | mimo-v2-5 | High | 8.6% | 23.6% | $45.20 | 108h 46m | 415.1M | 4.1M |
| 28 | Grok CLI | grok-4-3 | - | 6.6% | 20.1% | $287 | 48h 7m | 225.0M | 2.3M |
| 29 | OpenClaw | minimax-m2-7 | High | 5.9% | 14.2% | $27.04 | 145h 16m | 281.3M | 4.4M |
| 30 | OpenClaw | grok-4-3 | High | 4.3% | 15.5% | $80.44 | 79h 27m | 169.2M | 2.8M |

| Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Codex | gpt-5-5 | XHigh | 26.7% | 49% | $286 | 28h 40m | 305.5M | 2.6M |
| 2 | Claude Code | claude-opus-4-8 | Max | 25.7% | 44.1% | $1,953 | 117h 29m | 642.2M | 15.3M |
| 3 | ALE Claw | gpt-5-5 | High | 24.8% | 48.1% | $110 | 16h 51m | 77.2M | 1.3M |
| 4 | OpenClaw | gpt-5-5 | High | 23.8% | 44.4% | $221 | 31h 54m | 213.1M | 2.2M |
| 5 | Codex | gpt-5-5 | High | 23.3% | 44.3% | $344 | 29h 8m | 375.1M | 2.4M |
| 6 | OpenClaw | gpt-5-4 | High | 22.5% | 41.8% | $171 | 50h 44m | 287.6M | 5.0M |
| 7 | Claude Code | claude-fable-5 | Adaptive | 21.9% | 44.6% | $880 | 74h 44m | 301.4M | 6.9M |
| 8 | Cursor CLI | gpt-5-5 | Medium | 21.9% | 41.8% | $98.19 | 14h 34m | 68.9M | 1.1M |
| 9 | Claude Code | claude-opus-4-8 | High | 21.9% | 42.7% | $1,016 | 59h 52m | 1.1B | 4.2M |
| 10 | ALE Claw | claude-opus-4-7 | High | 20% | 43.3% | $427 | 44h 45m | 401.0M | 3.1M |
| 11 | Claude Code | claude-opus-4-8 | XHigh | 20% | 42.1% | $1,713 | 92h 27m | 641.0M | 11.9M |
| 12 | Cursor CLI | composer-2-5 | Adaptive | 19% | 40.8% | $81.00 | 71h 28m | 153.2M | 1.8M |
| 13 | Droid | gpt-5-5 | High | 19% | 39.5% | $138 | 37h 41m | 118.1M | 1.5M |
| 14 | Claude Code | seed-2.1-pro | High | 19% | 41.4% | $602 | 98h 42m | 1.1B | 5.8M |
| 15 | Gemini CLI | gemini-3-1-pro-preview | High | 17.1% | 36.2% | $1,010 | 54h 56m | 450.8M | 2.4M |
| 16 | OpenClaw | claude-opus-4-7 | High | 16.2% | 37.8% | $983 | 54h 8m | 527.4M | 2.8M |
| 17 | OpenClaw | gemini-3-1-pro-preview | High | 15.7% | 31.7% | $1,304 | 66h 29m | 1.4B | 2.4M |
| 18 | Claude Code | claude-opus-4-7 | High | 14.3% | 38% | $1,165 | 26h 8m | 281.0M | 2.7M |
| 19 | Droid | claude-opus-4-7 | High | 14.3% | 33.7% | $710 | 19h 17m | 176.9M | 1.7M |
| 20 | OpenClaw | deepseek-v4-pro | High | 14.1% | 30.8% | $163 | 57h 2m | 317.4M | 3.4M |
| 21 | OpenClaw | qwen3-7-max | High | 13.3% | 34% | $534 | 36h 32m | 1.0B | 15.8M |
| 22 | ALE Claw | gpt-5-4 | High | 13.3% | 33.1% | $169 | 26h 35m | 541.7M | 1.1M |
| 23 | Forgecode | claude-sonnet-4-6 | Medium | 13.3% | 28.5% | $103 | 53h 39m | 137.4M | 1.8M |
| 24 | Hermes | claude-sonnet-4-6 | Medium | 13.2% | 32% | $440 | 25h 27m | 197.3M | 2.0M |
| 25 | OpenClaw | glm-5-1 | High | 12.9% | 30.8% | $248 | 77h 42m | 423.9M | 3.7M |
| 26 | Terminus 2 | claude-sonnet-4-6 | Off | 11.9% | 30.9% | $321 | 73h 13m | 762.9M | 3.2M |
| 27 | OpenClaw | claude-sonnet-4-6 | High | 11.4% | 31% | $181 | 33h 51m | 247.0M | 1.9M |
| 28 | OpenClaw | seed-2.1-pro | High | 11.4% | 33.2% | $440 | 101h 54m | 1.4B | 18.4M |
| 29 | OpenClaw | qwen3-6-plus | High | 10.5% | 28.6% | $130 | 67h 28m | 371.0M | 4.6M |
| 30 | OpenClaw | mimo-v2-5 | High | 10% | 26.5% | $33.40 | 52h 47m | 306.7M | 3.1M |
| 31 | OpenHands | claude-sonnet-4-6 | Not reported | 9% | 19.8% | $247 | 36h 21m | 354.7M | 4.5M |
| 32 | OpenClaw | kimi-k2-6 | High | 8.1% | 21.2% | $91.13 | 90h 11m | 223.3M | 4.5M |
| 33 | Grok CLI | grok-4-3 | - | 7.6% | 24.3% | $208 | 41h 12m | 162.8M | 1.7M |
| 34 | OpenClaw | minimax-m2-7 | High | 5.7% | 14.6% | $22.46 | 53h 35m | 246.8M | 3.2M |
| 35 | OpenClaw | grok-4-3 | High | 4.3% | 17.5% | $61.00 | 36h 30m | 136.4M | 2.1M |

| Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Codex | gpt-5-5 | XHigh | 43.3% | 71.5% | $129 | 17h 17m | 109.0M | 1.5M |
| 2 | Claude Code | claude-opus-4-8 | Max | 43.3% | 64% | $1,106 | 53h 49m | 310.5M | 8.3M |
| 3 | Codex | gpt-5-5 | High | 38.1% | 64.7% | $242 | 29h 48m | 240.4M | 1.6M |
| 4 | OpenClaw | gpt-5-5 | High | 35.8% | 65.7% | $218 | 41h 19m | 223.0M | 1.5M |
| 5 | Claude Code | claude-opus-4-8 | XHigh | 35.8% | 62.7% | $683 | 46h 41m | 201.3M | 6.0M |
| 6 | Claude Code | claude-opus-4-8 | High | 34.3% | 56.8% | $446 | 21h 35m | 397.7M | 2.2M |
| 7 | Claude Code | claude-fable-5 | Adaptive | 34.3% | 63.4% | $947 | 33h 43m | 341.7M | 3.1M |
| 8 | Cursor CLI | composer-2-5 | Adaptive | 34.3% | 61.1% | $49.70 | 23h 6m | 94.1M | 1.1M |
| 9 | OpenClaw | gpt-5-4 | High | 33.6% | 57.8% | $68.88 | 40h 56m | 86.0M | 2.4M |
| 10 | ALE Claw | gpt-5-5 | High | 32.8% | 67.4% | $148 | 14h 57m | 167.2M | 1.0M |
| 11 | Cursor CLI | gpt-5-5 | Medium | 32.1% | 60.8% | $68.46 | 30h 10m | 51.8M | 741.1K |
| 12 | Droid | gpt-5-5 | High | 29.9% | 58.2% | $106 | 23h 26m | 98.4M | 962.6K |
| 13 | ALE Claw | claude-opus-4-7 | High | 28.4% | 60.5% | $312 | 20h 40m | 359.2M | 1.9M |
| 14 | Droid | claude-opus-4-7 | High | 27.6% | 60.2% | $738 | 16h 6m | 176.2M | 1.5M |
| 15 | Claude Code | seed-2.1-pro | High | 26.9% | 58.9% | $376 | 65h 18m | 697.7M | 3.1M |
| 16 | OpenClaw | claude-opus-4-7 | High | 26.9% | 56.5% | $508 | 47h 42m | 201.7M | 1.5M |
| 17 | Gemini CLI | gemini-3-1-pro-preview | High | 26.9% | 53.5% | $342 | 28h 46m | 239.4M | 1.1M |
| 18 | OpenClaw | gemini-3-1-pro-preview | High | 26.1% | 48.3% | $575 | 48h 37m | 616.7M | 938.9K |
| 19 | Claude Code | claude-opus-4-7 | High | 20.9% | 54.3% | $496 | 16h 17m | 120.4M | 1.1M |
| 20 | ALE Claw | gpt-5-4 | High | 20.9% | 44.3% | $66.17 | 17h 22m | 178.6M | 684.8K |
| 21 | OpenClaw | glm-5-1 | High | 20.1% | 45.6% | $108 | 62h 17m | 183.5M | 1.6M |
| 22 | OpenClaw | deepseek-v4-pro | High | 19.9% | 43.8% | $109 | 58h 50m | 208.2M | 1.9M |
| 23 | OpenClaw | qwen3-7-max | High | 17.9% | 46.9% | $247 | 39h 19m | 502.0M | 7.3M |
| 24 | OpenClaw | seed-2.1-pro | High | 17.9% | 46.9% | $201 | 78h 24m | 640.9M | 8.9M |
| 25 | OpenClaw | kimi-k2-6 | High | 15.7% | 35.1% | $46.55 | 61h 6m | 118.5M | 2.1M |
| 26 | OpenClaw | qwen3-6-plus | High | 12.7% | 35.8% | $89.59 | 60h 13m | 259.7M | 2.7M |
| 27 | OpenClaw | mimo-v2-5 | High | 11.9% | 35.1% | $12.77 | 43h 37m | 105.4M | 1.5M |
| 28 | OpenClaw | minimax-m2-7 | High | 10.4% | 24.5% | $8.86 | 63h 22m | 98.2M | 1.4M |
| 29 | Grok CLI | grok-4-3 | - | 9% | 30.4% | $127 | 21h 59m | 99.8M | 1.0M |
| 30 | OpenClaw | grok-4-3 | High | 6.7% | 24.4% | $35.01 | 31h 15m | 73.8M | 1.2M |

| Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code | seed-2.1-pro | High | 24.1% | 40.1% | $289 | 49h 12m | 639.5M | 3.2M |
| 2 | Codex | gpt-5-5 | XHigh | 23.6% | 40.3% | $129 | 14h 43m | 120.5M | 1.4M |
| 3 | Claude Code | claude-opus-4-8 | Max | 23.6% | 43.1% | $1,324 | 62h 59m | 524.3M | 8.2M |
| 4 | ALE Claw | gpt-5-5 | High | 23.6% | 41.1% | $70.79 | 8h 34m | 55.1M | 756.3K |
| 5 | Codex | gpt-5-5 | High | 22.7% | 36% | $157 | 19h 9m | 147.0M | 1.3M |
| 6 | Claude Code | claude-fable-5 | Adaptive | 20.9% | 34.1% | $608 | 35h 41m | 327.7M | 3.7M |
| 7 | Cursor CLI | gpt-5-5 | Medium | 20% | 32.7% | $50.38 | 17h 29m | 36.9M | 563.2K |
| 8 | Claude Code | claude-opus-4-8 | XHigh | 20% | 38.3% | $856 | 49h 37m | 300.4M | 5.6M |
| 9 | OpenClaw | gpt-5-4 | High | 19.4% | 34.3% | $125 | 58h 25m | 241.4M | 3.0M |
| 10 | ALE Claw | claude-opus-4-7 | High | 18.2% | 36.6% | $258 | 16h 1m | 300.2M | 1.8M |
| 11 | OpenClaw | gpt-5-5 | High | 18.2% | 32.1% | $104 | 23h 21m | 100.4M | 1.1M |
| 12 | Cursor CLI | composer-2-5 | Adaptive | 18.2% | 30.8% | $67.71 | 27h 7m | 130.0M | 1.1M |
| 13 | Claude Code | claude-opus-4-8 | High | 18.2% | 37.8% | $569 | 26h 29m | 698.1M | 2.3M |
| 14 | Droid | gpt-5-5 | High | 16.4% | 33.4% | $70.65 | 15h 50m | 59.6M | 823.2K |
| 15 | OpenClaw | seed-2.1-pro | High | 13.5% | 23.6% | $171 | 45h 24m | 572.9M | 7.7M |
| 16 | Claude Code | claude-opus-4-7 | High | 12.7% | 29.1% | $747 | 12h 25m | 202.0M | 1.5M |
| 17 | Gemini CLI | gemini-3-1-pro-preview | High | 12.7% | 26.4% | $962 | 26h 32m | 481.7M | 1.7M |
| 18 | OpenClaw | claude-opus-4-7 | High | 10.9% | 27.5% | $393 | 30h 26m | 174.1M | 1.2M |
| 19 | OpenClaw | qwen3-7-max | High | 10.9% | 27.2% | $280 | 36h 42m | 585.1M | 7.1M |
| 20 | OpenClaw | deepseek-v4-pro | High | 10.9% | 23.8% | $69.75 | 39h 2m | 119.0M | 1.6M |
| 21 | OpenClaw | gemini-3-1-pro-preview | High | 10.9% | 23.6% | $1,393 | 62h 53m | 1.6B | 1.7M |
| 22 | ALE Claw | gpt-5-4 | High | 9.1% | 22.9% | $179 | 22h 40m | 596.0M | 936.5K |
| 23 | OpenClaw | glm-5-1 | High | 9.1% | 21.8% | $201 | 65h 41m | 345.3M | 2.7M |
| 24 | OpenClaw | mimo-v2-5 | High | 9.1% | 20.8% | $18.43 | 36h 14m | 180.1M | 1.5M |
| 25 | OpenClaw | qwen3-6-plus | High | 8.2% | 22.9% | $112 | 51h 37m | 326.2M | 3.2M |
| 26 | Grok CLI | grok-4-3 | - | 7.3% | 17% | $97.08 | 15h 30m | 75.8M | 932.5K |
| 27 | OpenClaw | kimi-k2-6 | High | 6.4% | 18.2% | $41.45 | 74h 15m | 93.9M | 2.5M |
| 28 | OpenClaw | grok-4-3 | High | 3.6% | 12.9% | $25.57 | 25h 7m | 52.2M | 1.0M |
| 29 | Droid | claude-opus-4-7 | High | 3.6% | 10.9% | $167 | 3h 8m | 39.9M | 472.1K |
| 30 | OpenClaw | minimax-m2-7 | High | 3.6% | 8.4% | $9.87 | 46h 56m | 99.2M | 1.8M |

| Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ALE Claw | gpt-5-5 | High | 2.6% | 12.8% | $107 | 13h 44m | 126.3M | 704.4K |
| 2 | Droid | gpt-5-5 | High | 2.6% | 11.3% | $75.62 | 26h 27m | 83.6M | 551.8K |
| 3 | Cursor CLI | gpt-5-5 | Medium | 2.6% | 10.7% | $61.49 | 17h 12m | 63.9M | 456.4K |
| 4 | Claude Code | claude-opus-4-8 | XHigh | 2.6% | 10.4% | $1,395 | 76h 53m | 739.8M | 5.9M |
| 5 | Claude Code | claude-opus-4-8 | Max | 2.6% | 14.4% | $1,805 | 89h 34m | 961.0M | 7.3M |
| 6 | Codex | gpt-5-5 | High | 0% | 11.2% | $179 | 28h 42m | 198.6M | 1.1M |
| 7 | Codex | gpt-5-5 | XHigh | 0% | 14.6% | $262 | 26h 44m | 325.5M | 1.7M |
| 8 | OpenClaw | gpt-5-5 | High | 0% | 10.9% | $155 | 24h 52m | 178.2M | 997.0K |
| 9 | Claude Code | claude-opus-4-7 | High | 0% | 10% | $685 | 17h 6m | 170.1M | 1.4M |
| 10 | Claude Code | seed-2.1-pro | High | 0% | 10.8% | $319 | 59h 48m | 814.8M | 2.8M |
| 11 | OpenClaw | seed-2.1-pro | High | 0% | 10.4% | $84.70 | 81h 24m | 455.8M | 4.8M |
| 12 | Cursor CLI | composer-2-5 | Adaptive | 0% | 8.8% | $77.87 | 45h 32m | 151.0M | 948.3K |
| 13 | ALE Claw | gpt-5-4 | High | 0% | 8.1% | $100 | 17h 39m | 321.4M | 540.7K |
| 14 | Droid | claude-opus-4-7 | High | 0% | 8% | $496 | 10h 18m | 145.7M | 920.1K |
| 15 | ALE Claw | claude-opus-4-7 | High | 0% | 7.9% | $594 | 43h 8m | 707.3M | 2.2M |
| 16 | OpenClaw | gpt-5-4 | High | 0% | 7.3% | $107 | 36h 16m | 206.9M | 2.5M |
| 17 | OpenClaw | qwen3-7-max | High | 0% | 6.4% | $173 | 34h 42m | 377.8M | 4.1M |
| 18 | OpenClaw | glm-5-1 | High | 0% | 6.2% | $204 | 57h 10m | 354.0M | 2.5M |
| 19 | Claude Code | claude-fable-5 | Adaptive | 0% | 5.2% | $838 | 54h 32m | 231.2M | 3.3M |
| 20 | Claude Code | claude-opus-4-8 | High | 0% | 7% | $735 | 45h 41m | 841.1M | 2.2M |
| 21 | OpenClaw | qwen3-6-plus | High | 0% | 5% | $78.19 | 35h 46m | 228.9M | 1.9M |
| 22 | OpenClaw | claude-opus-4-7 | High | 0% | 4.3% | $842 | 50h 0m | 452.9M | 1.5M |
| 23 | OpenClaw | mimo-v2-5 | High | 0% | 3.4% | $14.92 | 33h 3m | 135.8M | 1.3M |
| 24 | OpenClaw | gemini-3-1-pro-preview | High | 0% | 3.1% | $1,200 | 45h 9m | 1.4B | 1.1M |
| 25 | OpenClaw | deepseek-v4-pro | High | 0% | 2.5% | $108 | 39h 7m | 208.8M | 1.7M |
| 26 | Grok CLI | grok-4-3 | - | 0% | 2.3% | $71.44 | 11h 35m | 56.2M | 477.3K |
| 27 | OpenClaw | grok-4-3 | High | 0% | 2.3% | $22.18 | 25h 4m | 47.2M | 621.3K |
| 28 | OpenClaw | kimi-k2-6 | High | 0% | 1.6% | $38.87 | 60h 46m | 88.1M | 1.8M |
| 29 | OpenClaw | minimax-m2-7 | High | 0% | 1.3% | $9.15 | 44h 14m | 87.7M | 1.5M |
| 30 | Gemini CLI | gemini-3-1-pro-preview | High | 0% | 0.9% | $733 | 42h 12m | 497.3M | 821.0K |

*claude-fable-5: the variant Anthropic served during evaluation may differ from the published model's full capability tier, and re-runs cannot guarantee the higher-tier variant is selected. These numbers may understate the model's true ceiling.
Pass rate vs input tokens (scatter plot, one point per harness and model)
Industry coverage
Motion & VFX
3D modeling
Game development
Mold flow analysis
Architectural modeling
Brain imaging
Sample tasks
A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.
business finance
sec_10k_financial_parsing
Parse a SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.
business finance
health medicine
Clinical_Variant_Annotation
life sciences
computing math
visual media
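For a structured-extraction task such as sec_10k_financial_parsing above, a deterministic grader that compares an agent's output against a hidden reference might look roughly like the sketch below. This is an illustration only: the field names, the per-field partial-credit rule, and the numeric tolerance are assumptions, not the benchmark's actual grader or schema.

```python
import math


def grade_extraction(submitted: dict, reference: dict, rel_tol: float = 1e-6) -> float:
    """Return the fraction of reference fields the submission matched.

    Hypothetical scoring rule: each reference field is worth equal credit,
    numbers match within a relative tolerance, everything else must be equal.
    The same inputs always produce the same grade (no randomness, no model calls).
    """
    if not reference:
        raise ValueError("reference must contain at least one field")

    matched = 0
    for key, expected in reference.items():
        actual = submitted.get(key)  # a missing field simply earns no credit
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            ok = math.isclose(actual, expected, rel_tol=rel_tol)
        else:
            ok = actual == expected
        matched += int(ok)
    return matched / len(reference)


# Hypothetical hidden reference and agent output for illustration.
reference = {
    "fiscal_year": 2023,
    "total_revenue": 1000.0,
    "net_income": 150.0,
    "total_assets": 5000.0,
}
submitted = {
    "fiscal_year": 2023,
    "total_revenue": 1000.0,
    "net_income": 140.0,  # wrong value
    # "total_assets" is missing
}

grade = grade_extraction(submitted, reference)
print(grade)  # 0.5: two of four reference fields match
```

A grade of 1.0 would correspond to a fully completed task under this rule; anything lower is partial credit.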
Methodology
Metrics
Pass Rate: the fraction of tasks the agent fully completed (strict success).
Score: the average graded outcome across all tasks, including partial credit.
Both are computed by deterministic graders against hidden references.
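The two metrics can be illustrated with a minimal sketch. It assumes each task yields a single grade in the range 0 to 1 and that strict success means a grade of exactly 1.0; the `TaskResult` type, the task IDs, and the grading scale are hypothetical and not taken from the ALE framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    grade: float  # assumed scale: 0.0 (no credit) to 1.0 (fully completed)


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks with full credit (strict success)."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(1 for r in results if r.grade >= 1.0) / len(results)


def score(results: list[TaskResult]) -> float:
    """Mean grade across all tasks, so partial credit counts."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.grade for r in results) / len(results)


# Hypothetical run over four tasks.
results = [
    TaskResult("task_a", 1.0),
    TaskResult("task_b", 0.5),
    TaskResult("task_c", 0.0),
    TaskResult("task_d", 1.0),
]

print(f"Pass Rate: {pass_rate(results):.1%}")  # Pass Rate: 50.0%
print(f"Score: {score(results):.1%}")          # Score: 62.5%
```

Under these assumptions Score is never lower than Pass Rate, which matches the leaderboard, where every row's Score exceeds its Pass Rate.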
Roughly every 6 months, a new public subset is released with fresh instances. To limit benchmark leakage, private tasks rotate into the public pool, retired public tasks rotate out, and held-out private tasks are used to score the official leaderboard.
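One way to picture a rotation step is as a move between three pools. The sketch below is a simplified model under stated assumptions: the `TaskPools` type, the `rotate` function, and the task IDs are hypothetical, and it does not model how fresh instances are generated or how tasks are chosen for promotion or retirement.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPools:
    public: set[str] = field(default_factory=set)   # published subset
    private: set[str] = field(default_factory=set)  # held-out tasks
    retired: set[str] = field(default_factory=set)  # formerly public tasks


def rotate(pools: TaskPools, promote: set[str], retire: set[str]) -> TaskPools:
    """Return new pools after one release cycle (the input is left unchanged)."""
    if not promote <= pools.private:
        raise ValueError("can only promote tasks that are currently private")
    if not retire <= pools.public:
        raise ValueError("can only retire tasks that are currently public")
    return TaskPools(
        public=(pools.public - retire) | promote,
        private=pools.private - promote,
        retired=pools.retired | retire,
    )


# Hypothetical cycle: one private task becomes public, one public task retires.
before = TaskPools(public={"pub_1", "pub_2"}, private={"priv_1", "priv_2", "priv_3"})
after = rotate(before, promote={"priv_1"}, retire={"pub_2"})

print(sorted(after.public))   # ['priv_1', 'pub_1']
print(sorted(after.private))  # ['priv_2', 'priv_3']
print(sorted(after.retired))  # ['pub_2']
```

In this model the tasks left in `private` after a rotation are the held-out set that would score the official leaderboard.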
Reference Harnesses
Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE Claw reference harness.
Acknowledgments
Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.
The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).