
Agents' Last Exam

A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.

Overview

Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.

ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation to limit benchmark leakage: roughly every six months a new public subset is published with fresh instances, private tasks rotate in, and retired public tasks rotate out.

Leaderboard

Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens
1 | Claude Code | claude-opus-4-8 | Max | 27% | 45.2% | $3,985 | 191h 6m | 1.7B | 22.1M
2 | Codex | gpt-5-5 | XHigh | 26.3% | 47.7% | $494 | 54h 54m | 530.2M | 4.2M
3 | Codex | gpt-5-5 | High | 24.3% | 45.3% | $389 | 61h 48m | 412.8M | 3.3M
4 | ALE Claw | gpt-5-5 | High | 23% | 45.8% | $311 | 35h 41m | 334.5M | 2.4M
5 | Claude Code | claude-opus-4-8 | XHigh | 22.4% | 42.3% | $2,815 | 159h 36m | 1.2B | 16.5M
6 | Claude Code | claude-fable-5 | Adaptive | 22% | 40.5% | $2,315 | 113h 14m | 880.0M | 9.4M
7 | OpenClaw | gpt-5-5 | High | 21.1% | 41% | $449 | 83h 1m | 471.1M | 3.3M
8 | Cursor CLI | gpt-5-5 | Medium | 20.7% | 39.6% | $174 | 58h 52m | 148.5M | 1.7M
9 | OpenClaw | gpt-5-4 | High | 20.5% | 37.3% | $276 | 128h 22m | 490.5M | 7.4M
10 | Claude Code | claude-opus-4-8 | High | 20.4% | 38.7% | $1,626 | 86h 0m | 1.8B | 6.4M
11 | Cursor CLI | composer-2-5 | Adaptive | 20.4% | 38.5% | $177 | 90h 33m | 338.8M | 2.9M
12 | Claude Code | seed-2.1-pro | High | 19.5% | 41.4% | $936 | 167h 54m | 2.1B | 8.4M
13 | Droid | gpt-5-5 | High | 19.1% | 38.6% | $244 | 64h 3m | 235.8M | 2.2M
14 | ALE Claw | claude-opus-4-7 | High | 18.4% | 40.5% | $1,144 | 78h 9m | 1.4B | 5.7M
15 | Gemini CLI | gemini-3-1-pro-preview | High | 15.8% | 32% | $2,018 | 96h 32m | 1.2B | 3.5M
16 | OpenClaw | claude-opus-4-7 | High | 15.1% | 34.6% | $1,712 | 124h 49m | 818.4M | 4.0M
17 | OpenClaw | gemini-3-1-pro-preview | High | 14.1% | 28.7% | $2,969 | 147h 12m | 3.4B | 3.7M
18 | Claude Code | claude-opus-4-7 | High | 13.2% | 35.1% | $1,793 | 42h 57m | 456.4M | 3.7M
19 | OpenClaw | seed-2.1-pro | High | 13.1% | 32.2% | $440 | 196h 42m | 1.6B | 20.4M
20 | Droid | claude-opus-4-7 | High | 12.8% | 31% | $1,356 | 28h 16m | 352.1M | 2.7M
21 | OpenClaw | deepseek-v4-pro | High | 12.4% | 27.6% | $275 | 131h 38m | 516.3M | 4.8M
22 | OpenClaw | qwen3-7-max | High | 11.8% | 31.1% | $664 | 104h 37m | 1.4B | 17.6M
23 | ALE Claw | gpt-5-4 | High | 11.8% | 28.2% | $335 | 55h 32m | 1.1B | 2.1M
24 | OpenClaw | glm-5-1 | High | 11.5% | 28.2% | $475 | 172h 24m | 818.7M | 6.4M
25 | OpenClaw | kimi-k2-6 | High | 9.2% | 21.7% | $124 | 183h 58m | 296.3M | 6.1M
26 | OpenClaw | qwen3-6-plus | High | 8.6% | 24.3% | $255 | 139h 43m | 743.1M | 7.2M
27 | OpenClaw | mimo-v2-5 | High | 8.6% | 23.6% | $45.20 | 108h 46m | 415.1M | 4.1M
28 | Grok CLI | grok-4-3 | - | 6.6% | 20.1% | $287 | 48h 7m | 225.0M | 2.3M
29 | OpenClaw | minimax-m2-7 | High | 5.9% | 14.2% | $27.04 | 145h 16m | 281.3M | 4.4M
30 | OpenClaw | grok-4-3 | High | 4.3% | 15.5% | $80.44 | 79h 27m | 169.2M | 2.8M

Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens
1 | Codex | gpt-5-5 | XHigh | 26.7% | 49% | $286 | 28h 40m | 305.5M | 2.6M
2 | Claude Code | claude-opus-4-8 | Max | 25.7% | 44.1% | $1,953 | 117h 29m | 642.2M | 15.3M
3 | ALE Claw | gpt-5-5 | High | 24.8% | 48.1% | $110 | 16h 51m | 77.2M | 1.3M
4 | OpenClaw | gpt-5-5 | High | 23.8% | 44.4% | $221 | 31h 54m | 213.1M | 2.2M
5 | Codex | gpt-5-5 | High | 23.3% | 44.3% | $344 | 29h 8m | 375.1M | 2.4M
6 | OpenClaw | gpt-5-4 | High | 22.5% | 41.8% | $171 | 50h 44m | 287.6M | 5.0M
7 | Claude Code | claude-fable-5 | Adaptive | 21.9% | 44.6% | $880 | 74h 44m | 301.4M | 6.9M
8 | Cursor CLI | gpt-5-5 | Medium | 21.9% | 41.8% | $98.19 | 14h 34m | 68.9M | 1.1M
9 | Claude Code | claude-opus-4-8 | High | 21.9% | 42.7% | $1,016 | 59h 52m | 1.1B | 4.2M
10 | ALE Claw | claude-opus-4-7 | High | 20% | 43.3% | $427 | 44h 45m | 401.0M | 3.1M
11 | Claude Code | claude-opus-4-8 | XHigh | 20% | 42.1% | $1,713 | 92h 27m | 641.0M | 11.9M
12 | Cursor CLI | composer-2-5 | Adaptive | 19% | 40.8% | $81.00 | 71h 28m | 153.2M | 1.8M
13 | Droid | gpt-5-5 | High | 19% | 39.5% | $138 | 37h 41m | 118.1M | 1.5M
14 | Claude Code | seed-2.1-pro | High | 19% | 41.4% | $602 | 98h 42m | 1.1B | 5.8M
15 | Gemini CLI | gemini-3-1-pro-preview | High | 17.1% | 36.2% | $1,010 | 54h 56m | 450.8M | 2.4M
16 | OpenClaw | claude-opus-4-7 | High | 16.2% | 37.8% | $983 | 54h 8m | 527.4M | 2.8M
17 | OpenClaw | gemini-3-1-pro-preview | High | 15.7% | 31.7% | $1,304 | 66h 29m | 1.4B | 2.4M
18 | Claude Code | claude-opus-4-7 | High | 14.3% | 38% | $1,165 | 26h 8m | 281.0M | 2.7M
19 | Droid | claude-opus-4-7 | High | 14.3% | 33.7% | $710 | 19h 17m | 176.9M | 1.7M
20 | OpenClaw | deepseek-v4-pro | High | 14.1% | 30.8% | $163 | 57h 2m | 317.4M | 3.4M
21 | OpenClaw | qwen3-7-max | High | 13.3% | 34% | $534 | 36h 32m | 1.0B | 15.8M
22 | ALE Claw | gpt-5-4 | High | 13.3% | 33.1% | $169 | 26h 35m | 541.7M | 1.1M
23 | Forgecode | claude-sonnet-4-6 | Medium | 13.3% | 28.5% | $103 | 53h 39m | 137.4M | 1.8M
24 | Hermes | claude-sonnet-4-6 | Medium | 13.2% | 32% | $440 | 25h 27m | 197.3M | 2.0M
25 | OpenClaw | glm-5-1 | High | 12.9% | 30.8% | $248 | 77h 42m | 423.9M | 3.7M
26 | Terminus 2 | claude-sonnet-4-6 | Off | 11.9% | 30.9% | $321 | 73h 13m | 762.9M | 3.2M
27 | OpenClaw | claude-sonnet-4-6 | High | 11.4% | 31% | $181 | 33h 51m | 247.0M | 1.9M
28 | OpenClaw | seed-2.1-pro | High | 11.4% | 33.2% | $440 | 101h 54m | 1.4B | 18.4M
29 | OpenClaw | qwen3-6-plus | High | 10.5% | 28.6% | $130 | 67h 28m | 371.0M | 4.6M
30 | OpenClaw | mimo-v2-5 | High | 10% | 26.5% | $33.40 | 52h 47m | 306.7M | 3.1M
31 | OpenHands | claude-sonnet-4-6 | Not reported | 9% | 19.8% | $247 | 36h 21m | 354.7M | 4.5M
32 | OpenClaw | kimi-k2-6 | High | 8.1% | 21.2% | $91.13 | 90h 11m | 223.3M | 4.5M
33 | Grok CLI | grok-4-3 | - | 7.6% | 24.3% | $208 | 41h 12m | 162.8M | 1.7M
34 | OpenClaw | minimax-m2-7 | High | 5.7% | 14.6% | $22.46 | 53h 35m | 246.8M | 3.2M
35 | OpenClaw | grok-4-3 | High | 4.3% | 17.5% | $61.00 | 36h 30m | 136.4M | 2.1M

Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens
1 | Codex | gpt-5-5 | XHigh | 43.3% | 71.5% | $129 | 17h 17m | 109.0M | 1.5M
2 | Claude Code | claude-opus-4-8 | Max | 43.3% | 64% | $1,106 | 53h 49m | 310.5M | 8.3M
3 | Codex | gpt-5-5 | High | 38.1% | 64.7% | $242 | 29h 48m | 240.4M | 1.6M
4 | OpenClaw | gpt-5-5 | High | 35.8% | 65.7% | $218 | 41h 19m | 223.0M | 1.5M
5 | Claude Code | claude-opus-4-8 | XHigh | 35.8% | 62.7% | $683 | 46h 41m | 201.3M | 6.0M
6 | Claude Code | claude-opus-4-8 | High | 34.3% | 56.8% | $446 | 21h 35m | 397.7M | 2.2M
7 | Claude Code | claude-fable-5 | Adaptive | 34.3% | 63.4% | $947 | 33h 43m | 341.7M | 3.1M
8 | Cursor CLI | composer-2-5 | Adaptive | 34.3% | 61.1% | $49.70 | 23h 6m | 94.1M | 1.1M
9 | OpenClaw | gpt-5-4 | High | 33.6% | 57.8% | $68.88 | 40h 56m | 86.0M | 2.4M
10 | ALE Claw | gpt-5-5 | High | 32.8% | 67.4% | $148 | 14h 57m | 167.2M | 1.0M
11 | Cursor CLI | gpt-5-5 | Medium | 32.1% | 60.8% | $68.46 | 30h 10m | 51.8M | 741.1K
12 | Droid | gpt-5-5 | High | 29.9% | 58.2% | $106 | 23h 26m | 98.4M | 962.6K
13 | ALE Claw | claude-opus-4-7 | High | 28.4% | 60.5% | $312 | 20h 40m | 359.2M | 1.9M
14 | Droid | claude-opus-4-7 | High | 27.6% | 60.2% | $738 | 16h 6m | 176.2M | 1.5M
15 | Claude Code | seed-2.1-pro | High | 26.9% | 58.9% | $376 | 65h 18m | 697.7M | 3.1M
16 | OpenClaw | claude-opus-4-7 | High | 26.9% | 56.5% | $508 | 47h 42m | 201.7M | 1.5M
17 | Gemini CLI | gemini-3-1-pro-preview | High | 26.9% | 53.5% | $342 | 28h 46m | 239.4M | 1.1M
18 | OpenClaw | gemini-3-1-pro-preview | High | 26.1% | 48.3% | $575 | 48h 37m | 616.7M | 938.9K
19 | Claude Code | claude-opus-4-7 | High | 20.9% | 54.3% | $496 | 16h 17m | 120.4M | 1.1M
20 | ALE Claw | gpt-5-4 | High | 20.9% | 44.3% | $66.17 | 17h 22m | 178.6M | 684.8K
21 | OpenClaw | glm-5-1 | High | 20.1% | 45.6% | $108 | 62h 17m | 183.5M | 1.6M
22 | OpenClaw | deepseek-v4-pro | High | 19.9% | 43.8% | $109 | 58h 50m | 208.2M | 1.9M
23 | OpenClaw | qwen3-7-max | High | 17.9% | 46.9% | $247 | 39h 19m | 502.0M | 7.3M
24 | OpenClaw | seed-2.1-pro | High | 17.9% | 46.9% | $201 | 78h 24m | 640.9M | 8.9M
25 | OpenClaw | kimi-k2-6 | High | 15.7% | 35.1% | $46.55 | 61h 6m | 118.5M | 2.1M
26 | OpenClaw | qwen3-6-plus | High | 12.7% | 35.8% | $89.59 | 60h 13m | 259.7M | 2.7M
27 | OpenClaw | mimo-v2-5 | High | 11.9% | 35.1% | $12.77 | 43h 37m | 105.4M | 1.5M
28 | OpenClaw | minimax-m2-7 | High | 10.4% | 24.5% | $8.86 | 63h 22m | 98.2M | 1.4M
29 | Grok CLI | grok-4-3 | - | 9% | 30.4% | $127 | 21h 59m | 99.8M | 1.0M
30 | OpenClaw | grok-4-3 | High | 6.7% | 24.4% | $35.01 | 31h 15m | 73.8M | 1.2M

Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens
1 | Claude Code | seed-2.1-pro | High | 24.1% | 40.1% | $289 | 49h 12m | 639.5M | 3.2M
2 | Codex | gpt-5-5 | XHigh | 23.6% | 40.3% | $129 | 14h 43m | 120.5M | 1.4M
3 | Claude Code | claude-opus-4-8 | Max | 23.6% | 43.1% | $1,324 | 62h 59m | 524.3M | 8.2M
4 | ALE Claw | gpt-5-5 | High | 23.6% | 41.1% | $70.79 | 8h 34m | 55.1M | 756.3K
5 | Codex | gpt-5-5 | High | 22.7% | 36% | $157 | 19h 9m | 147.0M | 1.3M
6 | Claude Code | claude-fable-5 | Adaptive | 20.9% | 34.1% | $608 | 35h 41m | 327.7M | 3.7M
7 | Cursor CLI | gpt-5-5 | Medium | 20% | 32.7% | $50.38 | 17h 29m | 36.9M | 563.2K
8 | Claude Code | claude-opus-4-8 | XHigh | 20% | 38.3% | $856 | 49h 37m | 300.4M | 5.6M
9 | OpenClaw | gpt-5-4 | High | 19.4% | 34.3% | $125 | 58h 25m | 241.4M | 3.0M
10 | ALE Claw | claude-opus-4-7 | High | 18.2% | 36.6% | $258 | 16h 1m | 300.2M | 1.8M
11 | OpenClaw | gpt-5-5 | High | 18.2% | 32.1% | $104 | 23h 21m | 100.4M | 1.1M
12 | Cursor CLI | composer-2-5 | Adaptive | 18.2% | 30.8% | $67.71 | 27h 7m | 130.0M | 1.1M
13 | Claude Code | claude-opus-4-8 | High | 18.2% | 37.8% | $569 | 26h 29m | 698.1M | 2.3M
14 | Droid | gpt-5-5 | High | 16.4% | 33.4% | $70.65 | 15h 50m | 59.6M | 823.2K
15 | OpenClaw | seed-2.1-pro | High | 13.5% | 23.6% | $171 | 45h 24m | 572.9M | 7.7M
16 | Claude Code | claude-opus-4-7 | High | 12.7% | 29.1% | $747 | 12h 25m | 202.0M | 1.5M
17 | Gemini CLI | gemini-3-1-pro-preview | High | 12.7% | 26.4% | $962 | 26h 32m | 481.7M | 1.7M
18 | OpenClaw | claude-opus-4-7 | High | 10.9% | 27.5% | $393 | 30h 26m | 174.1M | 1.2M
19 | OpenClaw | qwen3-7-max | High | 10.9% | 27.2% | $280 | 36h 42m | 585.1M | 7.1M
20 | OpenClaw | deepseek-v4-pro | High | 10.9% | 23.8% | $69.75 | 39h 2m | 119.0M | 1.6M
21 | OpenClaw | gemini-3-1-pro-preview | High | 10.9% | 23.6% | $1,393 | 62h 53m | 1.6B | 1.7M
22 | ALE Claw | gpt-5-4 | High | 9.1% | 22.9% | $179 | 22h 40m | 596.0M | 936.5K
23 | OpenClaw | glm-5-1 | High | 9.1% | 21.8% | $201 | 65h 41m | 345.3M | 2.7M
24 | OpenClaw | mimo-v2-5 | High | 9.1% | 20.8% | $18.43 | 36h 14m | 180.1M | 1.5M
25 | OpenClaw | qwen3-6-plus | High | 8.2% | 22.9% | $112 | 51h 37m | 326.2M | 3.2M
26 | Grok CLI | grok-4-3 | - | 7.3% | 17% | $97.08 | 15h 30m | 75.8M | 932.5K
27 | OpenClaw | kimi-k2-6 | High | 6.4% | 18.2% | $41.45 | 74h 15m | 93.9M | 2.5M
28 | OpenClaw | grok-4-3 | High | 3.6% | 12.9% | $25.57 | 25h 7m | 52.2M | 1.0M
29 | Droid | claude-opus-4-7 | High | 3.6% | 10.9% | $167 | 3h 8m | 39.9M | 472.1K
30 | OpenClaw | minimax-m2-7 | High | 3.6% | 8.4% | $9.87 | 46h 56m | 99.2M | 1.8M

Rank | Harness | Model | Effort | Pass Rate | Score | Est. Cost | Runtime | Input Tokens | Output Tokens
1 | ALE Claw | gpt-5-5 | High | 2.6% | 12.8% | $107 | 13h 44m | 126.3M | 704.4K
2 | Droid | gpt-5-5 | High | 2.6% | 11.3% | $75.62 | 26h 27m | 83.6M | 551.8K
3 | Cursor CLI | gpt-5-5 | Medium | 2.6% | 10.7% | $61.49 | 17h 12m | 63.9M | 456.4K
4 | Claude Code | claude-opus-4-8 | XHigh | 2.6% | 10.4% | $1,395 | 76h 53m | 739.8M | 5.9M
5 | Claude Code | claude-opus-4-8 | Max | 2.6% | 14.4% | $1,805 | 89h 34m | 961.0M | 7.3M
6 | Codex | gpt-5-5 | High | 0% | 11.2% | $179 | 28h 42m | 198.6M | 1.1M
7 | Codex | gpt-5-5 | XHigh | 0% | 14.6% | $262 | 26h 44m | 325.5M | 1.7M
8 | OpenClaw | gpt-5-5 | High | 0% | 10.9% | $155 | 24h 52m | 178.2M | 997.0K
9 | Claude Code | claude-opus-4-7 | High | 0% | 10% | $685 | 17h 6m | 170.1M | 1.4M
10 | Claude Code | seed-2.1-pro | High | 0% | 10.8% | $319 | 59h 48m | 814.8M | 2.8M
11 | OpenClaw | seed-2.1-pro | High | 0% | 10.4% | $84.70 | 81h 24m | 455.8M | 4.8M
12 | Cursor CLI | composer-2-5 | Adaptive | 0% | 8.8% | $77.87 | 45h 32m | 151.0M | 948.3K
13 | ALE Claw | gpt-5-4 | High | 0% | 8.1% | $100 | 17h 39m | 321.4M | 540.7K
14 | Droid | claude-opus-4-7 | High | 0% | 8% | $496 | 10h 18m | 145.7M | 920.1K
15 | ALE Claw | claude-opus-4-7 | High | 0% | 7.9% | $594 | 43h 8m | 707.3M | 2.2M
16 | OpenClaw | gpt-5-4 | High | 0% | 7.3% | $107 | 36h 16m | 206.9M | 2.5M
17 | OpenClaw | qwen3-7-max | High | 0% | 6.4% | $173 | 34h 42m | 377.8M | 4.1M
18 | OpenClaw | glm-5-1 | High | 0% | 6.2% | $204 | 57h 10m | 354.0M | 2.5M
19 | Claude Code | claude-fable-5 | Adaptive | 0% | 5.2% | $838 | 54h 32m | 231.2M | 3.3M
20 | Claude Code | claude-opus-4-8 | High | 0% | 7% | $735 | 45h 41m | 841.1M | 2.2M
21 | OpenClaw | qwen3-6-plus | High | 0% | 5% | $78.19 | 35h 46m | 228.9M | 1.9M
22 | OpenClaw | claude-opus-4-7 | High | 0% | 4.3% | $842 | 50h 0m | 452.9M | 1.5M
23 | OpenClaw | mimo-v2-5 | High | 0% | 3.4% | $14.92 | 33h 3m | 135.8M | 1.3M
24 | OpenClaw | gemini-3-1-pro-preview | High | 0% | 3.1% | $1,200 | 45h 9m | 1.4B | 1.1M
25 | OpenClaw | deepseek-v4-pro | High | 0% | 2.5% | $108 | 39h 7m | 208.8M | 1.7M
26 | Grok CLI | grok-4-3 | - | 0% | 2.3% | $71.44 | 11h 35m | 56.2M | 477.3K
27 | OpenClaw | grok-4-3 | High | 0% | 2.3% | $22.18 | 25h 4m | 47.2M | 621.3K
28 | OpenClaw | kimi-k2-6 | High | 0% | 1.6% | $38.87 | 60h 46m | 88.1M | 1.8M
29 | OpenClaw | minimax-m2-7 | High | 0% | 1.3% | $9.15 | 44h 14m | 87.7M | 1.5M
30 | Gemini CLI | gemini-3-1-pro-preview | High | 0% | 0.9% | $733 | 42h 12m | 497.3M | 821.0K

*claude-fable-5: the variant Anthropic served during evaluation may differ from the published model's full capability tier, and re-runs cannot guarantee the higher-tier variant is selected. These numbers may understate the model's true ceiling.

Pass rate vs input tokens

Efficiency frontier for this evaluation split. Up and to the left is better (higher pass rate at lower input-token spend). Several harness/model pairs reach the top of the split at a fraction of the cost of others; some pairs spend an order of magnitude more tokens without a corresponding pass-rate gain.
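
The frontier described above can be read as a Pareto front over (input tokens, pass rate). The following is a minimal sketch of that idea, not the code behind the ALE chart: the `Run` type and `efficiency_frontier` function are hypothetical names introduced here, and the five sample rows are copied from the first leaderboard table.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str
    input_tokens: float  # total input tokens spent by the run
    pass_rate: float     # pass rate in percent


def efficiency_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run dominates.

    A run is dominated when some other run spends no more input tokens
    and reaches a pass rate at least as high, being strictly better on
    at least one of the two. Exact duplicates are kept only once.
    """
    # Cheapest first; among equal token counts, highest pass rate first.
    ordered = sorted(runs, key=lambda r: (r.input_tokens, -r.pass_rate))
    frontier: list[Run] = []
    best_pass_rate = float("-inf")
    for run in ordered:
        # A run stays only if it beats every cheaper run on pass rate.
        if run.pass_rate > best_pass_rate:
            frontier.append(run)
            best_pass_rate = run.pass_rate
    return frontier


# Five rows copied from the first leaderboard table (illustrative subset).
runs = [
    Run("Claude Code / claude-opus-4-8 (Max)", 1.7e9, 27.0),
    Run("Codex / gpt-5-5 (XHigh)", 530.2e6, 26.3),
    Run("ALE Claw / gpt-5-5 (High)", 334.5e6, 23.0),
    Run("Cursor CLI / gpt-5-5 (Medium)", 148.5e6, 20.7),
    Run("OpenClaw / gemini-3-1-pro-preview (High)", 3.4e9, 14.1),
]

for run in efficiency_frontier(runs):
    print(f"{run.label}: {run.pass_rate}% at {run.input_tokens:.3g} input tokens")
```

Within this five-row subset, only the OpenClaw / gemini-3-1-pro-preview row falls off the frontier: it spends the most input tokens and has the lowest pass rate. Running the same function over a full split would give that split's frontier.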


Industry coverage

Six representative task families across the 55 sub-industries ALE covers.

Motion & VFX

Animation and visual effects production tasks in Adobe After Effects.

3D modeling

3D model creation and editing tasks in Siemens NX.

Game development

Scene setup, asset placement, and rendering tasks in Unreal Engine.

Mold flow analysis

Simulation and mold flow analysis tasks in Moldex3D manufacturing software.

Architectural modeling

3D modeling and energy analysis workflows in Rhino 3D for urban design.

Brain imaging

Neuroimaging analysis and brain structure segmentation tasks in FSLeyes.

Sample tasks

A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.

business finance

sec_10k_financial_parsing

Parse an SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.

business finance

financial_stmt_reconstruction_aapl_fy2024

Reconstruct Apple’s FY2024 financial statement from primary disclosure documents. Validates whether the agent surfaces the exact reported figures and footnote-relevant adjustments.

engineering

mold-flow / 220089

Set up a Moldex3D mold-flow simulation, run it to convergence, and report fill time / pressure metrics matching the held-out reference run.

health medicine

Clinical_Variant_Annotation

Annotate a clinical variant set using standard pipelines (VEP, ClinVar, etc.) and produce a report graded against a curated reference.

life sciences

WGS_Variant_Calling

Run a whole-genome sequencing variant-calling pipeline and produce VCF output. Scored on precision and recall against a held-out truth VCF.

computing math

k8s_payment_api_root_cause_analysis

Diagnose a failing payment API in a Kubernetes cluster. Multi-hop investigation across logs, metrics, manifests, and traces, scored on correct root-cause identification.

visual media

video_storyboard_001

Build a shot-by-shot video storyboard from a brief, formatted to industry conventions. Graded on coverage, continuity, and adherence to the reference shot list.

legal

legal_dr_fees_01

Compute legal fees from a billing register according to jurisdictional rules. Tests structured extraction plus rule-following against an authoritative reference total.
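
As an illustration of how a deterministic grader can score a task such as WGS_Variant_Calling on precision and recall, here is a minimal sketch. It is not the ALE grader: it assumes variants have already been parsed out of the VCF files into (chromosome, position, ref, alt) tuples and compares them by exact match, whereas a production comparison would typically also normalize variant representation. The `Variant` alias and `precision_recall` function are hypothetical names.

```python
# A variant reduced to (chromosome, 1-based position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def precision_recall(called: set[Variant], truth: set[Variant]) -> tuple[float, float]:
    """Exact-match precision and recall of called variants against a truth set."""
    true_positives = len(called & truth)
    # Precision: share of called variants that are in the truth set.
    precision = true_positives / len(called) if called else 0.0
    # Recall: share of truth variants that were called.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up toy data: four truth variants, three calls, two of them correct.
truth = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr3", 300, "T", "C"),
}
called = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 9999, "A", "C"),  # not in the truth set: a false positive
}

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With the toy data, two of three calls are correct (precision 2/3) and two of four truth variants are recovered (recall 1/2). Because the comparison is a pure set operation, the same inputs always produce the same grade.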

Methodology

Metrics

Pass Rate: the fraction of tasks the agent fully completed (strict success). Score: the average graded outcome across all tasks, including partial credit. Both are computed by deterministic graders against hidden references.
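
A minimal sketch of how the two metrics relate, under stated assumptions: each task's graded outcome is a number in [0, 1], and a task counts as fully completed only when that outcome is 1.0. The source does not specify the grading scale, so the scale, the threshold, and the function names below are all hypothetical.

```python
def pass_rate(task_outcomes: list[float], pass_threshold: float = 1.0) -> float:
    """Percent of tasks whose graded outcome reaches the pass threshold."""
    if not task_outcomes:
        raise ValueError("at least one task outcome is required")
    passed = sum(1 for outcome in task_outcomes if outcome >= pass_threshold)
    return 100.0 * passed / len(task_outcomes)


def mean_score(task_outcomes: list[float]) -> float:
    """Average graded outcome across all tasks, as a percent (partial credit counts)."""
    if not task_outcomes:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(task_outcomes) / len(task_outcomes)


# Made-up outcomes for four tasks: one full pass, two partial, one failure.
outcomes = [1.0, 0.5, 0.0, 0.75]
print(pass_rate(outcomes))   # 25.0
print(mean_score(outcomes))  # 56.25
```

Under these assumptions Score can never fall below Pass Rate, since every fully completed task contributes its full weight to the average. The gap between the two shows how much partial credit an agent earns on tasks it does not finish.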

Verifiable Outcomes

Hidden references plus deterministic graders, not LLM-as-a-judge. Tasks are sourced from real professional workflows (After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, FSLeyes, and 49 more applications) and validated by domain experts before inclusion.

Rolling Evaluation

Roughly every six months, a new public subset is released with fresh instances. Private tasks rotate into the public pool, retired public tasks rotate out, and the official leaderboard is scored on held-out private tasks, all to limit benchmark leakage.
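
One rotation step can be pictured with the sketch below. It is purely illustrative: the source does not say how tasks are chosen for retirement or promotion, so random selection, the `rotate_pools` function, and the task IDs are all assumptions, and the generation of fresh instances is not modeled.

```python
import random


def rotate_pools(
    public: set[str],
    private: set[str],
    n_retire: int,
    n_promote: int,
    seed: int = 0,
) -> tuple[set[str], set[str], set[str]]:
    """Run one rotation; return (new_public, new_private, retired)."""
    rng = random.Random(seed)
    # Sort before sampling so a given seed always picks the same tasks.
    retired = set(rng.sample(sorted(public), n_retire))
    promoted = set(rng.sample(sorted(private), n_promote))
    new_public = (public - retired) | promoted
    new_private = private - promoted  # still held out for leaderboard scoring
    return new_public, new_private, retired


# Hypothetical task IDs.
public = {"pub-1", "pub-2", "pub-3", "pub-4"}
private = {"priv-1", "priv-2", "priv-3", "priv-4", "priv-5", "priv-6"}

new_public, new_private, retired = rotate_pools(public, private, n_retire=2, n_promote=3)
print(sorted(new_public))
print(sorted(new_private))
print(sorted(retired))
```

The point of the sketch is the invariant rather than the selection rule: after a rotation, the public and private pools remain disjoint and retired tasks appear in neither.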

Reference Harnesses

Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE Claw reference harness.

Acknowledgments

Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.

The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).
