Released June 09, 2026
Open Benchmarks Grants

Agents' Last Exam

A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.

Built with Snorkel AI

Overview

Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.

ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation: every ~6 months a new public subset is published with fresh instances, while private tasks rotate in and retired public tasks rotate out, to limit benchmark leakage.

Leaderboard

Rank | Harness | Model | Pass Rate | Score | Runtime | Input Tokens | Output Tokens
1 | Codex | gpt-5-5 | 24% | 42.8% | 369h 50m | 1.6B | 7.2M
2 | ALE Claw | gpt-5-5 | 23% | 45.8% | 47h 20m | 334.5M | 2.4M
3 | Claude Code | claude-fable-5 | 22% | 40.5% | 197h 38m | 886.6M | 9.6M
4 | OpenClaw | gpt-5-5 | 21.1% | 41% | 92h 51m | 471.1M | 3.3M
5 | Cursor CLI | gpt-5-5 | 20.7% | 39.6% | 82h 13m | 154.2M | 1.7M
6 | OpenClaw | gpt-5-4 | 20.5% | 37.3% | 162h 16m | 545.5M | 8.7M
7 | Cursor CLI | composer-2-5 | 20.4% | 38.5% | 249h 59m | 338.8M | 2.9M
8 | Droid | gpt-5-5 | 19.1% | 38.6% | 88h 10m | 243.2M | 2.3M
9 | ALE Claw | claude-opus-4-7 | 18.4% | 40.5% | 87h 54m | 1.4B | 5.7M
10 | Claude Code | claude-opus-4-8 | 15.8% | 37.2% | 451h 15m | 452.0M | 3.8M
11 | Gemini CLI | gemini-3-1-pro-preview | 15.8% | 32% | 272h 28m | 1.2B | 3.5M
12 | OpenClaw | claude-opus-4-7 | 15.1% | 34.6% | 143h 19m | 833.0M | 4.1M
13 | OpenClaw | claude-opus-4-6 | 14.1% | 32.5% | 164h 33m | 441.2M | 4.2M
14 | OpenClaw | gemini-3-1-pro-preview | 14.1% | 28.7% | 174h 18m | 3.6B | 4.0M
15 | Claude Code | claude-opus-4-7 | 13.2% | 35.1% | 50h 38m | 456.4M | 3.7M
16 | Droid | claude-opus-4-7 | 12.8% | 31% | 35h 54m | 356.5M | 2.8M
17 | OpenClaw | deepseek-v4-pro | 12.4% | 27.6% | 233h 3m | 893.3M | 8.7M
18 | OpenClaw | qwen3-7-max | 11.8% | 31.1% | 190h 45m | 1.4B | 17.6M
19 | ALE Claw | gpt-5-4 | 11.8% | 28.2% | 65h 6m | 1.1B | 2.1M
20 | OpenClaw | glm-5-1 | 11.5% | 28.2% | 321h 11m | 1.4B | 11.4M
21 | OpenClaw | kimi-k2-6 | 9.2% | 21.7% | 292h 52m | 453.4M | 9.3M
22 | OpenClaw | qwen3-6-plus | 8.6% | 24.3% | 258h 22m | 1.2B | 12.6M
23 | OpenClaw | mimo-v2-5 | 8.6% | 23.6% | 194h 48m | 730.8M | 7.2M
24 | Codex | gpt-5-4 | 7.2% | 12.8% | 49h 6m | 210.7M | 3.3M
25 | Grok CLI | grok-4-3 | 6.6% | 20.1% | 62h 38m | 232.4M | 2.4M
26 | OpenClaw | minimax-m2-7 | 5.9% | 14.2% | 190h 12m | 367.5M | 6.0M
27 | Grok CLI | grok-3 | 4.6% | 12.6% | 32h 11m | 55.7M | 516.5K
28 | OpenClaw | grok-4-3 | 4.3% | 15.5% | 176h 50m | 311.9M | 5.0M
29 | Gemini CLI | gemini-3-5-flash | 0% | 0% | 8m 28s | 2.2M | 38.8K
30 | OpenClaw CLI | qwen3-7-max | 0% | 0% | 1h 36m | 1.4M | 10.9K

Sample tasks

A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.

business_finance
sec_10k_financial_parsing
Parse a SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.

business_finance
financial_stmt_reconstruction_aapl_fy2024
Reconstruct Apple’s FY2024 financial statement from primary disclosure documents. Validates whether the agent surfaces the exact reported figures and footnote-relevant adjustments.

engineering
mold-flow / 220089
Set up a Moldex3D mold-flow simulation, run it to convergence, and report fill time / pressure metrics matching the held-out reference run.

health_medicine
Clinical_Variant_Annotation
Annotate a clinical variant set using standard pipelines (VEP, ClinVar, etc.) and produce a report graded against a curated reference.

life_sciences
WGS_Variant_Calling
Run a whole-genome sequencing variant-calling pipeline and produce VCF output. Scored on precision and recall against a held-out truth VCF.

computing_math
k8s_payment_api_root_cause_analysis
Diagnose a failing payment API in a Kubernetes cluster. Multi-hop investigation across logs, metrics, manifests, and traces, scored on the correct root-cause identification.

visual_media
video_storyboard_001
Build a shot-by-shot video storyboard from a brief, formatted to industry conventions. Graded on coverage, continuity, and adherence to the reference shot list.

legal
legal_dr_fees_01
Compute legal fees from a billing register according to jurisdictional rules. Tests structured extraction plus rule-following against an authoritative reference total.
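
As a rough illustration of the precision/recall scoring mentioned for WGS_Variant_Calling, the sketch below compares a set of called variants against a truth set. It assumes variants match only when chromosome, position, reference allele, and alternate allele are identical; the function name and tuple layout are hypothetical, and the actual ALE grader may normalize or match variants differently.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for exact-match variant keys.

    Each variant is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Illustrative data only, not taken from the benchmark.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr3", 10, "A", "T"),  # false positive: not in the truth set
}

p, r = precision_recall(called, truth)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of three calls are correct and two of four true variants are recovered, so precision is 2/3 and recall is 1/2.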

Methodology

Metrics

Pass Rate: fraction of tasks the agent fully completed (strict success).
Score: average graded outcome across all tasks, including partial credit.
Both are computed by deterministic graders against hidden references.
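
The two metrics can be sketched as simple aggregates over per-task results. This is a minimal illustration, assuming each task yields a strict pass flag and a graded score in [0, 1]; the `TaskResult` type, its field names, and the sample values are hypothetical and not part of the ALE framework.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    passed: bool   # strict success: the task was fully completed
    score: float   # graded outcome in [0, 1], including partial credit


def pass_rate(results):
    """Fraction of tasks with strict success."""
    return sum(r.passed for r in results) / len(results)


def mean_score(results):
    """Average graded outcome across all tasks."""
    return sum(r.score for r in results) / len(results)


# Illustrative values only, not real leaderboard data.
results = [
    TaskResult("task_a", passed=True, score=1.0),
    TaskResult("task_b", passed=False, score=0.5),
    TaskResult("task_c", passed=False, score=0.0),
    TaskResult("task_d", passed=False, score=0.5),
]

print(f"Pass Rate: {pass_rate(results):.1%}")  # Pass Rate: 25.0%
print(f"Score: {mean_score(results):.1%}")     # Score: 50.0%
```

Because partial credit counts toward Score but not toward Pass Rate, Score is typically the higher of the two, as in the leaderboard rows.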

Verifiable Outcomes

Outcomes are checked with hidden references and deterministic graders, not LLM-as-a-judge. Tasks are sourced from real professional workflows (After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, FSLeyes, and 49 more applications) and validated by domain experts before inclusion.

Rolling Evaluation

Every ~6 months, a new public subset releases with fresh instances. Private tasks rotate into the public pool, retired public tasks rotate out, and held-out private tasks score the official leaderboard, to limit benchmark leakage.
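
One way to picture the rotation is as a set operation over task pools. The sketch below is only an assumption about how such a cycle could be modeled: the `TaskPools` type, the `rotate` function, the random selection policy, and the task IDs are all hypothetical, and generating fresh instances for public tasks is not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskPools:
    public: frozenset
    private: frozenset
    retired: frozenset = frozenset()


def rotate(pools, n_retire, n_promote, seed=0):
    """Return the pools after one release cycle.

    Retires n_retire public tasks and promotes n_promote private tasks.
    Random selection is an assumption; the real policy is not specified.
    """
    rng = random.Random(seed)
    # Sort before sampling so the result is reproducible for a given seed.
    retiring = frozenset(rng.sample(sorted(pools.public), n_retire))
    promoted = frozenset(rng.sample(sorted(pools.private), n_promote))
    return TaskPools(
        public=(pools.public - retiring) | promoted,
        private=pools.private - promoted,
        retired=pools.retired | retiring,
    )


# Illustrative task IDs only.
v1 = TaskPools(
    public=frozenset({"pub_1", "pub_2", "pub_3"}),
    private=frozenset({"priv_1", "priv_2", "priv_3", "priv_4"}),
)
v2 = rotate(v1, n_retire=1, n_promote=2)
print(len(v2.public), len(v2.private), len(v2.retired))  # 4 2 1
```

Tasks that stay in the private pool after rotation are the ones that would remain held out for the official leaderboard.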

Reference Harnesses

Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE Claw reference harness.

Acknowledgments

Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.

The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).
