Agents' Last Exam

A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.

Overview

Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.

ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation to limit benchmark leakage: every ~6 months a new public subset is published with fresh instances, private tasks rotate in, and retired public tasks rotate out.

Leaderboard

Rank	Harness	Model	Pass Rate	Score	Runtime	Input Tokens	Output Tokens
1	Codex	gpt-5-5	24%	42.8%	369h 50m	1.6B	7.2M
2	ALE Claw	gpt-5-5	23%	45.8%	47h 20m	334.5M	2.4M
3	Claude Code	claude-fable-5	22%	40.5%	197h 38m	886.6M	9.6M
4	OpenClaw	gpt-5-5	21.1%	41%	92h 51m	471.1M	3.3M
5	Cursor CLI	gpt-5-5	20.7%	39.6%	82h 13m	154.2M	1.7M
6	OpenClaw	gpt-5-4	20.5%	37.3%	162h 16m	545.5M	8.7M
7	Cursor CLI	composer-2-5	20.4%	38.5%	249h 59m	338.8M	2.9M
8	Droid	gpt-5-5	19.1%	38.6%	88h 10m	243.2M	2.3M
9	ALE Claw	claude-opus-4-7	18.4%	40.5%	87h 54m	1.4B	5.7M
10	Claude Code	claude-opus-4-8	15.8%	37.2%	451h 15m	452.0M	3.8M
11	Gemini CLI	gemini-3-1-pro-preview	15.8%	32%	272h 28m	1.2B	3.5M
12	OpenClaw	claude-opus-4-7	15.1%	34.6%	143h 19m	833.0M	4.1M
13	OpenClaw	claude-opus-4-6	14.1%	32.5%	164h 33m	441.2M	4.2M
14	OpenClaw	gemini-3-1-pro-preview	14.1%	28.7%	174h 18m	3.6B	4.0M
15	Claude Code	claude-opus-4-7	13.2%	35.1%	50h 38m	456.4M	3.7M
16	Droid	claude-opus-4-7	12.8%	31%	35h 54m	356.5M	2.8M
17	OpenClaw	deepseek-v4-pro	12.4%	27.6%	233h 3m	893.3M	8.7M
18	OpenClaw	qwen3-7-max	11.8%	31.1%	190h 45m	1.4B	17.6M
19	ALE Claw	gpt-5-4	11.8%	28.2%	65h 6m	1.1B	2.1M
20	OpenClaw	glm-5-1	11.5%	28.2%	321h 11m	1.4B	11.4M
21	OpenClaw	kimi-k2-6	9.2%	21.7%	292h 52m	453.4M	9.3M
22	OpenClaw	qwen3-6-plus	8.6%	24.3%	258h 22m	1.2B	12.6M
23	OpenClaw	mimo-v2-5	8.6%	23.6%	194h 48m	730.8M	7.2M
24	Codex	gpt-5-4	7.2%	12.8%	49h 6m	210.7M	3.3M
25	Grok CLI	grok-4-3	6.6%	20.1%	62h 38m	232.4M	2.4M
26	OpenClaw	minimax-m2-7	5.9%	14.2%	190h 12m	367.5M	6.0M
27	Grok CLI	grok-3	4.6%	12.6%	32h 11m	55.7M	516.5K
28	OpenClaw	grok-4-3	4.3%	15.5%	176h 50m	311.9M	5.0M
29	Gemini CLI	gemini-3-5-flash	0%	0%	8m 28s	2.2M	38.8K
30	OpenClaw CLI	qwen3-7-max	0%	0%	1h 36m	1.4M	10.9K

Sample tasks

A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.

business_finance

sec_10k_financial_parsing

Parse a SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.

business_finance

financial_stmt_reconstruction_aapl_fy2024

Reconstruct Apple’s FY2024 financial statement from primary disclosure documents. Validates whether the agent surfaces the exact reported figures and footnote-relevant adjustments.

engineering

mold-flow / 220089

Set up a Moldex3D mold-flow simulation, run it to convergence, and report fill time / pressure metrics matching the held-out reference run.

health_medicine

Clinical_Variant_Annotation

Annotate a clinical variant set using standard pipelines (VEP, ClinVar, etc.) and produce a report graded against a curated reference.

life_sciences

WGS_Variant_Calling

Run a whole-genome sequencing variant-calling pipeline and produce VCF output. Scored on precision and recall against a held-out truth VCF.

computing_math

k8s_payment_api_root_cause_analysis

Diagnose a failing payment API in a Kubernetes cluster. Multi-hop investigation across logs, metrics, manifests, and traces, scored on correct identification of the root cause.

visual_media

video_storyboard_001

Build a shot-by-shot video storyboard from a brief, formatted to industry conventions. Graded on coverage, continuity, and adherence to the reference shot list.

legal

legal_dr_fees_01

Compute legal fees from a billing register according to jurisdictional rules. Tests structured extraction plus rule-following against an authoritative reference total.

Methodology

Metrics

Pass Rate: the fraction of tasks the agent fully completed (strict success). Score: the average graded outcome across all tasks, including partial credit. Both are computed by deterministic graders against hidden references.
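The relationship between the two metrics can be illustrated with a minimal sketch. It assumes each graded task yields a strict pass flag and a partial-credit score in [0, 1]. The `TaskResult` type, its field names, and the sample values are hypothetical and not part of the ALE framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task grader output (not the ALE schema)."""
    task_id: str
    passed: bool   # strict success: the task was fully completed
    score: float   # graded outcome in [0, 1], including partial credit


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks with strict success."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passed) / len(results)


def mean_score(results: list[TaskResult]) -> float:
    """Average graded outcome across all tasks, counting partial credit."""
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)


# Made-up example: one full pass, two partial results, one failure.
results = [
    TaskResult("task_a", True, 1.0),
    TaskResult("task_b", False, 0.5),
    TaskResult("task_c", False, 0.25),
    TaskResult("task_d", False, 0.0),
]
print(f"Pass Rate: {pass_rate(results):.1%}")  # 25.0%
print(f"Score: {mean_score(results):.1%}")     # 43.8%
```

Under these assumptions, Score is never lower than Pass Rate, because every fully completed task also contributes full credit to the average. This matches the pattern in the leaderboard, where each Score is at least as high as the corresponding Pass Rate.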

Verifiable Outcomes

Outcomes are checked by deterministic graders against hidden references, not by an LLM-as-a-judge. Tasks are sourced from real professional workflows (After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, FSLeyes, and 49 more applications) and validated by domain experts before inclusion.

Rolling Evaluation

Every ~6 months, a new public subset is released with fresh instances. Private tasks rotate into the public pool and retired public tasks rotate out, while held-out private tasks are used to score the official leaderboard. Together these steps limit benchmark leakage.
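One rotation cycle can be sketched as simple set bookkeeping. This is an illustrative model only: the function `rotate_pools`, its parameters, and the task IDs are hypothetical, and the sketch does not model how fresh instances are generated or how tasks are chosen for promotion or retirement.

```python
def rotate_pools(
    public: set[str],
    private: set[str],
    retire: set[str],
    promote: set[str],
) -> tuple[set[str], set[str]]:
    """Return (new_public, new_private) after one hypothetical rotation cycle."""
    if not retire <= public:
        raise ValueError("only tasks that are currently public can be retired")
    if not promote <= private:
        raise ValueError("only tasks that are currently private can be promoted")
    # Retired tasks leave the public pool; promoted tasks join it.
    new_public = (public - retire) | promote
    # Promoted tasks are no longer held out.
    new_private = private - promote
    return new_public, new_private


# Made-up task IDs for illustration.
public, private = rotate_pools(
    public={"t1", "t2"},
    private={"t3", "t4"},
    retire={"t1"},
    promote={"t3"},
)
print(sorted(public))   # ['t2', 't3']
print(sorted(private))  # ['t4']
```

In this model the two pools stay disjoint after every cycle, so a task that remains in the private pool has never been published.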

Reference Harnesses

Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE Claw reference harness.

Resources

GitHub

Paper

Contribute

Website

Acknowledgments

Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.

The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).
