Agents' Last Exam

A benchmark for evaluating AI agents on long-horizon, economically valuable professional workflows with verifiable outcomes. 55 sub-industries, 1,500+ tasks toward a 5,000-task target, sourced and validated by 300+ industry experts.

Overview

Agents’ Last Exam (ALE) is building the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. The benchmark covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy), spanning all 55 targeted sub-industries.

ALE-V1 ships 147 reference tasks across 55 sub-industries as the current public subset of a 1,500+ task corpus. Many tasks require private data or licensed software and remain in a separate private pool. ALE uses rolling evaluation: every ~6 months a new public subset is published with fresh instances, while private tasks rotate in and retired public tasks rotate out, to limit benchmark leakage.

Leaderboard

Rank	Harness	Model	Effort	Pass Rate	Score	Est. Cost	Runtime	Input Tokens	Output Tokens
1	Codex	GPT-5.6 Sol	XHigh	30.6%	53.6%	$762	94h 39m	762.9M	3.8M
2	Codex	GPT-5.6 Sol	High	30.6%	52.1%	$577	94h 33m	594.0M	2.9M
3	Codex	GPT-5.6 Sol	Max	29.6%	52.6%	$1,085	103h 46m	1.1B	5.4M
4	Codex	GPT-5.6 Luna	XHigh	29.6%	48.3%	$235	66h 7m	1.4B	4.2M
5	Codex	GPT-5.6 Sol	Medium	29.5%	51.9%	$518	91h 17m	539.9M	2.2M
6	Kimi Code	Kimi K3	Max	28.3%	51.6%	—	186h 5m	1.4B	7.5M
7	Codex	GPT-5.6 Luna	Max	28.3%	49.2%	$391	66h 27m	2.5B	6.8M
8	Codex	GPT-5.6 Terra	Max	28%	50.5%	$545	118h 39m	1.2B	5.8M
9	Codex	GPT-5.6 Terra	XHigh	27.6%	48.5%	$381	87h 20m	822.5M	3.8M
10	Claude Code	Kimi K3	Max	27%	50.7%	$773	215h 32m	6.0B	7.4M
11	Claude Code	Claude Opus 4.8	Max	27%	45.2%	$3,985	179h 19m	1.7B	22.1M
12	Codex	GPT-5.5	XHigh	26.6%	47.9%	$602	97h 8m	560.5M	6.4M
13	Codex	GPT-5.6 Terra	High	26%	46.3%	$264	98h 34m	533.9M	2.8M
14	Claude Code	Claude Fable 5	XHigh	25.7%	48.7%	$4,340	70h 0m	2.7B	10.4M
15	Codex	GPT-5.5	High	24.2%	44%	$416	75h 15m	489.9M	3.5M
16	Codex	GPT-5.6 Luna	High	23.7%	44.9%	$128	45h 53m	699.4M	2.6M
17	Codex	GPT-5.6 Sol	Low	23.6%	44.8%	$248	75h 8m	252.0M	1.3M
18	ALE-Claw	GPT-5.5	High	23%	45.8%	$310	35h 34m	334.3M	2.3M
19	Codex	GPT-5.6 Terra	Medium	22.7%	42.6%	$128	86h 4m	244.6M	1.6M
20	Claude Code	Claude Opus 4.8	XHigh	22.4%	42.3%	$2,815	144h 16m	1.2B	16.5M
21	Claude Code	Claude Fable 5	Adaptive	22%	40.5%	$2,315	106h 14m	880.0M	9.4M
22	OpenClaw	GPT-5.5	High	21.1%	41%	$447	82h 38m	469.3M	3.3M
23	Cursor CLI	GPT-5.5	Medium	20.7%	39.6%	$174	58h 47m	148.1M	1.7M
24	OpenClaw	GPT-5.4	High	20.5%	37.3%	$274	127h 46m	488.8M	7.3M
25	Cursor CLI	Claude Opus 4.7	High	20.4%	41.8%	$1,899	63h 49m	429.5M	4.5M
26	Claude Code	GLM-5.2	Max	20.4%	40.6%	$1,086	107h 42m	1.3B	7.7M
27	Codex	GPT-5.6 Terra	Low	20.4%	40.2%	$89	76h 23m	171.7M	1.1M
28	Claude Code	Claude Opus 4.8	High	20.4%	38.7%	$1,626	82h 49m	1.8B	6.4M
29	Cursor CLI	Composer 2.5	Adaptive	20.4%	38.5%	$177	77h 48m	338.8M	2.9M
30	Claude Code	Seed 2.1 Pro	-	19.5%	41.4%	$936	155h 14m	2.1B	8.4M
31	Droid	GPT-5.5	High	19.1%	38.6%	$242	63h 49m	234.8M	2.2M
32	ALE-Claw	Claude Opus 4.7	High	18.4%	40.5%	$1,132	77h 39m	1.3B	5.6M
33	Codex	GPT-5.5	Medium	18.4%	39.8%	$358	75h 11m	294.4M	2.9M
34	Codex	GPT-5.6 Luna	Medium	17.1%	37.5%	$57	42h 8m	287.2M	1.1M
35	Codex	GPT-5.5	Low	17.1%	36.4%	$147	38h 40m	109.3M	1.3M
36	Gemini CLI	Gemini 3.1 Pro	High	15.8%	32%	$2,018	92h 21m	1.2B	3.5M
37	OpenClaw	Claude Opus 4.7	High	15.1%	34.6%	$1,689	124h 10m	807.3M	4.0M
38	OpenClaw	Gemini 3.1 Pro	High	14.1%	28.7%	$2,969	147h 6m	3.4B	3.7M
39	Claude Code	Claude Opus 4.7	High	13.2%	35.1%	$1,793	42h 36m	456.4M	3.7M
40	OpenClaw	Seed 2.1 Pro	High	13.1%	32.2%	$440	190h 18m	1.6B	20.4M
41	Droid	Claude Opus 4.7	High	12.8%	31%	$1,356	28h 16m	352.1M	2.7M
42	OpenClaw	DeepSeek V4 Pro	High	12.4%	27.6%	$273	130h 47m	511.7M	4.7M
43	OpenClaw	Qwen3 7-Max	High	11.8%	31.1%	$664	101h 0m	1.4B	17.6M
44	Codex	GPT-5.6 Luna	Low	11.8%	30.5%	$27	24h 35m	119.9M	610.3K
45	ALE-Claw	GPT-5.4	High	11.8%	28.2%	$335	55h 28m	1.1B	2.0M
46	OpenClaw	GLM-5.1	High	11.5%	28.1%	$440	168h 39m	755.8M	5.9M
47	OpenClaw	Kimi K2.6	High	9.2%	21.7%	$123	183h 18m	295.9M	6.0M
48	OpenClaw	Qwen3 6-Plus	High	8.6%	24.3%	$254	134h 43m	738.9M	7.0M
49	OpenClaw	MiMo V2.5	High	8.6%	23.6%	$43	107h 22m	394.9M	4.0M
50	Grok CLI	Grok 4.3	-	6.6%	20.1%	$285	43h 7m	223.1M	2.3M
51	OpenClaw	MiniMax M2.7	High	5.9%	14.2%	$27	144h 37m	277.2M	4.4M
52	OpenClaw	Grok 4.3	High	4.3%	15.5%	$80	79h 4m	168.7M	2.7M

Rank	Harness	Model	Effort	Pass Rate	Score	Est. Cost	Runtime	Input Tokens	Output Tokens
1	Codex	GPT-5.6 Sol	XHigh	28.6%	53.6%	$437	49h 46m	443.5M	2.5M
2	Codex	GPT-5.6 Sol	Medium	28.6%	52.8%	$279	47h 32m	300.1M	1.4M
3	Codex	GPT-5.6 Luna	XHigh	28.6%	49.1%	$144	24h 53m	910.7M	2.8M
4	Claude Code	Kimi K3	Max	27.6%	52.8%	$418	109h 24m	2.0B	4.7M
5	Kimi Code	Kimi K3	Max	27.6%	52%	—	102h 15m	604.4M	5.0M
6	Codex	GPT-5.6 Luna	Max	27.6%	50.7%	$246	35h 22m	1.6B	4.3M
7	Codex	GPT-5.6 Sol	High	27.1%	50.9%	$344	51h 12m	359.1M	1.9M
8	Codex	GPT-5.5	XHigh	27.1%	49.2%	$385	47h 34m	357.1M	4.3M
9	Codex	GPT-5.6 Sol	Max	26.7%	50.7%	$598	52h 15m	609.6M	3.5M
10	Codex	GPT-5.6 Terra	XHigh	26.2%	49.4%	$235	49h 6m	528.0M	2.6M
11	Codex	GPT-5.6 Terra	Max	25.7%	49.8%	$358	58h 41m	843.8M	3.9M
12	Claude Code	Claude Opus 4.8	Max	25.7%	44.1%	$1,953	110h 57m	642.2M	15.3M
13	ALE-Claw	GPT-5.5	High	24.8%	48.1%	$109	16h 44m	77.0M	1.3M
14	Codex	GPT-5.6 Terra	High	24.3%	46.6%	$146	46h 22m	297.4M	1.8M
15	Claude Code	Claude Fable 5	XHigh	23.8%	48.5%	$1,663	34h 35m	900.2M	6.4M
16	Codex	GPT-5.6 Luna	High	23.8%	46%	$73	16h 49m	413.7M	1.7M
17	OpenClaw	GPT-5.5	High	23.8%	44.4%	$219	31h 30m	211.2M	2.2M
18	Claude Code	GLM-5.2	Max	23.8%	43.4%	$981	64h 18m	1.2B	6.3M
19	Codex	GPT-5.6 Sol	Low	23.3%	46.3%	$132	44h 41m	140.2M	820.3K
20	Codex	GPT-5.5	High	23.3%	44.6%	$221	31h 18m	292.7M	2.1M
21	OpenClaw	GPT-5.4	High	22.5%	41.8%	$169	50h 7m	285.8M	4.9M
22	Codex	GPT-5.6 Terra	Medium	22.4%	43.9%	$65	53h 4m	125.1M	1.0M
23	Cursor CLI	Claude Opus 4.7	High	21.9%	44.7%	$1,045	24h 54m	207.5M	2.5M
24	Claude Code	Claude Fable 5	Adaptive	21.9%	44.6%	$880	69h 43m	301.4M	6.9M
25	Claude Code	Claude Opus 4.8	High	21.9%	42.7%	$1,016	58h 43m	1.1B	4.2M
26	Cursor CLI	GPT-5.5	Medium	21.9%	41.8%	$97	14h 30m	68.5M	1.1M
27	Codex	GPT-5.6 Terra	Low	20.5%	42.5%	$50	45h 45m	99.3M	779.0K
28	ALE-Claw	Claude Opus 4.7	High	20%	43.3%	$416	44h 15m	389.6M	2.9M
29	Claude Code	Claude Opus 4.8	XHigh	20%	42.1%	$1,713	84h 37m	641.0M	11.9M
30	Claude Code	Seed 2.1 Pro	-	19%	41.4%	$602	88h 11m	1.1B	5.8M
31	Cursor CLI	Composer 2.5	Adaptive	19%	40.8%	$81	60h 10m	153.2M	1.8M
32	Droid	GPT-5.5	High	19%	39.5%	$136	37h 27m	117.0M	1.5M
33	Codex	GPT-5.6 Luna	Medium	18.1%	39.8%	$25	13h 32m	127.3M	721.7K
34	Codex	GPT-5.5	Low	18.1%	39.4%	$96	16h 26m	75.9M	901.9K
35	Codex	GPT-5.5	Medium	17.1%	39.4%	$195	26h 28m	157.9M	1.9M
36	Gemini CLI	Gemini 3.1 Pro	High	17.1%	36.2%	$1,010	52h 27m	450.8M	2.4M
37	OpenClaw	Claude Opus 4.7	High	16.2%	37.8%	$960	53h 29m	516.2M	2.8M
38	OpenClaw	Gemini 3.1 Pro	High	15.7%	31.7%	$1,303	66h 23m	1.4B	2.3M
39	Claude Code	Claude Opus 4.7	High	14.3%	38%	$1,165	25h 47m	281.0M	2.7M
40	Droid	Claude Opus 4.7	High	14.3%	33.7%	$710	19h 17m	176.9M	1.7M
41	OpenClaw	DeepSeek V4 Pro	High	14.1%	30.8%	$161	56h 12m	312.8M	3.3M
42	OpenClaw	Qwen3 7-Max	High	13.3%	34%	$534	34h 23m	1.0B	15.8M
43	ALE-Claw	GPT-5.4	High	13.3%	33.1%	$168	26h 31m	541.3M	1.1M
44	Forgecode	Claude Sonnet 4.6	Medium	13.3%	28.5%	$100	51h 31m	136.5M	1.7M
45	Hermes	Claude Sonnet 4.6	Medium	13.2%	32%	$437	25h 12m	195.6M	2.0M
46	OpenClaw	GLM-5.1	High	12.9%	30.8%	$213	73h 57m	361.0M	3.3M
47	Terminus 2	Claude Sonnet 4.6	Off	11.9%	30.9%	$320	73h 1m	760.4M	3.1M
48	OpenClaw	Seed 2.1 Pro	High	11.4%	33.2%	$440	98h 0m	1.4B	18.4M
49	Codex	GPT-5.6 Luna	Low	11.4%	32.8%	$13	9h 12m	56.3M	407.1K
50	OpenClaw	Claude Sonnet 4.6	High	11.4%	31%	$181	33h 51m	247.0M	1.9M
51	OpenClaw	Qwen3 6-Plus	High	10.5%	28.6%	$128	62h 28m	366.8M	4.4M
52	OpenClaw	MiMo V2.5	High	10%	26.5%	$32	51h 24m	286.4M	3.0M
53	Openhands	Claude Sonnet 4.6	Not reported	9%	19.8%	$243	35h 47m	349.9M	4.4M
54	OpenClaw	Kimi K2.6	High	8.1%	21.2%	$91	89h 31m	222.9M	4.4M
55	Grok CLI	Grok 4.3	-	7.6%	24.3%	$205	36h 12m	161.0M	1.7M
56	OpenClaw	MiniMax M2.7	High	5.7%	14.6%	$22	52h 56m	242.7M	3.2M
57	OpenClaw	Grok 4.3	High	4.3%	17.5%	$61	36h 8m	135.8M	2.1M

Rank	Harness	Model	Effort	Pass Rate	Score	Est. Cost	Runtime	Input Tokens	Output Tokens
1	Codex	GPT-5.6 Sol	Max	47.8%	78.8%	$322	40h 29m	304.2M	1.8M
2	Codex	GPT-5.6 Luna	XHigh	47.8%	74.1%	$69	19h 22m	359.7M	1.6M
3	Codex	GPT-5.6 Sol	High	47%	77.5%	$172	33h 30m	155.3M	1.0M
4	Codex	GPT-5.6 Sol	XHigh	45.5%	78.3%	$210	34h 43m	190.0M	1.3M
5	Codex	GPT-5.6 Sol	Medium	45.3%	75.6%	$137	34h 16m	124.2M	787.5K
6	Codex	GPT-5.6 Terra	XHigh	44.8%	73.9%	$96	31h 27m	170.7M	1.3M
7	Codex	GPT-5.5	XHigh	44%	72.4%	$170	29h 32m	126.2M	2.3M
8	Codex	GPT-5.6 Luna	Max	43.3%	73.2%	$96	21h 49m	541.7M	2.2M
9	Codex	GPT-5.6 Terra	High	43.3%	70.8%	$84	40h 16m	156.0M	1.0M
10	Claude Code	Claude Opus 4.8	Max	43.3%	64%	$1,106	49h 40m	310.5M	8.3M
11	Codex	GPT-5.6 Terra	Max	41.8%	73.3%	$141	37h 13m	268.6M	2.0M
12	Claude Code	Kimi K3	Max	40.3%	71.6%	$242	66h 36m	1.5B	2.7M
13	Codex	GPT-5.6 Sol	Low	39.3%	66.8%	$80	26h 43m	71.0M	467.6K
14	Kimi Code	Kimi K3	Max	38.8%	69.7%	—	50h 48m	256.7M	2.2M
15	Codex	GPT-5.6 Terra	Medium	38.8%	66%	$47	36h 15m	84.6M	590.9K
16	Codex	GPT-5.5	High	38.6%	66.2%	$121	27h 0m	162.7M	1.3M
17	Claude Code	Claude Fable 5	XHigh	37.3%	71.1%	$1,018	21h 14m	504.9M	3.6M
18	Codex	GPT-5.6 Luna	High	35.8%	66.2%	$46	16h 50m	228.8M	1.1M
19	OpenClaw	GPT-5.5	High	35.8%	65.7%	$218	41h 19m	223.0M	1.5M
20	Claude Code	Claude Opus 4.8	XHigh	35.8%	62.7%	$683	42h 8m	201.3M	6.0M
21	Claude Code	Claude Fable 5	Adaptive	34.3%	63.4%	$947	31h 49m	341.7M	3.1M
22	Cursor CLI	Composer 2.5	Adaptive	34.3%	61.1%	$50	22h 1m	94.1M	1.1M
23	Claude Code	Claude Opus 4.8	High	34.3%	56.8%	$446	21h 15m	397.7M	2.2M
24	OpenClaw	GPT-5.4	High	33.6%	57.8%	$69	40h 56m	86.0M	2.4M
25	ALE-Claw	GPT-5.5	High	32.8%	67.4%	$148	14h 57m	167.2M	1.0M
26	Claude Code	GLM-5.2	Max	32.8%	59.1%	$204	32h 45m	187.0M	2.7M
27	Cursor CLI	GPT-5.5	Medium	32.1%	60.8%	$68	30h 10m	51.8M	741.1K
28	Codex	GPT-5.6 Terra	Low	32.1%	60.7%	$34	32h 24m	62.2M	448.1K
29	Cursor CLI	Claude Opus 4.7	High	29.9%	61.2%	$558	33h 26m	110.3M	1.6M
30	Codex	GPT-5.6 Luna	Medium	29.9%	59.7%	$19	18h 46m	90.6M	452.4K
31	Codex	GPT-5.5	Medium	29.9%	59.3%	$110	13h 22m	81.7M	1.1M
32	Droid	GPT-5.5	High	29.9%	58.2%	$106	23h 26m	98.4M	962.6K
33	ALE-Claw	Claude Opus 4.7	High	28.4%	60.5%	$312	20h 40m	359.2M	1.9M
34	Droid	Claude Opus 4.7	High	27.6%	60.2%	$738	16h 6m	176.2M	1.5M
35	Claude Code	Seed 2.1 Pro	-	26.9%	58.9%	$376	66h 30m	697.7M	3.1M
36	OpenClaw	Claude Opus 4.7	High	26.9%	56.5%	$508	47h 42m	201.7M	1.5M
37	Gemini CLI	Gemini 3.1 Pro	High	26.9%	53.5%	$342	27h 40m	239.4M	1.1M
38	OpenClaw	Gemini 3.1 Pro	High	26.1%	48.3%	$575	48h 37m	616.7M	938.9K
39	Codex	GPT-5.5	Low	25.4%	53.5%	$48	10h 16m	28.2M	528.4K
40	Codex	GPT-5.6 Luna	Low	22.4%	47.1%	$11	14h 7m	46.7M	263.7K
41	Claude Code	Claude Opus 4.7	High	20.9%	54.3%	$496	16h 17m	120.4M	1.1M
42	ALE-Claw	GPT-5.4	High	20.9%	44.3%	$66	17h 22m	178.6M	684.8K
43	OpenClaw	GLM-5.1	High	20.1%	45.6%	$108	62h 17m	183.5M	1.6M
44	OpenClaw	DeepSeek V4 Pro	High	19.9%	43.8%	$109	58h 50m	208.2M	1.9M
45	OpenClaw	Qwen3 7-Max	High	17.9%	46.9%	$247	38h 11m	502.0M	7.3M
46	OpenClaw	Seed 2.1 Pro	High	17.9%	46.9%	$201	76h 7m	640.9M	8.9M
47	OpenClaw	Kimi K2.6	High	15.7%	35.1%	$47	61h 6m	118.5M	2.1M
48	OpenClaw	Qwen3 6-Plus	High	12.7%	35.8%	$90	60h 13m	259.7M	2.7M
49	OpenClaw	MiMo V2.5	High	11.9%	35.1%	$13	43h 37m	105.4M	1.5M
50	OpenClaw	MiniMax M2.7	High	10.4%	24.5%	$9	63h 22m	98.2M	1.4M
51	Grok CLI	Grok 4.3	-	9%	30.4%	$127	21h 59m	99.8M	1.0M
52	OpenClaw	Grok 4.3	High	6.7%	24.4%	$35	31h 15m	73.8M	1.2M

Rank	Harness	Model	Effort	Pass Rate	Score	Est. Cost	Runtime	Input Tokens	Output Tokens
1	Codex	GPT-5.6 Sol	XHigh	30%	46.6%	$247	28h 47m	235.4M	1.3M
2	Kimi Code	Kimi K3	Max	29.1%	49.9%	—	55h 44m	465.4M	2.8M
3	Codex	GPT-5.6 Terra	Max	27.3%	45.6%	$150	38h 15m	316.9M	1.8M
4	Codex	GPT-5.6 Sol	Medium	27.3%	43.9%	$157	29h 58m	142.1M	732.3K
5	Codex	GPT-5.6 Sol	High	27.3%	42.4%	$166	29h 53m	153.0M	942.9K
6	Codex	GPT-5.6 Luna	Max	27.3%	40.7%	$107	19h 52m	652.4M	2.2M
7	Codex	GPT-5.6 Luna	XHigh	27.3%	40.4%	$61	22h 35m	351.8M	1.4M
8	Codex	GPT-5.6 Sol	Max	25.5%	44.7%	$300	28h 9m	286.3M	1.8M
9	Claude Code	Claude Fable 5	XHigh	25.5%	42%	$1,826	25h 34m	1.1B	3.8M
10	Codex	GPT-5.6 Luna	High	25.5%	41.9%	$38	10h 25m	202.0M	828.7K
11	Codex	GPT-5.6 Terra	XHigh	24.5%	41.3%	$108	29h 0m	215.8M	1.2M
12	Claude Code	Seed 2.1 Pro	-	24.1%	40.1%	$289	42h 44m	639.5M	3.2M
13	Claude Code	Kimi K3	Max	23.6%	44.9%	$240	64h 56m	1.8B	2.4M
14	Claude Code	Claude Opus 4.8	Max	23.6%	43.1%	$1,324	59h 37m	524.3M	8.2M
15	ALE-Claw	GPT-5.5	High	23.6%	41.1%	$70	8h 28m	54.9M	734.8K
16	Codex	GPT-5.5	High	22.4%	36.7%	$130	20h 57m	130.8M	1.2M
17	Codex	GPT-5.5	XHigh	21.8%	38.7%	$178	29h 36m	142.8M	2.2M
18	Codex	GPT-5.6 Terra	High	20.9%	38.3%	$85	36h 27m	164.1M	940.7K
19	Claude Code	Claude Fable 5	Adaptive	20.9%	34.1%	$608	32h 11m	327.7M	3.7M
20	Claude Code	Claude Opus 4.8	XHigh	20%	38.3%	$856	43h 56m	300.4M	5.6M
21	Cursor CLI	Claude Opus 4.7	High	20%	37.6%	$842	14h 48m	215.6M	1.9M
22	Cursor CLI	GPT-5.5	Medium	20%	32.7%	$50	17h 24m	36.4M	552.1K
23	OpenClaw	GPT-5.4	High	19.4%	34.3%	$123	57h 48m	239.6M	2.9M
24	Codex	GPT-5.6 Sol	Low	19.1%	38.3%	$77	23h 19m	73.1M	437.4K
25	Claude Code	Claude Opus 4.8	High	18.2%	37.8%	$569	26h 8m	698.1M	2.3M
26	Claude Code	GLM-5.2	Max	18.2%	37%	$390	35h 45m	338.8M	2.7M
27	ALE-Claw	Claude Opus 4.7	High	18.2%	36.6%	$246	15h 31m	288.8M	1.7M
28	Codex	GPT-5.6 Terra	Low	18.2%	34.5%	$31	22h 57m	54.5M	398.1K
29	Codex	GPT-5.6 Terra	Medium	18.2%	34%	$44	28h 6m	78.9M	555.2K
30	OpenClaw	GPT-5.5	High	18.2%	32.1%	$101	22h 58m	98.6M	1.1M
31	Cursor CLI	Composer 2.5	Adaptive	18.2%	30.8%	$68	25h 41m	130.0M	1.1M
32	Droid	GPT-5.5	High	16.4%	33.4%	$69	15h 37m	58.6M	792.2K
33	Codex	GPT-5.5	Medium	16.4%	33.3%	$128	26h 53m	102.4M	1.1M
34	Codex	GPT-5.5	Low	16.4%	32.4%	$45	14h 49m	29.5M	425.3K
35	OpenClaw	Seed 2.1 Pro	High	13.5%	23.6%	$171	46h 53m	572.9M	7.7M
36	Claude Code	Claude Opus 4.7	High	12.7%	29.1%	$747	12h 5m	202.0M	1.5M
37	Codex	GPT-5.6 Luna	Medium	12.7%	28.8%	$18	8h 56m	85.6M	355.6K
38	Gemini CLI	Gemini 3.1 Pro	High	12.7%	26.4%	$962	25h 34m	481.7M	1.7M
39	OpenClaw	Claude Opus 4.7	High	10.9%	27.5%	$370	29h 47m	163.0M	1.1M
40	OpenClaw	Qwen3 7-Max	High	10.9%	27.2%	$280	35h 46m	585.1M	7.1M
41	OpenClaw	DeepSeek V4 Pro	High	10.9%	23.8%	$68	38h 12m	114.5M	1.5M
42	OpenClaw	Gemini 3.1 Pro	High	10.9%	23.6%	$1,392	62h 47m	1.6B	1.7M
43	ALE-Claw	GPT-5.4	High	9.1%	22.9%	$179	22h 36m	595.7M	930.0K
44	OpenClaw	GLM-5.1	High	9.1%	21.7%	$165	61h 57m	282.4M	2.2M
45	OpenClaw	MiMo V2.5	High	9.1%	20.8%	$17	34h 51m	159.9M	1.4M
46	OpenClaw	Qwen3 6-Plus	High	8.2%	22.9%	$111	46h 37m	322.1M	3.1M
47	Codex	GPT-5.6 Luna	Low	7.3%	24.9%	$9	5h 32m	36.3M	207.5K
48	Grok CLI	Grok 4.3	-	7.3%	17%	$95	10h 30m	74.0M	896.2K
49	OpenClaw	Kimi K2.6	High	6.4%	18.2%	$41	73h 35m	93.6M	2.4M
50	OpenClaw	Grok 4.3	High	3.6%	12.9%	$25	24h 45m	51.6M	1.0M
51	Droid	Claude Opus 4.7	High	3.6%	10.9%	$167	3h 8m	39.9M	472.1K
52	OpenClaw	MiniMax M2.7	High	3.6%	8.4%	$10	46h 17m	95.2M	1.7M

Rank	Harness	Model	Effort	Pass Rate	Score	Est. Cost	Runtime	Input Tokens	Output Tokens
1	Kimi Code	Kimi K3	Max	10.5%	20.6%	—	91h 53m	748.2M	3.1M
2	Claude Code	Kimi K3	Max	7.9%	20.5%	$333	100h 52m	3.2B	2.6M
3	Claude Code	Claude Fable 5	XHigh	7.9%	16.5%	$2,091	30h 16m	1.6B	3.8M
4	Codex	GPT-5.6 Sol	XHigh	5.3%	19.4%	$325	38h 53m	352.3M	1.4M
5	Codex	GPT-5.6 Sol	High	5.3%	18.7%	$253	39h 4m	297.5M	1.1M
6	Codex	GPT-5.6 Sol	Medium	3.9%	19.4%	$250	32h 36m	294.9M	783.9K
7	Codex	GPT-5.6 Sol	Max	2.6%	16.4%	$488	42h 49m	523.8M	2.1M
8	Codex	GPT-5.6 Luna	Max	2.6%	15%	$197	31h 35m	1.4B	2.6M
9	Codex	GPT-5.6 Terra	Max	2.6%	14.9%	$264	50h 55m	678.1M	2.2M
10	Claude Code	Claude Opus 4.8	Max	2.6%	14.4%	$1,805	84h 44m	961.0M	7.3M
11	Codex	GPT-5.5	Medium	2.6%	14.2%	$167	41h 36m	153.9M	905.0K
12	Claude Code	GLM-5.2	Max	2.6%	12.9%	$638	47h 56m	782.7M	2.7M
13	ALE-Claw	GPT-5.5	High	2.6%	12.8%	$107	13h 44m	126.3M	704.4K
14	Cursor CLI	Claude Opus 4.7	High	2.6%	11.4%	$588	18h 31m	121.0M	1.2M
15	Droid	GPT-5.5	High	2.6%	11.3%	$76	26h 27m	83.6M	551.8K
16	Cursor CLI	GPT-5.5	Medium	2.6%	10.7%	$61	17h 12m	63.9M	456.4K
17	Claude Code	Claude Opus 4.8	XHigh	2.6%	10.4%	$1,395	70h 29m	739.8M	5.9M
18	Codex	GPT-5.5	XHigh	1.3%	16.1%	$283	42h 39m	316.9M	2.2M
19	Codex	GPT-5.5	High	0%	13.4%	$188	33h 14m	216.4M	1.2M
20	Codex	GPT-5.6 Sol	Low	0%	12.6%	$96	30h 32m	111.3M	408.6K
21	Codex	GPT-5.6 Terra	Medium	0%	12.6%	$43	25h 34m	90.6M	486.7K
22	Codex	GPT-5.6 Terra	High	0%	12%	$106	28h 23m	236.9M	967.5K
23	Codex	GPT-5.6 Terra	Low	0%	11.2%	$31	23h 59m	65.9M	335.9K
24	Codex	GPT-5.6 Luna	High	0%	11%	$47	20h 58m	284.9M	772.0K
25	Codex	GPT-5.6 Luna	XHigh	0%	11%	$110	30h 12m	742.9M	1.3M
26	Codex	GPT-5.6 Terra	XHigh	0%	10.9%	$186	33h 22m	449.2M	1.4M
27	OpenClaw	GPT-5.5	High	0%	10.9%	$155	24h 52m	178.2M	997.0K
28	Claude Code	Seed 2.1 Pro	-	0%	10.8%	$319	52h 3m	814.8M	2.8M
29	Codex	GPT-5.6 Luna	Medium	0%	10.4%	$24	15h 42m	132.8M	319.4K
30	OpenClaw	Seed 2.1 Pro	High	0%	10.4%	$85	75h 30m	455.8M	4.8M
31	Claude Code	Claude Opus 4.7	High	0%	10%	$685	17h 6m	170.1M	1.4M
32	Codex	GPT-5.5	Low	0%	9.9%	$64	16h 3m	59.4M	378.0K
33	Codex	GPT-5.6 Luna	Low	0%	8.9%	$9	6h 13m	42.6M	162.5K
34	Cursor CLI	Composer 2.5	Adaptive	0%	8.8%	$78	34h 44m	151.0M	948.3K
35	ALE-Claw	GPT-5.4	High	0%	8.1%	$100	17h 39m	321.4M	540.7K
36	Droid	Claude Opus 4.7	High	0%	8%	$496	10h 18m	145.7M	920.1K
37	ALE-Claw	Claude Opus 4.7	High	0%	7.9%	$594	43h 8m	707.3M	2.2M
38	OpenClaw	GPT-5.4	High	0%	7.3%	$107	36h 16m	206.9M	2.5M
39	Claude Code	Claude Opus 4.8	High	0%	7%	$735	43h 8m	841.1M	2.2M
40	OpenClaw	Qwen3 7-Max	High	0%	6.4%	$173	33h 3m	377.8M	4.1M
41	OpenClaw	GLM-5.1	High	0%	6.2%	$204	57h 10m	354.0M	2.5M
42	Claude Code	Claude Fable 5	Adaptive	0%	5.2%	$838	52h 15m	231.2M	3.3M
43	OpenClaw	Qwen3 6-Plus	High	0%	5%	$78	35h 46m	228.9M	1.9M
44	OpenClaw	Claude Opus 4.7	High	0%	4.3%	$842	50h 0m	452.9M	1.5M
45	OpenClaw	MiMo V2.5	High	0%	3.4%	$15	33h 3m	135.8M	1.3M
46	OpenClaw	Gemini 3.1 Pro	High	0%	3.1%	$1,200	45h 9m	1.4B	1.1M
47	OpenClaw	DeepSeek V4 Pro	High	0%	2.5%	$108	39h 7m	208.8M	1.7M
48	Grok CLI	Grok 4.3	-	0%	2.3%	$71	11h 35m	56.2M	477.3K
49	OpenClaw	Grok 4.3	High	0%	2.3%	$22	25h 4m	47.2M	621.3K
50	OpenClaw	Kimi K2.6	High	0%	1.6%	$39	60h 46m	88.1M	1.8M
51	OpenClaw	MiniMax M2.7	High	0%	1.3%	$9	44h 14m	87.7M	1.5M
52	Gemini CLI	Gemini 3.1 Pro	High	0%	0.9%	$733	40h 2m	497.3M	821.0K

*claude-fable-5: the variant Anthropic served during evaluation may differ from the published model's full capability tier, and re-runs cannot guarantee the higher-tier variant is selected. These numbers may understate the model's true ceiling.

Pass rate vs input tokens

Efficiency frontier for this evaluation split. Up and to the left is better (higher pass rate at lower input-token spend). Several harness/model pairs reach the top of the split at a fraction of the cost of others; some pairs spend an order of magnitude more tokens without a corresponding pass-rate gain.
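The frontier described above can be computed directly from leaderboard rows. The following is a minimal Python sketch, not the plotting code behind this page: `Run` and `efficiency_frontier` are hypothetical names, and the six sample rows are copied from the first leaderboard table on this page, which may not be the split the plot shows. A run is kept only if no other run reaches an equal or higher pass rate with equal or fewer input tokens.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str           # "harness / model / effort" (hypothetical label format)
    pass_rate: float     # percent, as shown in the leaderboard
    input_tokens: float  # absolute token count


def efficiency_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that are not dominated on (input tokens, pass rate).

    Runs are scanned from cheapest to most expensive; a run joins the
    frontier only if it beats the best pass rate seen at lower or equal cost.
    """
    frontier = []
    best_pass_rate = float("-inf")
    ordered = sorted(runs, key=lambda r: (r.input_tokens, -r.pass_rate))
    for run in ordered:
        if run.pass_rate > best_pass_rate:
            frontier.append(run)
            best_pass_rate = run.pass_rate
    return frontier


# Sample rows taken from the first leaderboard table on this page.
RUNS = [
    Run("Codex / GPT-5.6 Sol / XHigh", 30.6, 762.9e6),
    Run("Codex / GPT-5.6 Sol / High", 30.6, 594.0e6),
    Run("Codex / GPT-5.6 Sol / Medium", 29.5, 539.9e6),
    Run("Claude Code / Kimi K3 / Max", 27.0, 6.0e9),
    Run("Codex / GPT-5.6 Sol / Low", 23.6, 252.0e6),
    Run("Codex / GPT-5.5 / Low", 17.1, 109.3e6),
]

for run in efficiency_frontier(RUNS):
    print(f"{run.label}: {run.pass_rate}% at {run.input_tokens / 1e6:.1f}M input tokens")
```

In this sample, the XHigh run drops out because the High run matches its pass rate with fewer input tokens, and the Kimi K3 run drops out because several cheaper runs score higher.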

Industry coverage

Six representative task families across the 55 sub-industries ALE covers.

Motion & VFX

Animation and visual effects production tasks in Adobe After Effects.

3D modeling

3D model creation and editing tasks in Siemens NX.

Game development

Scene setup, asset placement, and rendering tasks in Unreal Engine.

Mold flow analysis

Simulation and mold flow analysis tasks in Moldex3D manufacturing software.

Architectural modeling

3D modeling and energy analysis workflows in Rhino 3D for urban design.

Brain imaging

Neuroimaging analysis and brain structure segmentation tasks in FSLeyes.

Sample tasks

A selection from the 147 public ALE-V1 tasks across 14 task categories. Each task ships with a sandboxed environment, a hidden reference, and a deterministic grader. Slugs link to the task source.

business finance

sec_10k_financial_parsing

Parse a SEC 10-K filing into a structured financial schema. Multi-step extraction, table normalization, and cross-reference validation against the original document.

business finance

financial_stmt_reconstruction_aapl_fy2024

Reconstruct Apple’s FY2024 financial statement from primary disclosure documents. Validates whether the agent surfaces the exact reported figures and footnote-relevant adjustments.

engineering

mold-flow / 220089

Set up a Moldex3D mold-flow simulation, run it to convergence, and report fill time / pressure metrics matching the held-out reference run.

health medicine

Clinical_Variant_Annotation

Annotate a clinical variant set using standard pipelines (VEP, ClinVar, etc.) and produce a report graded against a curated reference.

life sciences

WGS_Variant_Calling

Run a whole-genome sequencing variant-calling pipeline and produce VCF output. Scored on precision and recall against a held-out truth VCF.

computing math

k8s_payment_api_root_cause_analysis

Diagnose a failing payment API in a Kubernetes cluster. Multi-hop investigation across logs, metrics, manifests, and traces, scored on the correct root-cause identification.

visual media

video_storyboard_001

Build a shot-by-shot video storyboard from a brief, formatted to industry conventions. Graded on coverage, continuity, and adherence to the reference shot list.

legal

legal_dr_fees_01

Compute legal fees from a billing register according to jurisdictional rules. Tests structured extraction plus rule-following against an authoritative reference total.

Methodology

Metrics

Pass Rate: the fraction of tasks the agent fully completed (strict success). Score: the average graded outcome across all tasks, including partial credit. Both are computed by deterministic graders against hidden references.
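To make the difference between the two metrics concrete, here is a small Python sketch. It assumes each task yields a graded outcome in the range 0 to 1 plus a strict pass flag; that representation, the `TaskResult` type, and the task names are illustrative assumptions, not the ALE grader's actual output format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    graded: float  # assumed: graded outcome in [0, 1], partial credit allowed
    passed: bool   # assumed: True only when the full success criteria are met


def pass_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks that fully passed."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.passed for r in results) / len(results)


def score(results: Sequence[TaskResult]) -> float:
    """Mean graded outcome across all tasks, including partial credit."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.graded for r in results) / len(results)


# Hypothetical results for four tasks: one full pass, two partial, one failure.
results = [
    TaskResult("task_a", 1.0, True),
    TaskResult("task_b", 0.5, False),
    TaskResult("task_c", 0.75, False),
    TaskResult("task_d", 0.0, False),
]

print(f"Pass Rate: {pass_rate(results):.1%}")
print(f"Score: {score(results):.1%}")
```

With these made-up inputs the agent passes one task in four but earns more than half of the available credit, which is the kind of gap between Pass Rate and Score visible throughout the leaderboard.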

Verifiable Outcomes

Outcomes are checked by deterministic graders against hidden references, not by an LLM-as-a-judge. Tasks are sourced from real professional workflows (After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, FSLeyes, and 49 more applications) and validated by domain experts before inclusion.

Rolling Evaluation

Every ~6 months, a new public subset is released with fresh instances. Private tasks rotate into the public pool, retired public tasks rotate out, and held-out private tasks are used to score the official leaderboard, to limit benchmark leakage.

Reference Harnesses

Two open harnesses ship with the framework: the official Claude Code CLI and the in-tree OpenClaw harness. Submissions also include Codex, Cursor CLI, Droid, Gemini CLI, Grok CLI, and the ALE-Claw reference harness.

Resources

GitHub

Paper

Contribute

Website

Acknowledgments

Agents’ Last Exam is co-led by UC Berkeley RDI and the RDI Foundation, with funding support and contributions from Snorkel AI via the Open Benchmarks Grants program. The benchmark draws task contributions from 300+ industry experts across 44 academic institutions (MIT, Harvard, Stanford, UC Berkeley, Oxford, CMU, Caltech, ETH Zurich, Yale, Columbia, and more) and industry organizations including Goldman Sachs, JPMorgan, Morgan Stanley, PIMCO, Meta, Amazon, Adobe, Oracle, Hippocratic AI, and HubSpot.

The Advisory Committee includes George Em Karniadakis (Brown), Tapio Schneider (Caltech), Teresa Head-Gordon (UC Berkeley), Laure Zanna (NYU), Jack Gallant (UC Berkeley), Tarek Zohdi (UC Berkeley), Ida Sim (UCSF), Arvind Rao (U Michigan), Kaan Ozbay (NYU), Carl Boettiger (UC Berkeley), Kyle Steinfeld (UC Berkeley), Yamini Rangan (HubSpot), and Bradley Rothenberg (nTop).

FAQs

ALE includes work in finance, engineering, healthcare, life sciences, law, media production, architecture, manufacturing, and computing. Tasks include reconstructing financial statements, running mold-flow simulations, annotating clinical variants, diagnosing Kubernetes failures, producing video storyboards, and calculating legal fees under jurisdiction-specific rules.

Many tasks require professional software such as Adobe After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, and FSLeyes. The agent must use the relevant files, data, and tools to create the requested output.

Each task runs inside a sandboxed Windows or Linux environment containing the required software and input data. The agent receives a task description and works independently using its own action loop, tools, memory, and sub-agents. ALE supports workflows that combine terminal commands with graphical computer interaction.

After the run ends, ALE introduces the hidden reference, grades the files or other artifacts produced by the agent, and records the trajectory, logs, screenshots, tool calls, and evaluation result.

Pass Rate is the percentage of tasks the agent completes fully. A task contributes to Pass Rate only when it satisfies the complete success criteria.

Score averages the graded result across all tasks and includes partial credit. An agent may therefore earn a meaningful Score by completing substantial portions of several workflows even when those tasks do not count as full passes. The two metrics separate useful progress from dependable end-to-end completion.

The hidden reference is added to the environment only after the agent finishes, preventing the expected answer or artifact from leaking into the task. The grader can then inspect measurable properties such as extracted figures, simulation outputs, file structure, dimensional accuracy, identified root causes, or agreement with a reference dataset.

This approach keeps evaluation tied to the completed work. It also avoids using an LLM as the final judge, which could introduce grader variability or preferences unrelated to the task requirements.
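As one illustration of a measurable-property check of the kind described above, a deterministic grader for a set-valued output could compare the agent's records with the hidden reference and report precision and recall. The Python sketch below is an assumption about how such a check might look, not ALE's grader: the function name, the record format, and the sample values are all hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    predicted: Set[Hashable], reference: Set[Hashable]
) -> Tuple[float, float]:
    """Compare the agent's records with the reference set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference records
    An empty denominator yields 0.0 rather than raising.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical records: (chromosome, position, ref allele, alt allele).
reference = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr3", 3000, "T", "C"),
}
predicted = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr5", 9999, "A", "T"),  # not in the reference: a false positive
}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Because the comparison is pure set arithmetic, the same inputs always produce the same result, which is the property that separates this style of grading from a model-based judge.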

ALE-V1 is the current public release, containing 147 reference tasks across all 55 targeted sub-industries. It represents a subset of the broader task corpus. Some workflows remain private because they rely on licensed software, restricted data, or held-out evaluation material.

Agents’ Last Exam uses rolling evaluation. Approximately every six months, fresh task instances enter the public set, private tasks rotate into public release, and retired tasks leave the active evaluation. Held-out tasks are used for official scoring, reducing the value of training directly against a fixed public test set.

ALE evaluates the complete agent configuration, including the foundation model, harness, tools, memory, context management, action loop, and graphical or terminal interfaces. The same model can perform differently when paired with another harness or reasoning configuration.

Leaderboard comparisons should therefore treat each row as a model-and-harness system. Pass Rate and Score should also be considered alongside estimated cost, runtime, and token consumption when comparing practical efficiency.
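For readers who want to weigh accuracy against spend themselves, the tab-separated leaderboard rows are straightforward to parse. The Python sketch below is one way to do it, under stated assumptions: the helper names are hypothetical, and "dollars per pass-rate point" is an illustrative derived ratio, not a metric ALE reports. The three sample rows are copied from the first leaderboard table on this page.

```python
import re

TOKEN_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}
MISSING = "\u2014"  # the dash the table shows when no cost estimate is given


def parse_tokens(text):
    """Convert a count such as '762.9M' or '1.1B' to a float."""
    return float(text[:-1]) * TOKEN_SUFFIX[text[-1]]


def parse_runtime_hours(text):
    """Convert a runtime such as '94h 39m' to fractional hours."""
    match = re.fullmatch(r"(\d+)h (\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes = match.groups()
    return int(hours) + int(minutes) / 60


def parse_cost(text):
    """Convert '$1,085' to 1085.0; return None when no cost is reported."""
    if text == MISSING:
        return None
    return float(text.lstrip("$").replace(",", ""))


def parse_row(line):
    fields = line.split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    _rank, harness, model, effort, pass_rate, score, cost, runtime, tok_in, tok_out = fields
    return {
        "system": f"{harness} / {model} / {effort}",
        "pass_rate": float(pass_rate.rstrip("%")),
        "score": float(score.rstrip("%")),
        "cost_usd": parse_cost(cost),
        "runtime_h": parse_runtime_hours(runtime),
        "input_tokens": parse_tokens(tok_in),
        "output_tokens": parse_tokens(tok_out),
    }


def dollars_per_point(row):
    """Illustrative ratio: estimated cost divided by pass rate (in percent)."""
    if row["cost_usd"] is None or row["pass_rate"] == 0:
        return None
    return row["cost_usd"] / row["pass_rate"]


ROWS = [
    "1\tCodex\tGPT-5.6 Sol\tXHigh\t30.6%\t53.6%\t$762\t94h 39m\t762.9M\t3.8M",
    "6\tKimi Code\tKimi K3\tMax\t28.3%\t51.6%\t\u2014\t186h 5m\t1.4B\t7.5M",
    "44\tCodex\tGPT-5.6 Luna\tLow\t11.8%\t30.5%\t$27\t24h 35m\t119.9M\t610.3K",
]

parsed = [parse_row(line) for line in ROWS]
for row in parsed:
    ratio = dollars_per_point(row)
    shown = "n/a" if ratio is None else f"${ratio:.2f}"
    print(f"{row['system']}: {shown} per pass-rate point")
```

A single ratio like this hides a lot (runtime, Score, and the absolute pass rate all matter), so it is best read next to the other columns rather than as a ranking on its own.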

Humanity’s Last Exam evaluates models on difficult, closed-ended academic questions across subjects such as mathematics, science, and the humanities. Agents’ Last Exam evaluates whether an agent can execute professional workflows inside real software environments and produce completed work products.

A system can possess strong academic knowledge while still struggling to manage files, operate specialized tools, recover from execution errors, and satisfy all the requirements of a long professional assignment.

The ALE repository provides an open evaluation framework, public tasks, sandbox provisioning, grading tools, and integrations for multiple agent harnesses. Supported execution options include cloud virtual machines, local containers for a lighter Ubuntu subset, QEMU/KVM, and existing computer-use sandboxes. Researchers can also add a custom agent by implementing an ALE deployer.

Domain experts can propose workflows that are complex, representative of real industry practice, and objectively verifiable. Researchers and engineers can help convert those workflows into reproducible environments and graders.
