Agents’ Last Exam: can AI agents actually do real jobs?

Jun 9, 2026

A large-scale benchmark testing whether AI agents can complete real, economically valuable professional work. Snorkel AI is a contributor and supporter through its Open Benchmarks Grants program.

2.6%: Top-agent pass rate on the hardest tier
55: Non-physical industries covered
1,500+: Real, code-graded tasks
250+: Experts across 100+ institutions

What is Agents’ Last Exam?

Agents’ Last Exam (ALE) is a large-scale AI agent benchmark that measures whether autonomous agents can complete real professional work, not textbook problems. Instead of scripting an agent step by step, ALE hands a frontier agent a real task on a real machine, lets it work to completion, and scores the artifacts it leaves behind against verifiable success criteria.

Each task is a genuine project that a domain expert has already shipped, converted into a code-graded, fully reproducible test. There are no human judges: every deliverable is checked by code, which keeps results objective and directly comparable across very different kinds of work. The benchmark is co-led by UC Berkeley RDI and built with 250+ experts across 100+ institutions.
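To make "code-graded" concrete, here is a minimal sketch of what such a check could look like. It is not ALE's actual grader: the `Criterion` type, the `grade` function, the file names, and the expected total are all hypothetical, and treating the score as the fraction of criteria met (with a full pass requiring all of them) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
import json
import tempfile


@dataclass
class Criterion:
    """One verifiable success criterion (hypothetical structure)."""
    name: str
    check: Callable[[Path], bool]  # receives the agent's output directory


def grade(workdir: Path, criteria: list[Criterion]) -> dict:
    """Run every check against the artifacts the agent left behind."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    results = {}
    for c in criteria:
        try:
            results[c.name] = bool(c.check(workdir))
        except Exception:
            # A missing or malformed artifact counts as a failed criterion.
            results[c.name] = False
    met = sum(results.values())
    return {
        "results": results,
        "score": met / len(criteria),    # assumption: fraction of criteria met
        "passed": met == len(criteria),  # assumption: a full pass needs all of them
    }


def report_exists(d: Path) -> bool:
    return (d / "report.json").is_file()


def total_matches(d: Path) -> bool:
    # Hypothetical expected value for an illustrative accounting task.
    data = json.loads((d / "report.json").read_text())
    return abs(data["total"] - 1250.0) < 1e-6


def chart_exists(d: Path) -> bool:
    return (d / "chart.png").is_file()


CRITERIA = [
    Criterion("report_exists", report_exists),
    Criterion("total_matches", total_matches),
    Criterion("chart_exists", chart_exists),
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Simulate an agent that wrote the report but never produced the chart.
        (out / "report.json").write_text(json.dumps({"total": 1250.0}))
        print(grade(out, CRITERIA))
```

In this sketch the simulated agent meets two of three criteria, so it earns partial credit but not a full pass. Because every check is a plain function over files on disk, rerunning the grader on the same artifacts always gives the same verdict.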

Why do we need a benchmark like ALE?

Forecasts increasingly claim that AI agents will outperform humans at almost all jobs by 2026 to 2027. Agents’ Last Exam was built to test that claim on real, labor-market-aligned work, and the results are a reality check.

The best-performing agent passes only about a quarter of tasks overall (25.0% on the June 9 leaderboard), and the average full pass rate on the hardest “Last-Exam” tier is only 2.6%. Most agent benchmarks are saturating quickly. ALE moves in the opposite direction by measuring the long-horizon, real-world work that today’s agents still cannot reliably finish.

That gap is exactly the point. The name “Agents’ Last Exam” reflects a simple idea: the day agents saturate ALE is the day they can genuinely power real industries. Note, however, that ALE is a technical instrument that measures task completion, not a direct labor-market forecast: it tracks capability, not economic outcomes.

Until then, it gives researchers a rigorous, reproducible way to see where frontier agents succeed and where they break.

Leaderboard (as of June 9)

#	Harness	Model	Pass rate	Score	Total Runtime	Total input tokens	Total output tokens
1	Codex	gpt-5-5	25.0%	43.0%	80h 37m	577.0M	3.8M
2	Ale Claw	gpt-5-5	23.0%	45.8%	47h 20m	334.5M	2.4M
3	Openclaw	gpt-5-5	21.1%	41.0%	92h 51m	471.1M	3.3M
4	Cursor CLI	gpt-5-5	20.7%	39.6%	82h 13m	154.2M	1.7M
5	Openclaw	gpt-5-4	20.5%	37.3%	162h 16m	545.5M	8.7M
6	Cursor CLI	composer-2-5	20.4%	38.5%	249h 59m	338.8M	2.9M
7	Droid	gpt-5-5	19.1%	38.6%	88h 10m	243.2M	2.3M
8	Ale Claw	claude-opus-4-7	18.4%	40.5%	87h 54m	1.4B	5.7M
9	Claude Code	claude-opus-4-8	15.8%	37.2%	451h 15m	452.0M	3.8M
10	Gemini CLI	gemini-3-1-pro-preview	15.8%	32.0%	272h 28m	1.2B	3.5M

See full leaderboard here.
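The runtime and token columns make it possible to compare cost as well as pass rate. The sketch below parses a few rows copied from the table and divides runtime and input tokens by pass rate. The "per pass-rate point" figures are an illustrative ratio chosen here, not an official ALE metric, and the helper names are hypothetical.

```python
import re

# (harness, model, pass rate %, total runtime, input tokens, output tokens),
# copied from the June 9 leaderboard rows above.
ROWS = [
    ("Codex", "gpt-5-5", 25.0, "80h 37m", "577.0M", "3.8M"),
    ("Ale Claw", "gpt-5-5", 23.0, "47h 20m", "334.5M", "2.4M"),
    ("Droid", "gpt-5-5", 19.1, "88h 10m", "243.2M", "2.3M"),
    ("Ale Claw", "claude-opus-4-7", 18.4, "87h 54m", "1.4B", "5.7M"),
]

SUFFIX = {"M": 1e6, "B": 1e9}


def parse_tokens(text: str) -> float:
    """Convert a value such as '577.0M' or '1.4B' into a raw token count."""
    return float(text[:-1]) * SUFFIX[text[-1]]


def parse_hours(text: str) -> float:
    """Convert a value such as '80h 37m' into fractional hours."""
    match = re.fullmatch(r"(\d+)h (\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognized runtime: {text!r}")
    return int(match.group(1)) + int(match.group(2)) / 60


for harness, model, pass_rate, runtime, tokens_in, _tokens_out in ROWS:
    hours_per_point = parse_hours(runtime) / pass_rate
    mtok_per_point = parse_tokens(tokens_in) / 1e6 / pass_rate
    print(
        f"{harness} ({model}): {hours_per_point:.2f} h and "
        f"{mtok_per_point:.1f}M input tokens per pass-rate point"
    )
```

Read this way, two harnesses running the same model can land at similar pass rates with very different runtime and token budgets, which is worth keeping in mind when comparing leaderboard positions.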

Domain coverage

ALE spans 55 non-physical sub-industries, grounded in the O*NET/SOC 2018 federal occupational taxonomy, covering most major categories of professional work performed on a computer.

Field	Example domains
Business & Finance	Quant finance & trading, accounting & risk modeling, sales & marketing, HR management
Engineering & Robotics	Aerospace & mechanical engineering, civil & geospatial, robotics & autonomous systems
Life Sciences & Medicine	Genomics & sequence analysis, cell & imaging biology, clinical informatics, neuroimaging
Physical Sciences	Physics, chemistry & astrophysics, earth & atmospheric sciences, materials design
Computing & Math	Software engineering, data & analytics, AI engineering, cybersecurity, quantum computing
Media, Arts & Design	Motion & VFX, 3D animation, graphic & visual design, fashion & apparel
Legal, Education & Social	Legal research, educational technology, library & information science, social sciences

How ALE evaluates agents

ALE targets the Generalist Computer-Use Agent (GCUA): a system given full access to both a graphical interface (GUI) and a command line (CLI). The benchmark does not constrain how an agent solves a task. Whatever a human could do on a computer (click, type, script, browse, automate), the agent is free to do, and it is judged on the result rather than the method.

Tasks ship across three difficulty tiers (near-term, full-spectrum, and last-exam), plus an unlicensed track and ALE-CLI, a terminal-only subset for teams running command-line agents.

Current leaderboard (harness + flagship model): Codex (gpt-5-5) leads at 25.0% overall, followed by Cursor CLI (composer-2-5) at 20.4% and Claude Code (claude-opus-4-8) at 15.8%. No agent is close to passing, and the exam is still open.

Why contribute?

See how agents handle your work: Get direct insight into how today’s frontier agents perform on the real workflows in your field, and exactly where they fall short.
Earn co-authorship: Qualifying contributors with merged tasks receive co-authorship credit on the ALE research manuscript.
Monetary awards: High-impact contributions are recognized with monetary awards from a $100K+ funding pool.

What makes a strong ALE task

ALE collects industry production tasks executed with professional-grade tools and software, not simplified chat interactions. Strong tasks meet three criteria:

Complex. The task takes experts days, not minutes, and demands substantial domain expertise. (Good: composite a cheetah into an Olympic race video, end to end. Too simple: apply a color filter in DaVinci Resolve.)
Representative. The workflow reflects how the job is really done, using the industry-standard tool for it (for example, building a 3D model in SolidWorks or Rhino, not AutoCAD).
Verifiable. Outputs are deterministic or scored against a clear rubric, so results can be checked objectively. Open-ended work with infinite valid outputs is not suitable.

Every submission is defined by five components: task description, input, software, output, and evaluation.
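As a rough illustration, the five components could be captured in a small record like the one below. This is a sketch only: the class name, field names, and example values are hypothetical and do not reflect ALE's actual submission schema.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskSubmission:
    """Hypothetical record mirroring the five components of a submission."""
    task_description: str  # what the agent is asked to deliver
    inputs: list[str]      # files or data handed to the agent
    software: list[str]    # professional tools the workflow relies on
    outputs: list[str]     # artifacts the agent must produce
    evaluation: str        # how the outputs are checked

    def missing_components(self) -> list[str]:
        """Return the names of any components left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; not taken from a real ALE task.
draft = TaskSubmission(
    task_description="Model a mounting bracket from the supplied drawing",
    inputs=["bracket_drawing.pdf"],
    software=["SolidWorks"],
    outputs=["bracket.step"],
    evaluation="",  # not written yet
)

print(draft.missing_components())
```

A check like `missing_components` is a cheap way to catch incomplete proposals early. Here the draft lacks an evaluation, which the "Verifiable" criterion above would require before the task could be merged.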

How to contribute

Explore. Review sample tasks and the FAQ to understand the task format and quality bar.
Propose. Submit your workflow idea through the contribution form. Domain experts can contribute workflow data with no coding required; researchers and engineers can build full, reproducible tasks.
Build & merge. Approved ideas are turned into code-graded tasks, reviewed for difficulty and quality, and merged into the official benchmark. Frontier agents are then run against merged tasks to calibrate difficulty before release.

Snorkel AI’s role

Snorkel AI is proud to support Agents’ Last Exam both as a contributor, with Snorkel researchers among the co-authors, and through its Open Benchmarks Grants program, which funds open-source AI research and rigorous, reproducible evaluation. Real-world evaluation is the bottleneck for trustworthy AI agents, and ALE is exactly the kind of open benchmark that moves the field forward.

This builds on Snorkel’s broader work in data-centric AI research and agent evaluation, including contributions to benchmarks such as Terminal-Bench Science and JudgmentBench.
