Agents’ Last Exam: can AI agents actually do real jobs?

Jun 9, 2026

A large-scale benchmark testing whether AI agents can complete real, economically valuable professional work. Snorkel AI is a contributor and supporter through its Open Benchmarks Grants program.


What is Agents’ Last Exam?

Agents’ Last Exam (ALE) is a large-scale AI agent benchmark that measures whether autonomous agents can complete real professional work, not textbook problems. Instead of scripting an agent step by step, ALE hands a frontier agent a real task on a real machine, lets it work to completion, and scores the artifacts it leaves behind against verifiable success criteria.

Each task is a genuine project that a domain expert has already shipped, converted into a code-graded, fully reproducible test. No human judges are involved, which keeps results objective and directly comparable across very different kinds of work. The benchmark is co-led by UC Berkeley RDI and built with 250+ experts across 100+ institutions.
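To make "code-graded" concrete, here is a minimal sketch of what a grader of this kind could look like. It is not ALE's actual grader: the `Criterion` and `grade` names, the `report.json` artifact, and the rule that a task passes only when every check succeeds (with the score being the fraction of checks met) are all assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Criterion:
    """One verifiable success criterion (hypothetical structure)."""
    name: str
    check: Callable[[Path], bool]  # receives the agent's output directory


def grade(output_dir: Path, criteria: list[Criterion]) -> dict:
    """Run every check against the artifacts the agent left behind.

    Assumption: "passed" means all criteria are met, and "score" is the
    fraction of criteria met. The real benchmark may define both differently.
    """
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    checks = {}
    for criterion in criteria:
        try:
            checks[criterion.name] = bool(criterion.check(output_dir))
        except Exception:
            # A missing or malformed artifact counts as a failed check.
            checks[criterion.name] = False
    met = sum(checks.values())
    return {"passed": met == len(criteria), "score": met / len(criteria), "checks": checks}


def report_exists(out: Path) -> bool:
    return (out / "report.json").is_file()


def total_is_correct(out: Path) -> bool:
    return json.loads((out / "report.json").read_text())["total"] == 42


# Illustrative criteria for an invented task.
CRITERIA = [
    Criterion("report_exists", report_exists),
    Criterion("total_is_correct", total_is_correct),
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Simulate an agent that produced the file but got the number wrong.
        (out / "report.json").write_text(json.dumps({"total": 41}))
        print(grade(out, CRITERIA))
```

Because every check is plain code run against files on disk, two people grading the same output directory get the same result, which is the property the benchmark relies on.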

Why do we need a benchmark like ALE?

Forecasts increasingly claim that AI agents will outperform humans at almost all jobs by 2026 to 2027. Agents’ Last Exam was built to test that claim on real, labor-market-aligned work, and the results are a reality check.

The best-performing agent passes just 26% of tasks overall, and the average full pass rate on the hardest “Last-Exam” tier is only 2.6%. Most agent benchmarks are saturating quickly. ALE moves in the opposite direction by measuring the long-horizon, real-world work that today’s agents still cannot reliably finish.


That gap is exactly the point. The name “Agents’ Last Exam” reflects a simple idea: the day agents saturate ALE is the day they can genuinely power real industries. Worth noting: ALE is a technical instrument that measures task completion, not a direct labor-market forecast. It tracks capability, not economic outcomes.

Until then, it gives researchers a rigorous, reproducible way to see where frontier agents succeed and where they break.

Leaderboard (as of June 9)

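| # | Harness | Model | Pass rate | Score | Total runtime | Total input tokens | Total output tokens |
|---|---|---|---|---|---|---|---|
| 1 | Codex | gpt-5-5 | 25.0% | 43.0% | 80h 37m | 577.0M | 3.8M |
| 2 | Ale Claw | gpt-5-5 | 23.0% | 45.8% | 47h 20m | 334.5M | 2.4M |
| 3 | Openclaw | gpt-5-5 | 21.1% | 41.0% | 92h 51m | 471.1M | 3.3M |
| 4 | Cursor Cli | gpt-5-5 | 20.7% | 39.6% | 82h 13m | 154.2M | 1.7M |
| 5 | Openclaw | gpt-5-4 | 20.5% | 37.3% | 162h 16m | 545.5M | 8.7M |
| 6 | Cursor Cli | composer-2-5 | 20.4% | 38.5% | 249h 59m | 338.8M | 2.9M |
| 7 | Droid | gpt-5-5 | 19.1% | 38.6% | 88h 10m | 243.2M | 2.3M |
| 8 | Ale Claw | claude-opus-4-7 | 18.4% | 40.5% | 87h 54m | 1.4B | 5.7M |
| 9 | Claude Code | claude-opus-4-8 | 15.8% | 37.2% | 451h 15m | 452.0M | 3.8M |
| 10 | Gemini Cli | gemini-3-1-pro-preview | 15.8% | 32.0% | 272h 28m | 1.2B | 3.5M |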

See full leaderboard here.
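For readers who want to slice these numbers themselves, the sketch below loads the ten rows shown in this post and compares them column by column. The values are copied from the June 9 leaderboard above; the helper names `parse_hours`, `parse_count`, and `label` are invented for this example, and the suffix handling assumes "M" means millions and "B" means billions.

```python
import re

# (harness, model, pass rate %, score %, total runtime, input tokens, output tokens)
ROWS = [
    ("Codex", "gpt-5-5", 25.0, 43.0, "80h 37m", "577.0M", "3.8M"),
    ("Ale Claw", "gpt-5-5", 23.0, 45.8, "47h 20m", "334.5M", "2.4M"),
    ("Openclaw", "gpt-5-5", 21.1, 41.0, "92h 51m", "471.1M", "3.3M"),
    ("Cursor Cli", "gpt-5-5", 20.7, 39.6, "82h 13m", "154.2M", "1.7M"),
    ("Openclaw", "gpt-5-4", 20.5, 37.3, "162h 16m", "545.5M", "8.7M"),
    ("Cursor Cli", "composer-2-5", 20.4, 38.5, "249h 59m", "338.8M", "2.9M"),
    ("Droid", "gpt-5-5", 19.1, 38.6, "88h 10m", "243.2M", "2.3M"),
    ("Ale Claw", "claude-opus-4-7", 18.4, 40.5, "87h 54m", "1.4B", "5.7M"),
    ("Claude Code", "claude-opus-4-8", 15.8, 37.2, "451h 15m", "452.0M", "3.8M"),
    ("Gemini Cli", "gemini-3-1-pro-preview", 15.8, 32.0, "272h 28m", "1.2B", "3.5M"),
]

_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}  # assumed meaning of the suffixes


def parse_hours(text: str) -> float:
    """Convert a runtime such as '80h 37m' into fractional hours."""
    match = re.fullmatch(r"(\d+)h (\d+)m", text)
    if match is None:
        raise ValueError(f"unexpected runtime format: {text!r}")
    return int(match.group(1)) + int(match.group(2)) / 60


def parse_count(text: str) -> float:
    """Convert a token count such as '577.0M' or '1.4B' into a number."""
    return float(text[:-1]) * _SUFFIX[text[-1]]


def label(row: tuple) -> str:
    return f"{row[0]} ({row[1]})"


if __name__ == "__main__":
    print("Highest pass rate:  ", label(max(ROWS, key=lambda r: r[2])))
    print("Highest score:      ", label(max(ROWS, key=lambda r: r[3])))
    print("Shortest runtime:   ", label(min(ROWS, key=lambda r: parse_hours(r[4]))))
    print("Fewest input tokens:", label(min(ROWS, key=lambda r: parse_count(r[5]))))
```

On these ten rows the leader changes with the column: Codex (gpt-5-5) has the highest pass rate, Ale Claw (gpt-5-5) has the highest score and the shortest total runtime, and Cursor Cli (gpt-5-5) uses the fewest input tokens.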

Domain coverage

ALE spans 55 non-physical sub-industries, grounded in the O*NET/SOC 2018 federal occupational taxonomy, covering most major categories of professional work performed on a computer.

| Field | Example domains |
|---|---|
| Business & Finance | Quant finance & trading, accounting & risk modeling, sales & marketing, HR management |
| Engineering & Robotics | Aerospace & mechanical engineering, civil & geospatial, robotics & autonomous systems |
| Life Sciences & Medicine | Genomics & sequence analysis, cell & imaging biology, clinical informatics, neuroimaging |
| Physical Sciences | Physics, chemistry & astrophysics, earth & atmospheric sciences, materials design |
| Computing & Math | Software engineering, data & analytics, AI engineering, cybersecurity, quantum computing |
| Media, Arts & Design | Motion & VFX, 3D animation, graphic & visual design, fashion & apparel |
| Legal, Education & Social | Legal research, educational technology, library & information science, social sciences |

How ALE evaluates agents

ALE targets the Generalist Computer-Use Agent (GCUA): a system given full access to both a graphical interface (GUI) and a command line (CLI). The benchmark does not constrain how an agent solves a task. Whatever a human could do on a computer (click, type, script, browse, automate), the agent is free to do, and it is judged on the result rather than the method.

Tasks ship across three difficulty tiers (near-term, full-spectrum, and last-exam), plus an unlicensed track and ALE-CLI, a terminal-only subset for teams running command-line agents.

Current leaderboard (harness + flagship model): Codex (gpt-5-5) leads at roughly 26% overall, followed by Cursor (composer-2-5) and Claude Code (opus-4-8). Nobody is close to passing, and the exam is still open.

Why contribute?

  • See how agents handle your work: Get direct insight into how today’s frontier agents perform on the real workflows in your field, and exactly where they fall short.
  • Earn co-authorship: Qualifying contributors with merged tasks receive co-authorship credit on the ALE research manuscript.
  • Monetary awards: High-impact contributions are recognized with monetary awards from a $100K+ funding pool.

What makes a strong ALE task

ALE collects industry production tasks executed with professional-grade tools and software, not simplified chat interactions. Strong tasks meet three criteria:

  1. Complex. The task takes experts days, not minutes, and demands substantial domain expertise. (Good: move a cheetah into an Olympic race video end to end. Too simple: apply a color filter in DaVinci Resolve.)
  2. Representative. The workflow reflects how the job is really done, using the industry-standard tool for it (for example, building a 3D model in SolidWorks or Rhino, not AutoCAD).
  3. Verifiable. Outputs are deterministic or scored against a clear rubric, so results can be checked objectively. Open-ended work with infinite valid outputs is not suitable.

Every submission is defined by five components: task description, input, software, output, and evaluation.
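One way to picture those five components is as a small record that a reviewer can check for completeness. The sketch below is only an illustration: the `TaskSpec` class, its field names, and the file names in the example are assumptions, not the format ALE's contribution form actually uses.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for the five components of a submission."""
    description: str      # what the agent is asked to deliver
    inputs: list[str]     # files or data the agent starts from
    software: list[str]   # professional tools the workflow depends on
    outputs: list[str]    # artifacts the agent must leave behind
    evaluation: str       # how the outputs are checked by code

    def missing_components(self) -> list[str]:
        """Return the names of components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# An invented draft, loosely based on the video example above,
# with the evaluation component not yet written.
draft = TaskSpec(
    description="Composite a running cheetah into a race video, end to end.",
    inputs=["race_footage.mp4", "cheetah_plate.mp4"],
    software=["DaVinci Resolve"],
    outputs=["final_composite.mp4"],
    evaluation="",
)

if __name__ == "__main__":
    print(draft.missing_components())  # ['evaluation']
```

A draft like this one would not yet meet the "verifiable" criterion, because nothing says how the final video is scored.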

How to contribute

  1. Explore. Review sample tasks and the FAQ to understand the task format and quality bar.
  2. Propose. Submit your workflow idea through the contribution form. Domain experts can contribute workflow data with no coding required; researchers and engineers can build full, reproducible tasks.
  3. Build & merge. Approved ideas are turned into code-graded tasks, reviewed for difficulty and quality, and merged into the official benchmark. Frontier agents are then run against merged tasks to calibrate difficulty before release.

Snorkel AI’s role

Snorkel AI is proud to support Agents’ Last Exam both as a contributor, with Snorkel researchers among the co-authors, and through its Open Benchmarks Grants program, which funds open-source AI research and rigorous, reproducible evaluation. Real-world evaluation is the bottleneck for trustworthy AI agents, and ALE is exactly the kind of open benchmark that moves the field forward.

This builds on Snorkel’s broader work in data-centric AI research and agent evaluation, including contributions to benchmarks such as Terminal-Bench Science and JudgmentBench.
