Agents’ Last Exam: AI Benchmarking for Real Work

June 30, 2026 · 13 min read · Snorkel Team

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can today’s AI agents actually do the work that matters economically? The answer so far — with the hardest tier averaging below 1% full pass rate across frontier models — suggests we’re further from that bar than leaderboards imply.

Transcript

Lightly edited for readability.

Thank you, Jason. I appreciate that Snorkel provided this opportunity to share our recent work. If you have questions, I’m happy to connect after the presentation as well.

So we call this Agents’ Last Exam. People know “Humanity’s Last Exam,” and we borrowed the name. I’m actually a little regretful about that. I think it should have been called Agents’ First Exam, because I feel it’s a first genuine attempt to benchmark real workflows that experts actually perform. It’s definitely not the last. If I had another chance, I’d use “first.” But for now, we’re stuck with it.

Our one central question is very simple: can agents really do economically valuable work in the real world? We make it testable at the level of professional workflows — not just question answering. Today I’ll cover three things: what a benchmark should measure, how to turn expert workflows into scalable and reliable evaluations, and what current results do and don’t show about agent readiness.

The history and importance of benchmarks

A benchmark is not just a number on a leaderboard. When you have a verifiable, widely-used evaluation, it sends a clear signal to the whole field — failures become visible, comparisons become meaningful, and iterations speed up. The curves from the past few years in math, coding, and reasoning show why benchmark design matters: we optimize what we measure.

The question now is whether agent benchmarks can actually measure the work we want agents to do. Anthropic’s chief has said AI will surpass almost all humans at almost everything shortly after 2027. Whether or not you trust that timeline, expectations have shifted. Agents are increasingly discussed as systems for professional work, not just question answering or mathematical reasoning. That raises the evaluation bar: we want benchmarks that cover complete software-mediated workflows and assess the quality of the resulting artifacts.

Here’s the coverage problem. Existing agent benchmarks concentrate on computing and mathematical domains, which represent only about 7% of the U.S. job market. The remaining 93% is not all automatable — much of it is physical or outside agent scope — but the gap in software-mediated workflows across management, finance, law, and engineering remains largely unexplored.

This matters because benchmarks lead the field in a specific loop: once you have a well-defined benchmark, frontier labs follow it, curate training data to improve on it, announce state-of-the-art results, and the cycle continues. People complain about benchmark gaming — and they’re right that a high score doesn’t always mean the model is good. But that’s not a reason to abandon benchmarks. It’s a reason to build better ones. If a benchmark genuinely reflects real professional workflows, I don’t mind if it gets saturated. Saturation would mean agents can actually do that work.

What should an agent benchmark measure?

To answer the coverage question, we started with an external map of professional work: the 2018 U.S. Standard Occupational Classification, which lists 867 detailed occupations across the economy. We then identified computer-centered workflows within that map — filtering out jobs like driving that are outside agent scope — and added two frontier fields on the AI side that the 2018 taxonomy predates. The result is 55 subfields across 13 industry clusters.

Let me walk through a single example to make this concrete. Manufacturing was completely absent from prior benchmarks, despite representing a huge share of GDP, particularly in regions like China. We worked directly with manufacturers and got real workflows and real data. One workflow involves turning a 2D blueprint into a 3D model, running a mold flow simulation to predict where a plastic iPhone shell might fail under heat, and then generating CNC toolpaths to cut the piece without collision. If an agent can reliably automate that pipeline, the whole industry transforms. As of now, no agent scores above 10% on those tasks, and the full pass rate is zero.

What makes ALE different

There are three core differentiators.

Scope. We have 300+ domain experts contributing real daily tasks, not synthetic scenarios. The benchmark currently covers 1,500 tasks across 55 subfields, significantly broader than SWE-bench or OSWorld, which cover around 5 subfields each.

Generalist Computer-Use Agents (GCUA). People often equate computer-use agents with GUI agents. I disagree. Computer use should encompass both CLI and GUI operation. We coined the term “Generalist Computer-Use Agent” to capture that broader scope — and the recent Claude Code and Codex technical reports have started adopting this framing. We give agents a full environment — VM, Docker containers, GUI access, terminal — and evaluate only on the outcome. We don’t care whether you used a GUI or the command line. An interesting finding: for some biology tasks, agents skip the 3D brain viewer entirely, go straight to the files via CLI, and still pass. That’s fine. What matters is the output.

Verifiability. I’m skeptical of LLM-as-judge for subjective evaluations. When you ask a model to judge something truly subjective, agreement is low and reliability suffers. So we use deterministic code verifiers wherever possible. For cases where verifiability seems hard, we’ve found two approaches that work well. First, redesign the task to be verifiable — instead of asking for an original RPG game, ask the agent to reproduce a specific existing flash game in a new engine, then compare game state frame by frame. Second, for tasks with ranges of valid answers, we define an acceptance window, and anything within that range scores full credit.
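A minimal sketch of the two verifier styles described above, assuming game states are recorded as one dictionary per frame and numeric answers come with an expert-supplied range. The names `AcceptanceWindow` and `frames_match` are hypothetical and are not taken from the ALE codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceWindow:
    """Closed interval of values an expert considers correct (hypothetical type)."""
    low: float
    high: float

    def score(self, value: float) -> float:
        # Full credit anywhere inside the window, no credit outside it.
        return 1.0 if self.low <= value <= self.high else 0.0


def frames_match(expected: list[dict], actual: list[dict]) -> bool:
    """Deterministic frame-by-frame comparison of recorded game states."""
    return len(expected) == len(actual) and all(
        e == a for e, a in zip(expected, actual)
    )


# Illustrative values only: a window around an expert reference of 10.0.
window = AcceptanceWindow(low=9.5, high=10.5)
in_range = window.score(10.2)
out_of_range = window.score(11.0)
```

Both checks are plain code, so the same output always receives the same score, which is the property an LLM judge does not guarantee.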

How we build tasks

The hard part isn’t collecting prompts. It’s translating professional intent into a benchmark interface that preserves authenticity and remains evaluable. We start from actual work that experts have done in their daily lives — real blueprints from real factories, real datasets from real research labs.

When working with experts, we apply three criteria for admitting a task:

Representativeness. The workflow should match real professional practice. For example, structural engineers today use Rhino or SolidWorks — not AutoCAD — to convert 2D prints to 3D models. Tasks must reflect what people actually do, not a five-year-old assumption.
Complexity. We reject workflows that require only one or two clicks. We want end-to-end deliverables — tasks that would take an expert meaningful time and require combining multiple steps.
Verifiability. We design every task so a deterministic code verifier can score the outcome.

Our submission pipeline works like an academic peer review system. Experts submit a package: the task description, input files, tools used, expected deliverables, and evaluation criteria. Engineers implement the task and stress-test it. An AI-assisted first-pass review filters low-quality submissions, followed by human engineer review and, finally, expert peer review. Because I didn’t want to discourage contributors, I renamed “reject” to “major revision” and “minor reject” to “minor revision.” People feel better about a revision than a rejection, and the quality has stayed high.

We also designed ALE as a living benchmark. To prevent overfitting, we rotate the task pool every six months. If you overfit on the current batch, the next batch will expose that. We’re still accepting new task submissions at agents-last-exam.org/submit.

Evaluation setup

Each task follows a controlled lifecycle: load environment, run agent, collect output, score against reference. Reference solutions are stored in an isolated location that agents cannot access, preventing leakage.

Scoring works in two modes. For most tasks, we evaluate the final deliverable — comparing output files, screenshots, or game states against expert references. For tasks where partial credit is meaningful, we evaluate intermediate milestones. For instance, if an agent must download a target file as step one, we check for that file and award partial credit even if later steps fail.
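The milestone mode can be sketched as follows. This is an illustration under stated assumptions, not the ALE scorer: each milestone is modeled as a named check over the agent's output directory, and all milestones are weighted equally. The function name, the file names, and the equal weighting are all hypothetical.

```python
from pathlib import Path
from typing import Callable

# A milestone is a named check over the agent's output directory (hypothetical shape).
Milestone = tuple[str, Callable[[Path], bool]]


def score_milestones(output_dir: Path, milestones: list[Milestone]) -> float:
    """Return the fraction of milestones satisfied, in [0, 1]."""
    if not milestones:
        return 0.0
    passed = sum(1 for _name, check in milestones if check(output_dir))
    return passed / len(milestones)


# Example: step one is downloading a target file, step two is producing a report.
# The file names are placeholders chosen for illustration.
example_milestones: list[Milestone] = [
    ("target file downloaded", lambda d: (d / "target.csv").exists()),
    ("report produced", lambda d: (d / "report.pdf").exists()),
]
```

An agent that downloads the file but fails later steps would still receive credit for the first milestone, matching the partial-credit behavior described above.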

To make the GCUA framing concrete, we think of it like a human body. The brain is the backbone model. The eyes are GUI perception. The hands are tool use, API calls, and coding. The body requires a full runtime — VMs, Docker, orchestration. Earlier CLI-only agents lack eyes; GUI-only agents have limited tool support. We provide both and let agents use whichever approach achieves the goal.

Results

The main leaderboard uses three columns: harness, backbone model, and thinking budget. Agents are free to use any combination.

Across all mainstream configurations, the average full pass rate on ALE’s hardest tier remains below 1%. GPT-5.5 with extended effort and Claude Code are currently at the top overall. Claude Max achieves the best absolute score we’ve recorded — but it cost $4,000 to evaluate on 100 tasks, which my professor was not happy about. There’s a real cost-efficiency dimension here that people shouldn’t ignore.

A few notable findings:

No single agent wins everywhere. Claude 4.5 leads on agentic coding benchmarks, but on life science and visual/media tasks in ALE, it falls significantly behind GPT-5.5 and Claude 3.7. Part of this is that some Claude versions had safety-related refusals that required multiple retries; another part is domain-specific capability gaps. The point is that single leaderboard numbers averaged across everything obscure real variation at the subdomain level.

Model choice matters more than harness. Fixing the harness and swapping the model changes scores by up to 18 points. Fixing the model and swapping between reasonable harnesses (OpenHands, Claude Code, Cursor, our own minimal KALE implementation) changes them much less. As long as a harness has a basic orchestration layer and reasonable tool access, and isn’t missing obvious capabilities, performance is roughly similar. You can cut costs 41% by using a well-designed lightweight harness with no meaningful performance penalty.

GUI will not die. I posed a question to the audience: do you think GUIs will be irrelevant to agents in five years? A lot of people think yes, because CLI is faster. But consider this: game companies aren’t going to expose their internal APIs. Proprietary enterprise software often exposes an incomplete CLI or none at all. And GUI speed relative to CLI is improving. Right now GUI is about 50x slower than CLI; if it gets to 5x slower, it already covers many real-world scenarios. I still think GUI remains essential to genuine generalist computer use.

The “last exam” framing is intentional. “Last” carries two meanings: a competence threshold (passing means you can do the real professional work, not just answer questions about it) and a difficulty frontier (it sits at the boundary of what current systems can reliably accomplish). When agents can saturate ALE, that saturation will mean something real — not just benchmark gaming, but genuine economic capability.

Q&A

Q: Do you have quantitative evidence that ALE’s rankings are consistent with other benchmarks? Not just for the top models, but across the middle tier?

Looking at the leaderboard, the top tier — GPT-5.5 and Claude Code — is consistent with what people expect. The second tier, models like Kimi and DeepSeek, also broadly aligns. Inside each tier there’s differentiation: models excel in different domains. GPT-5.5 leads in visual and media; Claude variants have had weaker life science scores. The single number is an average — what’s more useful is the subdomain breakdown, which tells you where to invest. We also share trajectories with the labs directly, because trajectory-level data is how they diagnose where models fail and how to improve.

Q: How do you handle tasks that have multiple correct solutions? CNC toolpath generation, for example, is NP-hard — there are many valid paths.

For deterministic tasks like game reproduction, there’s one correct answer. For numerical tasks, experts provide an acceptance range, and anything within that range passes. For structural tasks like G-code generation, we check the output against a set of constraints — no collisions, final shape matches reference — rather than comparing to a single canonical path. Multiple valid trajectories can all pass. We also maintain a pool of acceptable answers for tasks with genuinely discrete multiple correct solutions.
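As a rough illustration of constraint-based checking and answer pools, the sketch below assumes a simulator has already reduced a candidate toolpath to a small summary dictionary. The field names, the 0.05 mm tolerance, and both function names are hypothetical and are used only to show the shape of the check.

```python
from typing import Callable, Iterable

# A constraint is any deterministic predicate over a candidate result (hypothetical shape).
Constraint = Callable[[dict], bool]


def passes_constraints(candidate: dict, constraints: Iterable[Constraint]) -> bool:
    """Pass when every constraint holds, regardless of which path produced the result."""
    return all(check(candidate) for check in constraints)


def in_answer_pool(answer: str, pool: set[str]) -> bool:
    """Pass when the answer matches any entry in a pool of accepted answers."""
    normalized = answer.strip().lower()
    return normalized in {entry.strip().lower() for entry in pool}


# Hypothetical summary a toolpath simulator might emit for one candidate.
candidate = {"collisions": 0, "max_shape_error_mm": 0.02}
constraints = [
    lambda c: c["collisions"] == 0,             # no collisions
    lambda c: c["max_shape_error_mm"] <= 0.05,  # shape matches reference (assumed tolerance)
]
result = passes_constraints(candidate, constraints)
```

Because the check never compares against one canonical path, two different toolpaths that both satisfy the constraints both pass.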

Q: Can you train a LoRA or fine-tune a model on your tasks to improve performance?

Yes, and people will. But our rolling benchmark with six-month rotation makes that fragile. If you overfit on the current task batch, your score on the next batch won’t hold. The harder question is whether fine-tuning on our verifier-backed tasks generalizes to real workflows — and I think that’s actually the most interesting research direction here.

Q: Where do agents fail most? Does your evaluation capture how they fail, not just whether they pass?

Partially. We evaluate primarily on outcome, but for tasks with intermediate milestones we also score those checkpoints — downloading a required file, reaching a specified game state. We don’t score how an agent reaches a goal (GUI vs. CLI), only whether the outcome matches the reference. For diagnostic purposes, we share full trajectories with labs so they can analyze failure modes in detail.

Q: When do we stop building harder benchmarks? If agents get to 95% on daily tasks, do we need ALE?

The history of math benchmarks is instructive. Once agents could solve IMO-level problems, the field shifted toward measuring practical reliability rather than just difficulty ceiling. I think the same will happen here. For now, the important thing is that agents are nowhere near reliable on the professional workflows ALE covers. When they are, ALE’s value shifts from measuring headroom to confirming deployment readiness. Future versions will likely specialize by domain as the generalist problem gets solved.

Q: What’s the plan for ALE v2?

The living benchmark design means ALE is already continuously updated — new tasks come in, the batch rotates every six months. What I’d call “ALE v2” thinking is about how the benchmark evolves as agents improve: moving toward even longer-horizon tasks, multi-agent workflows, and domains where the economy most needs reliable automation.

Yiyou Sun is a researcher at UC Berkeley’s Center for Responsible and Decentralized Intelligence. You can find him on X (@YiyouSun) and LinkedIn. David (Xinyang) Han is also a core contributor and available on LinkedIn. The paper is at arxiv.org/abs/2606.05405 and the live leaderboard is here.

Previous reading group sessions: Olmix: Data Mixing for LLM Development · Code World Models and AutoHarness · Collaborative Gym · JudgmentBench
