Cua-Bench: benchmarking computer-use agents on professional software

June 15, 2026

Zhengyang (Jason) Qi, Armin Parchami

TL;DR

We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers:

4 out of 25 full passes (16%). 16 hit the step cap without ever signalling completion. 5 declared they were done on a wrong artifact.
0 out of 16 build-from-scratch tasks succeeded. Every full pass was a single-component edit on a pre-existing schematic.
Failures concentrate in planning and perception. ~80% of error-mode mentions land in planning, perception, or navigation inefficiency.

1. Why build a computer-use benchmark for electrical engineering?

Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness — the tasks are short, the domain knowledge bar is low, and the UIs are designed to be intuitive for anyone. A frontier model can often muddle through by trial and error. That is not what real professional software looks like.

We wanted a domain that pushed agents on the dimensions that actually matter for real economic value:

Genuinely complex, professional software. KiCad is a full electronic design automation (EDA) suite. It has a dense shortcut grammar, multiple linked editors (Project Manager, Schematic, PCB, Symbol Library), and a thousand modal dialogs.
Real, valuable use cases. Schematic editing, component placement, value sizing, and netlist export are exactly the kind of work an engineer does every day. There is real demand for an agent that can take a description and produce a valid schematic.
A natural long-horizon structure. Even a modest circuit requires placing several components, wiring them correctly, setting values, running electrical rule checks, and saving. There is no single-shot solution: the agent has to plan, sequence, and self-verify.
Knowledge AND GUI execution, intertwined. Most tasks are unsolvable without both a working mental model of the circuit (“the LT3010 has a 0.808V reference voltage”) and the GUI fluency to translate that into clicks.
Objective, verifiable ground truth. A schematic either has the required components and wires or it does not. The netlist is machine-checkable. This makes graders cheap, deterministic, and unambiguous, and it avoids the typical shortcomings of LLM-judge-based evaluation.
An open-source, batteries-included target. KiCad ships with a massive symbol library (thousands of parts across hundreds of vendor catalogs) which means we can ask for realistic designs (LM358 op-amps, LT3010 LDOs, NE555 timers) without licensing concerns or custom asset creation.

In short: EDA gives us a setting where solving the task requires real domain expertise, real long-horizon planning, real perception of a complex GUI, and tolerates none of the shortcuts that simpler benchmarks accidentally allow.

2. How the tasks were built

Every one of the 25 tasks was authored by a practicing electrical engineer. Each task came with three artifacts:

A natural-language prompt describing what the agent should accomplish, e.g. “Add a load resistor R5 at the output of the second op-amp stage so that Vout sees a 10kΩ load to ground.”
An initial KiCad project (when the task started from an existing schematic), so the agent had a concrete starting point rather than a blank canvas.
A ground-truth final project state, which the verifier compares against the agent’s saved output. The verifier reads the netlist and checks for the right components, the right values, and the right connectivity.
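
The three checks in that last artifact (components, values, connectivity) can be illustrated with a minimal sketch. This is not the actual Cua-Bench verifier: the `Netlist` class, `make_netlist`, `verify`, and the `"REF.PIN"` naming are all hypothetical, and the sketch assumes the netlist has already been parsed into plain Python structures.

```python
from dataclasses import dataclass


@dataclass
class Netlist:
    """Hypothetical, simplified view of an already-parsed netlist."""
    components: dict   # reference -> value, e.g. {"R5": "10k"}
    nets: frozenset    # each net is a frozenset of "REF.PIN" strings


def make_netlist(components, nets):
    # Net names are ignored on purpose: only which pins share a net matters.
    return Netlist(dict(components), frozenset(frozenset(n) for n in nets))


def verify(expected, actual):
    """Return a list of mismatches; an empty list means the check passes."""
    problems = []
    for ref, value in expected.components.items():
        if ref not in actual.components:
            problems.append(f"missing component {ref}")
        elif actual.components[ref] != value:
            problems.append(
                f"{ref}: expected value {value}, got {actual.components[ref]}"
            )
    # Simplification: a net must match exactly, pin for pin.
    for net in expected.nets - actual.nets:
        problems.append(f"missing connection {sorted(net)}")
    return problems


expected = make_netlist({"R5": "10k"}, [{"U1.7", "R5.1"}, {"R5.2", "GND.1"}])
attempt = make_netlist({"R5": "1k"}, [{"U1.7", "R5.1"}])
for problem in verify(expected, attempt):
    print(problem)
```

In this toy run the attempt is flagged twice: once for the wrong value on R5 and once for the missing connection to ground. A real verifier would likely need looser matching, for example tolerating extra pins on a net.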

Tasks were calibrated to be modest in length for a human expert. Given current agent capabilities, we explicitly asked annotators to write tasks that they themselves could complete in around 50 steps or fewer. The aim was not to make the benchmark easy, since every task in the suite is already well within what a competent EE intern can do in a few minutes; it was to ensure that the 150-step budget the agent gets is genuinely generous (about three times the human step count).

A second engineer reviewed every task. We had an independent electrical engineer check each task end-to-end: that the prompt was unambiguous, that the initial project actually matched what the prompt described, that the ground-truth state was a valid solution (no errors in the arithmetic, no incorrect topology, no impossible component values), and that the prompt could not be reasonably interpreted in a way that would make the verifier mark a correct solution as wrong.

The resulting suite spans single-edit tasks (“change C1 to 2 nF”) through to full builds (“design a bipolar-to-unipolar converter from two op-amps, six resistors, and six capacitors”). The tasks anchor on a range of skills: bias-point arithmetic, LDO feedback dividers, Sallen-Key filter design, NE555 astable topology, H-bridge decoupling, switching regulators, and basic file-protocol operations like exporting a netlist.

3. What we learned

We ran Claude Sonnet 4.5 against all 25 tasks (with one Haiku 4.5 trajectory included as a data point). The results paint a strikingly coherent picture of where current computer-use agents are strong and where they fall off a cliff.

3.1 The capability cliff: edit-existing vs build-from-scratch

The single most striking finding is a sharp capability cliff at the boundary between editing an existing schematic and building one from scratch.

Every task whose required-edits count is ≤ 2 is at least partially solvable. Every task that requires placing 3+ components and drawing wires either fails outright or earns only partial credit.

The four full passes (changing a capacitor value, swapping a +9V power port for +5V, resizing a resistor for a target current ratio, centering a microphone bias point) all share the same shape: open the project, find the one component that needs touching, edit it, save, declare done. The agent has the domain arithmetic and the GUI navigation for these. What it does not have, yet, is the multi-phase planning loop needed to place a dozen components, wire them, set every value, run ERC, and self-verify before declaring completion.

3.2 What the agent is failing at, and why

The failure modes are not random. We catalogued every error across the 25 trajectories and clustered them. The same handful of patterns appear over and over.

Four clusters appear in essentially every failed run, and they compound:

App bring-up and navigation overhead (84%). First-run dialogs, settings-path prompts, update checks, the Symbol Library Table dialog — and worst of all, clicking through to the PCB Editor when the task wants the Schematic Editor. Recovering from a wrong-editor state costs 25–70 turns of pure overhead before any productive work begins.
One-action-per-turn cadence with verbose narration (84%). The agent treats each turn as a single click followed by a paragraph of self-narration. A component placement that an engineer would batch into three actions becomes ten turns of “Now I will click Place. Now I will select Symbol. Now I will type the part name…” Every avoided shortcut becomes a multi-click menu chase.
Zoom and pan oscillation (76%). Without using the Home key (zoom-to-fit), the agent gets stuck scroll-wheeling between Z=0.55 and Z=100, hunting for components it has already lost off-screen.
Wiring never drawn (72%). When the agent does place components, it rarely closes the loop by drawing wires. Across all 16 max-step failures, zero schematics had all required wires drawn.

Rolled up to root causes: 40% of error-mode mentions are about planning and policy, 22% about perception, 19% about navigation inefficiency, 11% about domain knowledge gaps, and just 8% about the tool/API surface itself. Crucially, there were zero API errors anywhere in the cohort, which indicates that the underlying harness works as expected. The failures are in how the agent uses it.
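
As a quick check on the roll-up, the five shares can be summed directly. The snippet below is only illustrative arithmetic on the percentages quoted above; the dictionary keys and the `share_of` helper are names chosen for this sketch.

```python
# Share of error-mode mentions per root cause, in percent (from the text above).
ROOT_CAUSE_SHARE = {
    "planning and policy": 40,
    "perception": 22,
    "navigation inefficiency": 19,
    "domain knowledge gaps": 11,
    "tool/API surface": 8,
}


def share_of(categories):
    """Sum the percentage share of the given root-cause categories."""
    return sum(ROOT_CAUSE_SHARE[name] for name in categories)


total = share_of(ROOT_CAUSE_SHARE)
top_three = share_of(
    ["planning and policy", "perception", "navigation inefficiency"]
)
print(total, top_three)
```

The shares add up to 100, and planning, perception, and navigation together account for 81, which is the "~80%" figure in the TL;DR.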

3.3 Step budget is not the bottleneck

It is tempting to look at “16 max-step failures” and conclude the agent just needs a bigger budget. The data rejects that interpretation. We extended two unfinished runs with a 500-step allowance, and both still failed (using 467 and 488 steps respectively); one of them earned a flat zero. Meanwhile, every full pass finished in under 150 turns. More steps do not help when the underlying plan is wrong.

In a typical failed run, roughly 30–40% of the budget goes to app bring-up and recovering from wrong-editor states, 40% to slow component-by-component placement, and 20–30% to stuck loops where the agent retries the same click after no UI change. By the time the agent is ready to wire, it has no budget left.
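
To make those proportions concrete, here is a rough conversion into steps under the 150-step cap. The percentages are the approximate ranges quoted above, so the output is illustrative rather than measured; the function and variable names are chosen for this sketch.

```python
STEP_BUDGET = 150  # per-task step cap stated earlier in the article

# (low %, high %) of the budget per activity, as quoted in the text.
BREAKDOWN_PCT = {
    "bring-up and wrong-editor recovery": (30, 40),
    "component-by-component placement": (40, 40),
    "stuck retry loops": (20, 30),
}


def to_steps(pct, budget=STEP_BUDGET):
    """Convert a whole-number percentage of the budget into steps."""
    return pct * budget // 100


def step_ranges(budget=STEP_BUDGET):
    """Map each activity to its (low, high) step count."""
    return {
        name: (to_steps(lo, budget), to_steps(hi, budget))
        for name, (lo, hi) in BREAKDOWN_PCT.items()
    }


def steps_left_for_wiring(budget=STEP_BUDGET):
    """Return (worst case, best case) steps remaining after the overhead."""
    ranges = step_ranges(budget)
    best = budget - sum(lo for lo, _ in ranges.values())
    worst = budget - sum(hi for _, hi in ranges.values())
    return max(worst, 0), max(best, 0)


for name, (lo, hi) in step_ranges().items():
    print(f"{name}: {lo}-{hi} steps")
print("left for wiring (worst, best):", steps_left_for_wiring())
```

Even at the favourable end of every range, only about 15 of the 150 steps remain for wiring; at the unfavourable end there are none.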

3.4 Self-verification is brittle

Five of the nine runs that emitted a DONE token failed verification — usually because the agent verified by re-narrating its own intent rather than reading ground truth from the UI. A few representative cases:

Claimed “R10 set to 2.80kOhm and saved” — but typed the value literally as “2.80kOhm” instead of the canonical KiCad form “2.8k”. Topology checks passed, value checks failed.
Claimed “R2 = 16.5k for V_REF = 1.24V” on an LT3010 LDO — but the LT3010’s reference voltage is 0.808V, not 1.24V, so the correct R2 was ~9.31k.
Claimed a complete current loop “BT1+ → R1 → D1 → D2 → BT1−” — but D2 was actually dangling in the schematic.

In every case, reading the title bar (which shows an asterisk on unsaved changes), the status bar, the ERC dialog, or the saved netlist text would have caught the error. The agent reasoned over its own narration instead.
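
The first case above suggests one cheap self-check: normalise the string that was typed and compare it with what the field should contain. The sketch below is a minimal illustration, assuming values of the form number, optional SI prefix, optional "Ohm" suffix; `canonical_value` is a hypothetical helper, not part of KiCad or the Cua harness.

```python
import re
from decimal import Decimal

# number, optional SI prefix, optional "Ohm"/"ohms"/"Ω" suffix
_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([pnumkKMG]?)\s*(?:[Oo]hms?|Ω)?\s*$"
)


def canonical_value(text: str) -> str:
    """Reduce a typed value such as '2.80kOhm' to a short form like '2.8k'."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised value: {text!r}")
    number, prefix = match.groups()
    # Drop trailing zeros without switching to scientific notation.
    digits = format(Decimal(number).normalize(), "f")
    return digits + ("k" if prefix == "K" else prefix)


typed = "2.80kOhm"
if typed != canonical_value(typed):
    print(f"field holds {typed!r}; canonical form is {canonical_value(typed)!r}")
```

A check like this only covers the formatting failure. The wrong reference voltage and the dangling diode would need the ERC dialog or the saved netlist, as noted above.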

Wrapping up

Cua-Bench EDA gives a sharp, fair picture of where computer-use agents stand on real professional software in mid-2026: confident on short, linear edits; unable to plan and execute the multi-phase workflows that everyday engineering work demands. The good news is that the failure modes are concrete, repeatable, and addressable – they are not pointing to a missing underlying capability so much as a missing meta-policy for how to use the capabilities the model already has.

We are continuing to extend the suite. We are running the same tasks against additional frontier models, with more results to be published this week. We look forward to seeing how the next generation of agents handles them. Thanks to the Cua team for the harness, the verifier, and the support that made this analysis possible. For more, visit cua.ai and see Cua’s launch note.

Zhengyang (Jason) Qi

Research Scientist

I am an aspiring AI researcher with a diverse range of experience in frontier AI research, large scalable machine learning systems, and applied analytics in social science. I believe in the interactionist approach to intelligence development, through granular feedback from grounded, open-ended environments, where robust rewards are essential to forge systems that learn, adapt, and evolve through interactions.

Armin Parchami

Sr. Director, R&D

Armin Parchami is the Senior Director, R&D, at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.