OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and
long-horizon demands of real-world computer use, limiting their ability to reveal the
limitations of frontier agents. We introduce OSWORLD 2.0, a benchmark of 108 long-horizon
computer-use workflows across everyday and professional tasks, designed to capture
complex and challenging real-world phenomena. Each task represents a realistic
end-to-end workflow that takes human users a median of about 1.6 hours to complete and
requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking,
compared with about 30 in OSWORLD 1.0. OSWORLD 2.0 targets challenge phenomena
that are common in real workflows yet underrepresented in prior benchmarks, spanning
interaction-design challenges such as streaming interaction and dynamic environments,
as well as agent-pattern challenges such as cross-source reasoning, implicit-state
inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and
cross-referenced against realistic stateful user profile data, and include separate safety
reports auditing safety-sensitive execution. Under our primary binary-completion metric
at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores
best but still completes only 20.6% of tasks, with a partial score of 54.8%; GPT-5.5 is far more
token-efficient yet plateaus near 13%. These results show that current agents are still
far from professional-level computer use: rather than stumbling on basic GUI control
or coding, they lose track of constraints, miss information that arrives mid-task, guess
rather than ask the user, and skip verification, struggling most when a task hinges on
hidden state they must recover.
