Open Benchmarks Grants

OSWorld 2.0

A benchmark for evaluating computer-use agents on long-horizon, real-world workflows: 108 authentic tasks across 31 self-hosted web environments and professional desktop applications, with partial-credit scoring (avg 27.25 checkpoints per task).

Built with Snorkel AI
Overview

Computer-use agents are increasingly deployed on multi-hour professional workflows, but most benchmarks evaluate them on short desktop tasks that finish in under 30 steps. OSWorld 2.0 reframes the problem around long-horizon work, sourced from realistic end-to-end workflows that a skilled human typically takes over an hour to complete.

Headline finding: At the 500-step budget, no system completes more than 21% of tasks end-to-end. Claude Opus 4.8 leads at 20.6% binary completion and 54.8% partial score; partial scores cluster in the 20–55% range across all evaluated systems, meaning frontier agents make meaningful progress but rarely finish.

At a glance

108 long-horizon tasks
31 self-hosted websites
69.6% of tasks take humans over 1 hour
>250 average agent steps
27.25 average scoring checkpoints per task

The 108 tasks span seven professional domains and 21 sub-categories, covering research, creative production, engineering, personal services, business and finance, administration and compliance, and healthcare workflows. Tasks map to occupation families and SOC major groups, with a wage-bill-based GDP proxy estimating economic coverage. The largest shares come from document preparation, software and databases, and finance and operations analysis, with a long tail of additional professional activities.

Leaderboard

| Model | Effort | Approach | Binary | Partial | Est. Cost |
| --- | --- | --- | --- | --- | --- |
| claude-opus-4-8 | max | Batched tool | 20.6% | 54.8% | n/a |
| claude-opus-4-8 | max | Standard | 18.52% | 49.33% | n/a |
| claude-opus-4-7 | max | Batched tool | 18.2% | 48.91% | n/a |
| claude-opus-4-7 | max | Standard | 13.9% | 49.1% | $3.87K |
| gpt-5-5 | xhigh | Batch tool | 13.9% | 49.5% | $2.75K |
| claude-sonnet-4-6 | medium | Standard | 9.3% | 33.9% | $1.55K |
| claude-sonnet-4-6 | max | Standard | 10.2% | 41.5% | $2.41K |
| minimax-m3 | enabled | Standard | 4.6% | 22.3% | $258.78 |
| kimi-2-6 | enabled | Standard | 4.6% | 22.1% | $708 |
| qwen-3-7-plus | thinking | Standard | 2.8% | 21.5% | $411.56 |

| Model | Effort | Approach | Binary | Partial | Est. Cost |
| --- | --- | --- | --- | --- | --- |
| gpt-5-5 | xhigh | Batch tool | 13% | 49.5% | $2.75K |
| claude-opus-4-7 | max | Standard | 13% | 39.8% | $2.47K |
| claude-sonnet-4-6 | medium | Standard | 8.3% | 29.4% | $990 |
| claude-sonnet-4-6 | max | Standard | 6.5% | 35.8% | $1.72K |
| kimi-2-6 | enabled | Standard | 4.6% | 14.4% | $604 |
| minimax-m3 | enabled | Standard | 3.7% | 16.6% | $182.03 |
| qwen-3-7-plus | thinking | Standard | 1.9% | 16.6% | $403.33 |

| Model | Effort | Approach | Binary | Partial | Est. Cost |
| --- | --- | --- | --- | --- | --- |
| gpt-5-5 | xhigh | Batch tool | 13% | 46.7% | $1.88K |
| claude-opus-4-7 | max | Standard | 4.6% | 20.3% | $1.03K |
| claude-sonnet-4-6 | max | Standard | 4.6% | 20% | $800 |
| claude-sonnet-4-6 | medium | Standard | 4.6% | 14.2% | $410 |
| minimax-m3 | enabled | Standard | 1.9% | 8.2% | $86.82 |
| kimi-2-6 | enabled | Standard | 1.9% | 7.1% | $336 |
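One way to read cost against progress is estimated dollars per partial-score point. The sketch below computes that figure for a few rows copied from the first leaderboard listing. The derived metric and the names `Row` and `cost_per_point` are illustrative assumptions, not something the leaderboard reports, and costs shown as "K" are rounded in the source.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    effort: str
    partial_pct: float   # partial score, in percent
    est_cost_usd: float  # estimated cost, in US dollars


# Values copied from the first leaderboard listing (rows with a reported cost).
# "$3.87K" is taken as 3870.00, so these figures inherit the source's rounding.
rows = [
    Row("claude-opus-4-7", "max", 49.1, 3870.00),
    Row("gpt-5-5", "xhigh", 49.5, 2750.00),
    Row("claude-sonnet-4-6", "max", 41.5, 2410.00),
    Row("minimax-m3", "enabled", 22.3, 258.78),
]


def cost_per_point(row: Row) -> float:
    """Estimated dollars spent per percentage point of partial score."""
    return row.est_cost_usd / row.partial_pct


# Print the rows from cheapest to most expensive per point.
for row in sorted(rows, key=cost_per_point):
    print(f"{row.model:<20} {row.effort:<8} ${cost_per_point(row):7.2f} per point")
```

A low cost per point does not imply a useful agent: the cheapest rows here also have the lowest partial scores, so the figure is best read alongside the Binary and Partial columns.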

OSWorld 1.0 vs 2.0

OSWorld 2.0 is a substantial expansion of the original OSWorld evaluation: tasks span far more agent steps, cross more applications, run inside reproducible self-hosted environments, and use partial credit instead of binary completion alone.

| Capability | OSWorld 1.0 | OSWorld 2.0 |
| --- | --- | --- |
| Task horizon | <30 average agent steps | >250 average agent steps |
| Cross-application tasks | Supported in a minority | Majority; information-dependent |
| Self-hosted web environments | | 31 websites |
| Input artifacts | Mixed / synthetic | Authentic |
| Challenge categories | | 10 challenge phenomena |
| Scoring | Binary | Partial reward; 27.25 checkpoints on average |
| Safety audit | | 8 diagnostic checks |
| User interaction | | Simulated user |

Methodology

Metrics

Two metrics are reported: Binary completion (strict success) and Score (partial credit, with an average of 27.25 scoring checkpoints per task). Both are computed mainly by deterministic graders; 11.53% of the score comes from validated model-based checks.
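As a rough illustration of how the two metrics relate, the sketch below assumes each task yields a list of pass/fail checkpoint results, that a task's partial score is the fraction of checkpoints passed, and that binary completion requires every checkpoint to pass. These definitions and the names `TaskResult`, `partial_score`, `binary_completion`, and `aggregate` are assumptions for illustration, not the benchmark's actual grader; real checkpoints may carry unequal weights.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical container: one boolean per scoring checkpoint."""
    task_id: str
    checkpoints: List[bool]


def partial_score(result: TaskResult) -> float:
    """Fraction of checkpoints passed (assumes equal weights)."""
    if not result.checkpoints:
        return 0.0
    return sum(result.checkpoints) / len(result.checkpoints)


def binary_completion(result: TaskResult) -> bool:
    """Strict success: assumed here to mean every checkpoint passed."""
    return bool(result.checkpoints) and all(result.checkpoints)


def aggregate(results: List[TaskResult]) -> Dict[str, float]:
    """Average both metrics over a non-empty set of tasks."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    return {
        "binary": sum(binary_completion(r) for r in results) / n,
        "partial": sum(partial_score(r) for r in results) / n,
    }


results = [
    TaskResult("task-a", [True, True, True, True]),    # fully completed
    TaskResult("task-b", [True, True, False, False]),  # half of the checkpoints
]
summary = aggregate(results)
print(summary)
```

Under these assumptions an agent can accumulate a sizable partial score while its binary completion stays low, which matches the gap between the Binary and Partial columns in the leaderboard.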

Self-hosted environments

The benchmark runs on 31 self-hosted, high-fidelity websites with deterministic initialization, isolated execution, and reliable final-state scoring. This setup preserves open-web access while ensuring reproducibility across runs.

Step budgets

Submissions are scored at 150 / 300 / 500 agent-step budgets. The 500-step budget mirrors realistic long-horizon work; the 150-step budget surfaces efficiency.
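A step budget can be pictured as a hard cap on the agent loop. The following is a minimal sketch, assuming an agent exposes a per-step callable that returns True once it considers the task finished. `run_with_budget`, `agent_step`, and `make_toy_agent` are hypothetical names, and the benchmark's real harness may enforce budgets differently.

```python
from typing import Callable, Tuple

STEP_BUDGETS = (150, 300, 500)  # budgets named in the text


def run_with_budget(agent_step: Callable[[int], bool], max_steps: int) -> Tuple[int, bool]:
    """Call agent_step until it reports completion or the budget runs out.

    Returns (steps_used, finished).
    """
    for step in range(1, max_steps + 1):
        if agent_step(step):
            return step, True
    return max_steps, False


def make_toy_agent(steps_needed: int) -> Callable[[int], bool]:
    """Toy stand-in for an agent that finishes after a fixed number of steps."""
    return lambda step: step >= steps_needed


agent = make_toy_agent(steps_needed=260)
outcomes = {budget: run_with_budget(agent, budget) for budget in STEP_BUDGETS}
print(outcomes)
```

The toy agent needs 260 steps, so it is cut off under the 150-step budget and finishes under the other two.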

Safety audit

A separate audit pipeline runs 8 diagnostic checks on each trajectory. Safety reports are scored independently from task completion.

What the results show

Three patterns recur across the evaluated agents.

1. Higher scores require disproportionately more tokens.

Crossing the 50% partial-score threshold requires an order of magnitude more tokens than reaching 25%. Token cost grows faster than score.

2. Task horizon remains a hard limit.

Binary completion collapses as task length grows. On the longest workflows in the corpus, even top frontier agents approach zero end-to-end completion regardless of step budget.

3. Agents are weak at recovering and maintaining hidden state.

When tasks require tracking unobserved or evolving context across steps, agents lose track: they repeat earlier work, miss updates, or execute from stale plans.

Acknowledgments

OSWorld 2.0 is developed by XLANG Lab, with contributions from Snorkel AI researchers Zhengyang Qi (Jason), Vincent Sunn Chen, and Frederic Sala. Snorkel AI is the research and data partner on this project.
