OSWorld 2.0
A benchmark for evaluating computer-use agents on long-horizon, real-world workflows: 108 authentic tasks across 31 self-hosted web environments and professional desktop applications, with partial-credit scoring (avg 27.25 checkpoints per task).


Computer-use agents are increasingly deployed on multi-hour professional workflows, but most benchmarks evaluate them on short desktop tasks that finish in under 30 steps. OSWorld 2.0 reframes the problem around long-horizon work: its tasks are sourced from realistic end-to-end workflows that typically take a skilled human over an hour to complete.
Headline finding: At the 500-step budget, no system completes more than 21% of tasks end-to-end. Claude Opus 4.8 leads at 20.6% binary completion and 54.8% partial score; partial scores cluster in the 20–55% range across all evaluated systems, meaning frontier agents make meaningful progress but rarely finish.
At a glance
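| Metric | Value |
|---|---|
| Long-horizon tasks | 108 |
| Self-hosted web environments and professional desktop applications | 31 |
| Tasks that take humans over 1 hour | 69.6% |
| Average agent steps | >250 |
| Average scoring checkpoints per task | 27.25 |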
Leaderboard
| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| claude-opus-4-8 | max | Batched tool | 20.6% | 54.8% | n/a |
| claude-opus-4-8 | max | Standard | 18.52% | 49.33% | n/a |
| claude-opus-4-7 | max | Batched tool | 18.2% | 48.91% | n/a |
| claude-opus-4-7 | max | Standard | 13.9% | 49.1% | $3.87K |
| gpt-5-5 | xhigh | Batch tool | 13.9% | 49.5% | $2.75K |
| claude-sonnet-4-6 | medium | Standard | 9.3% | 33.9% | $1.55K |
| claude-sonnet-4-6 | max | Standard | 10.2% | 41.5% | $2.41K |
| minimax-m3 | enabled | Standard | 4.6% | 22.3% | $258.78 |
| kimi-2-6 | enabled | Standard | 4.6% | 22.1% | $708 |
| qwen-3-7-plus | thinking | Standard | 2.8% | 21.5% | $411.56 |

| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| gpt-5-5 | xhigh | Batch tool | 13% | 49.5% | $2.75K |
| claude-opus-4-7 | max | Standard | 13% | 39.8% | $2.47K |
| claude-sonnet-4-6 | medium | Standard | 8.3% | 29.4% | $990 |
| claude-sonnet-4-6 | max | Standard | 6.5% | 35.8% | $1.72K |
| kimi-2-6 | enabled | Standard | 4.6% | 14.4% | $604 |
| minimax-m3 | enabled | Standard | 3.7% | 16.6% | $182.03 |
| qwen-3-7-plus | thinking | Standard | 1.9% | 16.6% | $403.33 |

| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| gpt-5-5 | xhigh | Batch tool | 13% | 46.7% | $1.88K |
| claude-opus-4-7 | max | Standard | 4.6% | 20.3% | $1.03K |
| claude-sonnet-4-6 | max | Standard | 4.6% | 20% | $800 |
| claude-sonnet-4-6 | medium | Standard | 4.6% | 14.2% | $410 |
| minimax-m3 | enabled | Standard | 1.9% | 8.2% | $86.82 |
| kimi-2-6 | enabled | Standard | 1.9% | 7.1% | $336 |
Trajectory showcase
Inspect complete agent trajectories step by step. Explore all task trajectories.
OSWorld 1.0 vs 2.0
OSWorld 2.0 is a substantial expansion of the original OSWorld evaluation: tasks span far more agent steps, cross more applications, run inside reproducible self-hosted environments, and use partial credit instead of binary completion alone.
10 challenge phenomena
Methodology
Metrics
Submissions are scored at 150 / 300 / 500 agent-step budgets. The 500-step budget mirrors realistic long-horizon work; the 150-step budget surfaces efficiency.
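As a rough illustration of how binary and partial scores could be derived at each step budget, here is a minimal sketch. It is not the benchmark's actual scorer: the `TaskRun` record, the function names, and the scoring rule are all assumptions. It assumes a task's partial score is the fraction of its checkpoints satisfied within the budget, that binary completion means every checkpoint is satisfied, and that a checkpoint stays satisfied once reached.

```python
from dataclasses import dataclass

# Step budgets named in the text above.
STEP_BUDGETS = (150, 300, 500)


@dataclass
class TaskRun:
    """Hypothetical record of one agent attempt at one task."""
    total_checkpoints: int
    # Step index at which each satisfied checkpoint was first reached.
    # Checkpoints that were never satisfied are simply absent.
    checkpoint_steps: list[int]


def score_at_budget(run: TaskRun, budget: int) -> tuple[float, bool]:
    """Return (partial score in [0, 1], binary completion) within a budget."""
    if run.total_checkpoints <= 0:
        raise ValueError("a task needs at least one checkpoint")
    passed = sum(1 for step in run.checkpoint_steps if step <= budget)
    return passed / run.total_checkpoints, passed == run.total_checkpoints


def aggregate(runs: list[TaskRun], budget: int) -> dict[str, float]:
    """Average per-task scores into leaderboard-style percentages."""
    scores = [score_at_budget(run, budget) for run in runs]
    n = len(scores)
    return {
        "budget": budget,
        "partial_pct": 100 * sum(partial for partial, _ in scores) / n,
        "binary_pct": 100 * sum(1 for _, done in scores if done) / n,
    }


if __name__ == "__main__":
    # Two made-up runs: a 4-checkpoint task and a 2-checkpoint task.
    runs = [TaskRun(4, [10, 120, 280, 450]), TaskRun(2, [50, 400])]
    for budget in STEP_BUDGETS:
        print(aggregate(runs, budget))
```

Under these assumptions, an agent can earn a high partial score at a small budget while binary completion stays at zero until the last checkpoint of a task is reached, which is the gap the two leaderboard columns expose.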
Safety audit
A separate audit pipeline runs 8 diagnostic checks on each trajectory. Safety reports are scored independently from task completion.
What the results show
Three patterns recur across the evaluated agents.
Higher scores require disproportionately more tokens.
Crossing the 50% partial-score threshold requires an order of magnitude more tokens than reaching 25%. Efficiency scales worse than capability.
Task horizon remains a hard limit.
Binary completion collapses as task length grows. On the longest workflows in the corpus, top frontier agents fall to near-zero end-to-end completion regardless of step budget.
Agents are weak at recovering and maintaining hidden state.
When tasks require tracking unobserved or evolving context across steps, agents lose track: they repeat earlier work, miss updates, or execute from stale plans.
Resources
Acknowledgments
OSWorld 2.0 is developed by XLANG Lab, with contributions from Snorkel AI researchers Zhengyang Qi (Jason), Vincent Sunn Chen, and Frederic Sala. Snorkel AI is the research and data partner on this project.









