OSWorld 2.0

A benchmark for evaluating computer-use agents on long-horizon, real-world workflows: 108 authentic tasks across 31 self-hosted web environments and professional desktop applications, with partial-credit scoring (avg 27.25 checkpoints per task).

Overview

Computer-use agents are increasingly deployed on multi-hour professional workflows, but most benchmarks evaluate them on short desktop tasks that finish in under 30 steps. OSWorld 2.0 reframes the problem around long-horizon work, sourced from realistic end-to-end workflows that a skilled human typically takes over an hour to complete.

Headline finding: At the 500-step budget, no system completes more than 21% of tasks end-to-end. Claude Opus 4.8 leads at 20.6% binary completion and 54.8% partial score; partial scores cluster in the 20–55% range across all evaluated systems, meaning frontier agents make meaningful progress but rarely finish.

At a glance

108 long-horizon tasks
31 self-hosted websites
69.6% take humans over 1 hour
>250 average agent steps
27.25 average scoring checkpoints per task

The 108 tasks span seven professional domains and 21 sub-categories, covering research, creative production, engineering, personal services, business and finance, administration and compliance, and healthcare workflows. Tasks map to occupation families and SOC major groups, with a wage-bill-based GDP proxy estimating economic coverage. The largest shares come from document preparation, software and databases, and finance and operations analysis, with a long tail of additional professional activities.

Leaderboard

500-step budget

Model	Effort	Approach	Binary	Partial	Est. Cost
claude-opus-4-8	max	Batched tool	20.6%	54.8%	n/a
claude-opus-4-8	max	Standard	18.52%	49.33%	n/a
claude-opus-4-7	max	Batched tool	18.2%	48.91%	n/a
claude-opus-4-7	max	Standard	13.9%	49.1%	$3.87K
gpt-5-5	xhigh	Batched tool	13%	49.5%	$2.75K
claude-sonnet-4-6	medium	Standard	9.3%	33.9%	$1.55K
claude-sonnet-4-6	max	Standard	8.3%	41.5%	$2.41K
minimax-m3	enabled	Standard	4.6%	22.3%	$258.78
kimi-2-6	enabled	Standard	4.6%	22.1%	$708
qwen-3-7-plus	thinking	Standard	2.8%	21.5%	$411.56

300-step budget

Model	Effort	Approach	Binary	Partial	Est. Cost
gpt-5-5	xhigh	Batched tool	13%	49.5%	$2.75K
claude-opus-4-7	max	Standard	13%	39.8%	$2.47K
claude-sonnet-4-6	medium	Standard	8.3%	29.4%	$990
claude-sonnet-4-6	max	Standard	6.5%	35.8%	$1.72K
kimi-2-6	enabled	Standard	4.6%	14.4%	$604
minimax-m3	enabled	Standard	3.7%	16.6%	$182.03
qwen-3-7-plus	thinking	Standard	1.9%	16.6%	$403.33

150-step budget

Model	Effort	Approach	Binary	Partial	Est. Cost
gpt-5-5	xhigh	Batched tool	13%	46.7%	$1.88K
claude-opus-4-7	max	Standard	4.6%	20.3%	$1.03K
claude-sonnet-4-6	max	Standard	4.6%	20%	$800
claude-sonnet-4-6	medium	Standard	4.6%	14.2%	$410
minimax-m3	enabled	Standard	1.9%	8.2%	$86.82
kimi-2-6	enabled	Standard	1.9%	7.1%	$336

Trajectory showcase

Inspect complete agent trajectories step by step. Explore all task trajectories.

Task 004: Slide formatting
Task 008: Oracle reimbursement
Task 024: DS-2019 Visa
Task 035: Purchase requests
Task 052: Travel booking
Task 053: Video masking
Task 098: DS-160 Visa Form
Task 103: FreeCAD Bracket

OSWorld 1.0 vs 2.0

OSWorld 2.0 is a substantial expansion of the original OSWorld evaluation: tasks span far more agent steps, cross more applications, run inside reproducible self-hosted environments, and use partial credit instead of binary completion alone.

Capability	OSWorld 1.0	OSWorld 2.0
Task horizon	<30 average agent steps	>250 average agent steps
Cross-application tasks	Supported in a minority	Majority; information-dependent
Self-hosted web environments	—	31 websites
Input artifacts	Mixed / synthetic	Authentic
Challenge categories	—	10 challenge phenomena
Scoring	Binary	Partial reward; 27.25 checkpoints on average
Safety audit	—	8 diagnostic checks
User interaction	—	Simulated user

Methodology

Metrics

Binary completion measures strict success: a task counts only if it is completed end-to-end. Score is partial credit over each task's scoring checkpoints (27.25 per task on average). Both are computed primarily by deterministic graders; 11.53% of the score comes from validated model-based checks.
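To make the difference between the two metrics concrete, here is a minimal sketch. It assumes every checkpoint in a task carries equal weight and that benchmark-level numbers are plain means over tasks; the source does not state the weighting, so treat both as assumptions. `TaskResult`, `partial_score`, `binary_success`, and `aggregate` are hypothetical names, not part of the OSWorld 2.0 codebase.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task: one boolean per scoring checkpoint (hypothetical shape)."""
    task_id: str
    checkpoints: list[bool]


def partial_score(result: TaskResult) -> float:
    """Fraction of checkpoints passed, assuming equal checkpoint weights."""
    if not result.checkpoints:
        return 0.0
    return sum(result.checkpoints) / len(result.checkpoints)


def binary_success(result: TaskResult) -> bool:
    """Strict success: every checkpoint must pass."""
    return bool(result.checkpoints) and all(result.checkpoints)


def aggregate(results: list[TaskResult]) -> dict[str, float]:
    """Benchmark-level percentages as simple means over tasks (an assumption)."""
    n = len(results)
    return {
        "binary": 100 * sum(binary_success(r) for r in results) / n,
        "partial": 100 * sum(partial_score(r) for r in results) / n,
    }


# Illustrative, made-up results: one fully solved task, one half-solved task.
demo = [
    TaskResult("demo-a", [True, True, True, True]),
    TaskResult("demo-b", [True, False, True, False]),
]
summary = aggregate(demo)
```

In this toy case the half-solved task contributes nothing to binary completion but still earns half credit in the partial score, which is why the two leaderboard columns can diverge so widely.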

Self-hosted environments

31 self-hosted, high-fidelity websites with deterministic initialization, isolated execution, and reliable final-state scoring. The setup preserves open-web access while ensuring reproducibility across runs.

Step budgets

Submissions are scored at 150 / 300 / 500 agent-step budgets. The 500-step budget mirrors realistic long-horizon work; the 150-step budget surfaces efficiency.
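One rough way to read "efficiency" off the leaderboard is to ask how much of a system's 500-step partial score it already reaches at 150 steps. The sketch below copies partial scores from the leaderboard and assumes the three tables correspond to the 500-, 300-, and 150-step budgets in that order; the page does not label them, so that mapping is an assumption. `retention` is a hypothetical helper, not an official metric.

```python
# Partial scores (%) copied from the leaderboard tables.
# ASSUMPTION: the tables appear in 500-, 300-, 150-step order.
PARTIAL_BY_BUDGET = {
    "gpt-5-5 (xhigh)": {500: 49.5, 300: 49.5, 150: 46.7},
    "claude-opus-4-7 (max, Standard)": {500: 49.1, 300: 39.8, 150: 20.3},
    "claude-sonnet-4-6 (max, Standard)": {500: 41.5, 300: 35.8, 150: 20.0},
}


def retention(scores: dict[int, float], small: int = 150, large: int = 500) -> float:
    """Share of the large-budget partial score already reached at the small budget."""
    return scores[small] / scores[large]


for system, scores in PARTIAL_BY_BUDGET.items():
    print(f"{system}: {retention(scores):.0%} of the 500-step score at 150 steps")
```

Under that assumed mapping, systems with similar 500-step scores can differ sharply in how quickly they get there, which is the kind of gap the 150-step budget is meant to expose.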

Safety audit

A separate audit pipeline runs 8 diagnostic checks on each trajectory. Safety reports are scored independently from task completion.

What the results show

Three patterns recur across the evaluated agents.

1. Higher scores require disproportionately more tokens.

Crossing the 50% partial-score threshold requires an order of magnitude more tokens than reaching 25%. Token cost grows faster than score.

2. Task horizon remains a hard limit.

Binary completion collapses as task length grows. On the longest workflows in the corpus, even top frontier agents fall to near-zero end-to-end completion regardless of step budget.

3. Agents are weak at recovering and maintaining hidden state.

When tasks require tracking unobserved or evolving context across steps, agents lose track: they repeat earlier work, miss updates, or execute from stale plans.
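The first pattern can be loosely illustrated with the leaderboard itself, using estimated cost as a stand-in for token usage. That substitution is an assumption: cost also reflects per-model pricing, and the dollar figures are rounded as published. The rows below are taken from the first leaderboard table (the one containing the headline 500-step results), restricted to entries with a reported cost. `cost_per_point` is a hypothetical helper.

```python
# (model, effort, partial score %, estimated cost in USD), copied from the
# first leaderboard table; rows with "n/a" cost are omitted.
ROWS = [
    ("claude-opus-4-7", "max", 49.1, 3870.00),
    ("gpt-5-5", "xhigh", 49.5, 2750.00),
    ("claude-sonnet-4-6", "medium", 33.9, 1550.00),
    ("claude-sonnet-4-6", "max", 41.5, 2410.00),
    ("minimax-m3", "enabled", 22.3, 258.78),
    ("kimi-2-6", "enabled", 22.1, 708.00),
    ("qwen-3-7-plus", "thinking", 21.5, 411.56),
]


def cost_per_point(partial_pct: float, cost_usd: float) -> float:
    """Estimated dollars spent per percentage point of partial score."""
    return cost_usd / partial_pct


# Sort from cheapest to most expensive per point of partial score.
ranked = sorted(ROWS, key=lambda row: cost_per_point(row[2], row[3]))

for model, effort, partial, cost in ranked:
    print(f"{model} ({effort}): ${cost_per_point(partial, cost):.2f} per point")
```

Read this way, the systems near 50% partial score pay several times more per point than those near 22%, which is consistent in direction with the token-scaling claim, though cost is only a proxy for tokens.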

Resources

Paper

GitHub

Data

Task showcase

Website

Acknowledgments

OSWorld 2.0 is developed by XLANG Lab, with contributions from Snorkel AI researchers Zhengyang Qi (Jason), Vincent Sunn Chen, and Frederic Sala. Snorkel AI is the research and data partner on this project.
