Stress-test & train agents on the specialized tools real technical professionals use

CUA-Bench+ is a data series for training and evaluating computer use agents on realistic GUI-based work. It covers the specialized desktop and mobile applications that engineers, designers, and other technical professionals work in every day.

Built by Snorkel's AI Data Research Lab, CUA-Bench+ spans thousands of expert-curated tasks across Linux, Windows, macOS, and Android. The tasks are designed to test multi-step reasoning, error recovery, and multi-application workflows, with difficulty calibrated against current frontier models.

Domain expert-curated tasks that produce real professional artifacts

Artifact creation
Building new structured professional artifacts from scratch

Artifact modification
Editing existing artifacts to meet specific constraints

Structured export & conversion
Exporting artifacts into specified deterministic formats

Project assembly
Constructing multi-file project structures with correct relationships

Cross-tool integration
Producing consistent artifacts across multiple applications

Domain-specific structured output
Producing structured domain-specific definitions

CUA-Bench+ is intentionally calibrated to stress-test state-of-the-art agents

Built for frontier model evaluation and training:

Empirical difficulty tiers measured against current frontier models
Complex tasks where today's leading agents pass fewer than 20% of attempts
Complex-tier tasks across multiple applications

Success is scored from the saved output artifact, not the action trace.
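As a rough illustration of artifact-based scoring, the sketch below checks a saved file on disk and never looks at the agent's actions. It assumes the artifact is a JSON file; the function name `score_artifact` and the expected fields are hypothetical and are not part of CUA-Bench+.

```python
import json
from pathlib import Path


def score_artifact(artifact_path: Path, expected: dict) -> bool:
    """Return True if the saved JSON artifact contains every expected key/value.

    Only the file on disk is inspected. The agent's clicks and keystrokes
    (the action trace) are never consulted.
    """
    if not artifact_path.exists():
        return False  # the agent never saved an output
    try:
        saved = json.loads(artifact_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # the file exists but is not valid JSON
    if not isinstance(saved, dict):
        return False  # this sketch only handles JSON objects
    return all(key in saved and saved[key] == value
               for key, value in expected.items())
```

Scoring this way means two agents that reach the same saved result by different click paths receive the same score.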

Why the Snorkel Data Series

High-volume quarterly drops

Multi-layer quality pipeline

Unified execution environment

Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

Expert review

Expert contributors author every task; subject-matter experts verify accuracy, clarity, and environment integrity.

Programmatic validation

Automated checks ensure manifest completeness, deterministic pass rates, and task uniqueness.
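A minimal sketch of what such checks could look like follows. The manifest fields, the helper names, and the reading of "deterministic" as "the checker returns the same result on repeated runs" are all assumptions made for illustration, not the actual CUA-Bench+ pipeline.

```python
import hashlib
from typing import Callable

# Hypothetical required fields; the real manifest schema is not published here.
REQUIRED_FIELDS = ("task_id", "platform", "instruction", "checker")


def missing_fields(manifest: dict) -> list[str]:
    """List required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]


def fingerprint(manifest: dict) -> str:
    """Hash the instruction with case and whitespace normalized."""
    normalized = " ".join(str(manifest.get("instruction", "")).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_deterministic(check: Callable[[], bool], runs: int = 3) -> bool:
    """Assumed meaning: the checker gives the same verdict on every run."""
    return len({check() for _ in range(runs)}) == 1


def validate(manifests: list[dict]) -> list[str]:
    """Return human-readable problems: incomplete manifests and duplicates."""
    problems: list[str] = []
    seen: dict[str, str] = {}
    for index, manifest in enumerate(manifests):
        label = manifest.get("task_id") or f"#{index}"
        for field in missing_fields(manifest):
            problems.append(f"{label}: missing {field}")
        digest = fingerprint(manifest)
        if digest in seen:
            problems.append(f"{label}: duplicates {seen[digest]}")
        else:
            seen[digest] = label
    return problems
```

Exact-match hashing only catches near-verbatim duplicates; a real uniqueness check would likely need a fuzzier comparison.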

Difficulty calibration

Difficulty labels are validated against performance data from current frontier models.
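One way to picture empirical calibration is to derive a tier from a measured pass rate, as in the sketch below. Only the 20% threshold for the complex tier comes from this page; the other threshold, the tier names "moderate" and "simple", and both function names are hypothetical.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of attempts that passed, given one bool per attempt."""
    if not outcomes:
        raise ValueError("at least one attempt is required")
    return sum(outcomes) / len(outcomes)


def difficulty_tier(rate: float,
                    complex_below: float = 0.20,    # threshold stated on this page
                    moderate_below: float = 0.60    # hypothetical threshold
                    ) -> str:
    """Map a frontier-model pass rate to a tier label (labels are illustrative)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate < complex_below:
        return "complex"
    if rate < moderate_below:
        return "moderate"
    return "simple"
```

A label assigned by an author could then be compared with the tier implied by the measured rate, and flagged for review when the two disagree.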

Distribution guardrails

Submissions are filtered to maintain balanced distribution across categories, platforms, and complexity levels.

Train agents that can navigate GUIs and produce real professional artifacts with the Snorkel Data Series
