
SNORKEL DATA SERIES //

Specialized Computer Use Agents //

CUA-Bench+

Stress-test & train agents on the specialized tools real technical professionals use

CUA-Bench+ is a data series for training and evaluating computer use agents on realistic GUI-based work. It covers the specialized desktop and mobile applications that engineers, designers, and other technical professionals work in every day.

Built by Snorkel's AI Data Research Lab, CUA-Bench+ spans thousands of expert-curated tasks across Linux, Windows, macOS, and Android. The tasks are designed to test multi-step reasoning, error recovery, and multi-application workflows, with difficulty calibrated against current frontier models.


Domain expert-curated tasks that produce real professional artifacts

  • Artifact creation
    Building new structured professional artifacts from scratch

  • Artifact modification
    Editing existing artifacts to meet specific constraints

  • Structured export & conversion
    Exporting artifacts into specified deterministic formats

  • Project assembly
    Constructing multi-file project structures with correct relationships

  • Cross-tool integration
    Producing consistent artifacts across multiple applications

  • Domain-specific structured output
    Producing structured domain-specific definitions

CUA-Bench+ is intentionally calibrated to stress-test state-of-the-art agents

Built for frontier model evaluation and training:

  • Empirical difficulty tiers measured against current frontier models
  • Complex tasks that today's leading agents pass on fewer than 20% of attempts
  • Complex-tier tasks that span multiple applications

Success is scored from the saved output artifact, not the action trace.
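
To make artifact-based scoring concrete, here is a minimal sketch in Python. It is not the CUA-Bench+ scorer: the function name `score_artifact`, the JSON artifact format, and the field names in the usage note are all assumptions made for illustration. The point it shows is that the check reads only the file the agent saved and never looks at clicks or keystrokes.

```python
import json
from pathlib import Path


def score_artifact(artifact_path: Path, expected: dict) -> bool:
    """Hypothetical checker: pass only if the saved artifact matches `expected`.

    Only the file on disk is inspected. The action trace that produced it
    is never consulted.
    """
    if not artifact_path.is_file():
        return False  # the agent never saved an output
    try:
        saved = json.loads(artifact_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # the output exists but is not valid JSON
    if not isinstance(saved, dict):
        return False
    # Every expected field must be present with the expected value;
    # extra fields in the artifact are ignored.
    return all(key in saved and saved[key] == value
               for key, value in expected.items())
```

Under these assumptions, a call such as `score_artifact(Path("out.json"), {"format": "svg"})` passes whenever the saved file contains that field and value, regardless of which sequence of GUI actions the agent took to produce it.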

Why the Snorkel Data Series

High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

01

Expert review

Expert contributors author every task; subject-matter experts verify accuracy, clarity, and environment integrity.

02

Programmatic validation

Automated checks ensure manifest completeness, deterministic pass rates, and task uniqueness.
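
As a rough illustration of two of these checks, manifest completeness and task uniqueness, the sketch below validates a list of task manifests. The manifest schema (`task_id`, `platform`, `application`, `instruction`) and the function names are hypothetical, not the actual CUA-Bench+ format. Uniqueness is approximated by hashing a whitespace- and case-normalized instruction, which is one simple technique among many. Pass-rate determinism is left out because it would require re-running each task's checker.

```python
import hashlib

# Hypothetical schema: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("task_id", "platform", "application", "instruction")


def fingerprint(instruction: str) -> str:
    """Hash the instruction after lowercasing and collapsing whitespace."""
    normalized = " ".join(instruction.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def validate_manifests(manifests: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems: list[str] = []
    seen: dict[str, str] = {}  # instruction fingerprint -> first task label
    for index, manifest in enumerate(manifests):
        label = manifest.get("task_id") or f"<entry {index}>"
        # Completeness: every required field must be present and non-empty.
        missing = [name for name in REQUIRED_FIELDS if not manifest.get(name)]
        if missing:
            problems.append(f"{label}: missing {', '.join(missing)}")
        # Uniqueness: flag instructions that normalize to the same text.
        instruction = manifest.get("instruction")
        if isinstance(instruction, str) and instruction:
            key = fingerprint(instruction)
            if key in seen:
                problems.append(f"{label}: duplicates {seen[key]}")
            else:
                seen[key] = label
    return problems
```

Returning a list of problems rather than raising on the first failure lets a reviewer see every issue in a batch at once.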

03

Difficulty calibration

Difficulty labels are validated against performance data from frontier LLMs.
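
One way to picture this calibration is as a mapping from an empirical pass rate to a tier label. The sketch below is illustrative only. The 20% cutoff for the complex tier follows the figure quoted earlier on this page; the second cutoff and the tier names "medium" and "simple" are assumptions, as are the function names.

```python
COMPLEX_MAX_PASS_RATE = 0.20  # from the page: complex tasks pass under 20% of attempts
MEDIUM_MAX_PASS_RATE = 0.60   # assumed cutoff, for illustration only


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of attempts that passed (True) across frontier-model runs."""
    if not outcomes:
        raise ValueError("need at least one attempt to compute a pass rate")
    return sum(outcomes) / len(outcomes)


def difficulty_tier(outcomes: list[bool]) -> str:
    """Map observed pass/fail outcomes to a hypothetical tier label."""
    rate = pass_rate(outcomes)
    if rate < COMPLEX_MAX_PASS_RATE:
        return "complex"
    if rate < MEDIUM_MAX_PASS_RATE:
        return "medium"
    return "simple"
```

For example, a task that passes once in ten attempts has a pass rate of 0.1 and lands in the complex tier under these assumed thresholds.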

04

Distribution guardrails

Submissions are filtered to maintain balanced distribution across categories, platforms, and complexity levels.


Use the Snorkel Data Series to train agents that can navigate GUIs and produce real professional artifacts