SNORKEL DATA SERIES //
Specialized Computer Use Agents //
CUA-Bench+
Stress-test & train agents on the specialized tools real technical professionals use
CUA-Bench+ is a data series for training and evaluating computer use agents on realistic GUI-based work. It covers the specialized desktop and mobile applications that engineers, designers, and other technical professionals work in every day.
Built by Snorkel's AI Data Research Lab, CUA-Bench+ spans thousands of expert-curated tasks across Linux, Windows, macOS, and Android. The tasks are designed to test multi-step reasoning, error recovery, and multi-application workflows, with difficulty calibrated against current frontier models.
Domain expert-curated tasks that produce real professional artifacts
Artifact creation
Building new structured professional artifacts from scratch
Artifact modification
Editing existing artifacts to meet specific constraints
Structured export & conversion
Exporting artifacts into specified deterministic formats
Project assembly
Constructing multi-file project structures with correct relationships
Cross-tool integration
Producing consistent artifacts across multiple applications
Domain-specific structured output
Producing structured domain-specific definitions
CUA-Bench+ is intentionally calibrated to stress-test state-of-the-art agents
Built for frontier model evaluation and training:
- Empirical difficulty tiers measured against current frontier models
- Complex tasks where today's leading agents pass fewer than 20% of attempts
- Complex-tier tasks that span multiple applications
Success is scored from the saved output artifact, not the action trace.
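As a rough illustration of artifact-based scoring, the sketch below checks a saved output file against expected values and ignores how the agent got there. It is not the CUA-Bench+ grader: the JSON artifact format, the `score_artifact` function, and the `expected` field mapping are all assumptions made for this example.

```python
import json
from pathlib import Path


def score_artifact(artifact_path: Path, expected: dict) -> bool:
    """Hypothetical checker: pass only if the saved file exists, parses as a
    JSON object, and contains every expected field with the expected value.
    The action trace is never consulted."""
    if not artifact_path.is_file():
        return False  # the agent never saved an output artifact
    try:
        saved = json.loads(artifact_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # the file exists but is not valid JSON
    if not isinstance(saved, dict):
        return False
    # Extra fields in the artifact are allowed; missing or wrong ones fail.
    return all(key in saved and saved[key] == value for key, value in expected.items())
```

Under this kind of check, two agents that take different click paths but save the same file receive the same score, and an agent that performs every step but never saves receives a fail.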
Why the Snorkel Data Series
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
Expert review
Expert contributors author every task; subject-matter experts verify accuracy, clarity, and environment integrity.
Programmatic validation
Automated checks ensure manifest completeness, deterministic pass rates, and task uniqueness.
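To make two of these checks concrete, here is a minimal sketch of a manifest completeness check and a task uniqueness check. The manifest schema (`task_id`, `platform`, `category`, `instruction`) and the whitespace-and-case normalization used for duplicate detection are assumptions for illustration, not the actual CUA-Bench+ pipeline.

```python
import hashlib

# Hypothetical required fields; the real manifest schema is not specified here.
REQUIRED_FIELDS = ("task_id", "platform", "category", "instruction")


def missing_fields(manifest: dict) -> list[str]:
    """Return the required fields that are absent or empty in a manifest."""
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]


def fingerprint(manifest: dict) -> str:
    """Hash the instruction after lowercasing and collapsing whitespace."""
    normalized = " ".join(str(manifest.get("instruction", "")).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(manifests: list[dict]) -> list[tuple[str, str]]:
    """Return (first_task_id, duplicate_task_id) pairs with matching fingerprints."""
    first_seen: dict[str, str] = {}
    duplicates = []
    for manifest in manifests:
        task_id = manifest.get("task_id", "")
        key = fingerprint(manifest)
        if key in first_seen:
            duplicates.append((first_seen[key], task_id))
        else:
            first_seen[key] = task_id
    return duplicates
```

An exact-match fingerprint only catches near-verbatim repeats; catching paraphrased duplicates would need a similarity measure, which this sketch leaves out.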
Difficulty calibration
Difficulty labels are validated against performance data from frontier LLMs.
Distribution guardrails
Submissions are filtered to maintain balanced distribution across categories, platforms, and complexity levels.