Train & evaluate frontier agents on the professional work the economy runs on
GDPval+ is Snorkel’s data series for training AI agents and evaluating whether they can do a broad set of professional jobs across domains, roles, and industries.
Developed by Snorkel's AI Data Research Lab, GDPval+ delivers longer-horizon tasks, drawn from real workflows, that produce tangible deliverables such as a document, spreadsheet, or presentation. With tasks curated by domain experts across all 20 O*NET sectors and 100+ occupations, you can cover up to 100% of the U.S. labor market.
Sector coverage includes:
- Manufacturing: Plant operations, supply chain, and engineering deliverables
- Professional, scientific & technical services: Legal, market research, microbiology, and information security work products
- Health care & social assistance: Clinical formulations, authorization packages, and care-team workflows
- Educational services: Curriculum design, assessment, and instructional materials
- Construction: Project planning, site inspection, and compliance deliverables
- Other services: Repair, personal services, and civic-organization workflows
- Public administration: Government and emergency-management deliverables
- Retail trade: Retail operations, merchandising, and customer-service workflows
- Transportation & warehousing: Logistics planning, dispatch, and warehouse-operations tasks
- Arts, entertainment & recreation: Creative production and venue-operations workflows
Plus 10 more sectors covering the rest of the U.S. digital labor market.
GDPval+ is intentionally calibrated to stress-test state-of-the-art agents
- Empirical difficulty tiers measured against current frontier models, not author judgment
- A frontier tier where today's leading models score below 20%
- Every task graded against a weighted, expert-authored rubric
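To make the idea of a weighted rubric concrete, here is a minimal sketch of how a score could be computed from one. It is an illustration under stated assumptions, not GDPval+'s actual grading code: the `Criterion` fields, the example criteria, the weights, and the pass/fail treatment of each criterion are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion label
    weight: float  # relative importance assigned by the rubric author
    met: bool      # grader's judgment for this deliverable


def weighted_score(criteria):
    """Return the share of total rubric weight that was met, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight for c in criteria if c.met) / total


# Hypothetical rubric for a spreadsheet deliverable.
rubric = [
    Criterion("Totals reconcile with source data", 3.0, True),
    Criterion("All required sections present", 2.0, True),
    Criterion("Formatting follows the template", 1.0, False),
]
score = weighted_score(rubric)  # 5 of 6 weight units met
```

Weighting lets a missed high-stakes criterion cost more than a cosmetic one, which a simple count of passed checks would not capture.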
Why the Snorkel Data Series
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
Expert review
Expert contributors author every task; subject-matter experts then review each one against acceptance criteria and check its metadata for accuracy.
Programmatic checks
Automated validation ensures task uniqueness, minimum resource requirements, and rubric quality.
Difficulty validation
Task difficulty labels are validated against observed accuracy from a panel of frontier models.
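As a rough sketch of what validating a label against observed accuracy might look like, the example below derives a tier from a panel's scores and compares it with the assigned label. Only the "below 20%" frontier threshold comes from this page; the other cut-off, the tier names `hard` and `standard`, and the use of a simple mean across the panel are assumptions made for illustration.

```python
from statistics import mean

FRONTIER_MAX = 0.20  # from the page: leading models score below 20%
HARD_MAX = 0.50      # hypothetical cut-off, not stated by the source


def empirical_tier(panel_scores):
    """Map the panel's mean accuracy on a task to a difficulty tier."""
    accuracy = mean(panel_scores)
    if accuracy < FRONTIER_MAX:
        return "frontier"
    if accuracy < HARD_MAX:
        return "hard"      # assumed tier name
    return "standard"      # assumed tier name


def label_is_valid(label, panel_scores):
    """True when the assigned label matches the empirically derived tier."""
    return label == empirical_tier(panel_scores)


# Hypothetical rubric scores from three frontier models on one task.
panel = [0.10, 0.15, 0.20]
tier = empirical_tier(panel)
```

A task labeled `hard` with these panel scores would fail the check and be relabeled or sent back for review.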
Distribution guardrails
New submissions are accepted only if they maintain dataset balance across task types, difficulty levels, and categories.
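One way such a guardrail could work is to cap the share any single value may hold along each balanced dimension. The sketch below is an assumption-laden illustration: the field names, the 50% cap, and the sample tasks are hypothetical, and the real acceptance rule may differ.

```python
from collections import Counter

# Assumed field names for the three dimensions named in the text.
DIMENSIONS = ("task_type", "difficulty", "category")


def keeps_balance(dataset, candidate, max_share=0.5):
    """Accept the candidate only if no value it carries would exceed
    max_share of the dataset along any balanced dimension."""
    new_size = len(dataset) + 1
    for dim in DIMENSIONS:
        counts = Counter(task[dim] for task in dataset)
        counts[candidate[dim]] += 1
        if counts[candidate[dim]] / new_size > max_share:
            return False
    return True


# Hypothetical existing tasks.
dataset = [
    {"task_type": "spreadsheet", "difficulty": "standard", "category": "Manufacturing"},
    {"task_type": "document", "difficulty": "standard", "category": "Construction"},
    {"task_type": "presentation", "difficulty": "frontier", "category": "Retail trade"},
]
skewed = {"task_type": "document", "difficulty": "standard", "category": "Retail trade"}
balanced = {"task_type": "document", "difficulty": "frontier", "category": "Public administration"}
```

Here `skewed` would push `standard` tasks to three of four and be rejected, while `balanced` stays within the cap on every dimension.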