SNORKEL DATA SERIES //

Scientific Reasoning //

Mu-STEM

Stress-test & train frontier models on multimodal scientific reasoning

Mu-STEM is scientific reasoning data that goes beyond recall. Tasks combine text with figures, diagrams, PDFs, and data, requiring models to derive outcomes from multimodal inputs that resemble the materials working scientists actually use.

Developed by Snorkel's AI Data Research Lab, Mu-STEM tasks require multi-step derivation and are graded on methodological rigor, not just the final answer. These are the building blocks of genuine scientific discovery.

Domain coverage includes

  • Mathematics
    Algebra, calculus, probability, statistics, topology, and applied math

  • Physics
    Classical mechanics, quantum physics, thermodynamics, astrophysics, and optics

  • Chemistry
    Physical, organic, inorganic, analytical, biochemistry, and materials chemistry

  • Biology
    Genetics, molecular & cellular biology, neuroscience, immunology, and biotechnology

  • Computer Science
    Algorithms, systems & architecture, AI/ML, databases, and networking

  • Engineering
    Mechanical, electrical, civil, chemical, materials, biomedical, and aerospace

Mu-STEM is intentionally calibrated to stress-test state-of-the-art models

Built for frontier model evaluation:

  • Tiered difficulty across Core, Advanced and Frontier
  • Frontier tasks where at least 4 of 5 leading models fail
  • Evaluated against GPT 5.2, Gemini 3.1 Pro, Claude Sonnet 4.6, Qwen3 VL 235B, and Pixtral Large 25.02; graded by Claude Opus 4.6
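
As a rough illustration of the Frontier criterion above, the following Python sketch flags a task as Frontier when at least 4 of the evaluated models fail it. The function name, the pass/fail mapping, and the example outcomes are hypothetical. The page does not describe how tiering is actually implemented or how Core and Advanced tasks are separated.

```python
from typing import Mapping

# Threshold taken from the page: "at least 4 of 5 leading models fail".
FRONTIER_MIN_FAILURES = 4


def is_frontier_task(
    passed_by_model: Mapping[str, bool],
    min_failures: int = FRONTIER_MIN_FAILURES,
) -> bool:
    """Return True when at least `min_failures` models failed the task.

    `passed_by_model` maps a model name to whether the grader accepted
    that model's answer (an assumed input format, not a documented one).
    """
    failures = sum(1 for passed in passed_by_model.values() if not passed)
    return failures >= min_failures


# Hypothetical grading outcome for a single task (invented for illustration).
example_outcome = {
    "GPT 5.2": False,
    "Gemini 3.1 Pro": False,
    "Claude Sonnet 4.6": True,
    "Qwen3 VL 235B": False,
    "Pixtral Large 25.02": False,
}

print(is_frontier_task(example_outcome))
```

In this invented example four of the five models fail, so the task meets the stated Frontier threshold.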

Building blocks of genuine scientific discovery.

Why the Snorkel Data Series

High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

01

Expert review

Every submission passes an independent expert review stage, ensuring each datapoint meets the specified requirements.

02

Automated evaluators

Automatic schema and agentic validation are provided to both expert submitters and reviewers.

03

Preference validation

Each reference answer passes a preference validation stage, ensuring it meets or exceeds the quality of frontier LLM-generated answers.

04

Rubric alignment

Submitted rubrics must reproduce expert preference feedback before acceptance.

Train models on the building blocks of scientific discovery with the Snorkel Data Series