SNORKEL DATA SERIES //
Scientific Reasoning //
Mu-STEM
Stress-test & train frontier models on multimodal scientific reasoning
Mu-STEM is scientific reasoning data that goes beyond recall. Tasks combine text with figures, diagrams, PDFs, and data, requiring models to derive outcomes from multimodal inputs that resemble the materials working scientists actually use.
Developed by Snorkel's AI Data Research Lab, Mu-STEM tasks require multi-step derivation and are graded on methodological rigor, not just the final answer. These are the building blocks of genuine scientific discovery.
Domain coverage includes
Mathematics
Algebra, calculus, probability, statistics, topology, and applied math
Physics
Classical mechanics, quantum physics, thermodynamics, astrophysics, and optics
Chemistry
Physical, organic, inorganic, analytical, biochemistry, and materials chemistry
Biology
Genetics, molecular & cellular biology, neuroscience, immunology, and biotechnology
Computer Science
Algorithms, systems & architecture, AI/ML, databases, and networking
Engineering
Mechanical, electrical, civil, chemical, materials, biomedical, and aerospace
Mu-STEM is intentionally calibrated to stress-test state-of-the-art models
Built for frontier model evaluation:
- Tiered difficulty across Core, Advanced, and Frontier
- Frontier tasks where at least 4 of 5 leading models fail
- Evaluated against GPT 5.2, Gemini 3.1 Pro, Claude Sonnet 4.6, Qwen3 VL 235B, and Pixtral Large 25.02; graded by Claude Opus 4.6
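The Frontier criterion above reduces to a simple count over the five-model panel. The following is a minimal sketch of that rule only, not Snorkel's actual tooling: the function name, the constants, and the pass/fail input format are assumptions made for illustration, and the thresholds for the Core and Advanced tiers are not stated here, so they are left out.

```python
from typing import Mapping

# Taken from the text: "at least 4 of 5 leading models fail".
PANEL_SIZE = 5
FRONTIER_MIN_FAILURES = 4


def is_frontier_task(passed_by_model: Mapping[str, bool]) -> bool:
    """Return True when a task meets the stated Frontier criterion.

    `passed_by_model` maps a model name to whether that model's answer
    was graded as passing. This input shape is a hypothetical choice.
    """
    if len(passed_by_model) != PANEL_SIZE:
        raise ValueError(f"expected results for {PANEL_SIZE} models")
    failures = sum(1 for passed in passed_by_model.values() if not passed)
    return failures >= FRONTIER_MIN_FAILURES


# Made-up results with placeholder model names, for illustration only.
example = {
    "model_a": False,
    "model_b": False,
    "model_c": True,
    "model_d": False,
    "model_e": False,
}
print(is_frontier_task(example))
```

In this made-up example four of the five placeholder models fail, so the task would count as Frontier under the stated rule.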
Why the Snorkel Data Series
Expert-led validation
Every task is built and validated through a multi-layer quality pipeline.
Expert review
Every submission passes an independent expert review stage, ensuring each datapoint meets the specified requirements.
Automated evaluators
Automated schema checks and agentic validation are available to both expert submitters and reviewers.
Preference validation
Each reference answer passes a preference validation stage, ensuring it meets or exceeds the quality of frontier LLM-generated answers.
Rubric alignment
Submitted rubrics must reproduce expert preference feedback before acceptance.
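One way to picture the rubric alignment check is as an agreement rate between rubric scores and expert preference pairs. The sketch below rests on assumptions throughout: the pair format, the function names, the strict-greater comparison, and the `min_agreement` threshold are all hypothetical, and the toy keyword rubric merely stands in for a real one. It does not describe how Snorkel implements this stage.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical format: (expert-preferred answer, rejected answer).
PreferencePair = Tuple[str, str]


def rubric_agreement(
    score: Callable[[str], float], pairs: Sequence[PreferencePair]
) -> float:
    """Fraction of pairs where the rubric scores the preferred answer higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    agree = sum(1 for preferred, rejected in pairs if score(preferred) > score(rejected))
    return agree / len(pairs)


def rubric_accepted(
    score: Callable[[str], float],
    pairs: Sequence[PreferencePair],
    min_agreement: float = 1.0,  # assumed threshold, not from the source
) -> bool:
    """Accept the rubric only if it reproduces enough expert preferences."""
    return rubric_agreement(score, pairs) >= min_agreement


def toy_rubric(answer: str) -> float:
    """Toy stand-in rubric: one point per required element mentioned."""
    text = answer.lower()
    return float(sum(term in text for term in ("derivation", "units")))


# Made-up preference pairs, for illustration only.
pairs = [
    ("Full derivation with units", "Final number only"),
    ("Derivation shown", "Guess"),
]
print(rubric_accepted(toy_rubric, pairs))
```

A rubric that ranks any rejected answer at or above the expert-preferred one would fall below the default threshold in this sketch and be sent back for revision.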