Stress-test & train frontier models on multimodal scientific reasoning

Mu-STEM is scientific reasoning data that goes beyond recall. Tasks combine text with figures, diagrams, PDFs, and data, requiring models to derive outcomes from the kinds of multimodal inputs that working scientists actually use.

Developed by Snorkel's AI Data Research Lab, Mu-STEM tasks require multi-step derivation and are graded on methodological rigor, not just the final answer. These are the building blocks of genuine scientific discovery.


Domain coverage includes

Mathematics
Algebra, calculus, probability, statistics, topology, and applied math

Physics
Classical mechanics, quantum physics, thermodynamics, astrophysics, and optics

Chemistry
Physical, organic, inorganic, analytical, biochemistry, and materials chemistry

Biology
Genetics, molecular & cellular biology, neuroscience, immunology, and biotechnology

Computer Science
Algorithms, systems & architecture, AI/ML, databases, and networking

Engineering
Mechanical, electrical, civil, chemical, materials, biomedical, and aerospace

Mu-STEM is intentionally calibrated to stress-test state-of-the-art models

Built for frontier model evaluation:

Tiered difficulty across Core, Advanced, and Frontier
Frontier tasks where at least 4 of 5 leading models fail
Evaluated against GPT 5.2, Gemini 3.1 Pro, Claude Sonnet 4.6, Qwen3 VL 235B, and Pixtral Large 25.02; graded by Claude Opus 4.6
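
As a rough illustration of the Frontier criterion above, the check can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Snorkel's actual pipeline: the function name `is_frontier`, the per-model pass/fail mapping, and the placeholder model names are all hypothetical, and the example results are made up purely to show the threshold. The thresholds for the Core and Advanced tiers are not stated on this page, so they are not modeled.

```python
from typing import Mapping

# Threshold taken from the text: "at least 4 of 5 leading models fail".
FRONTIER_MIN_FAILURES = 4


def is_frontier(passed_by_model: Mapping[str, bool],
                min_failures: int = FRONTIER_MIN_FAILURES) -> bool:
    """Return True when enough evaluated models failed the task.

    `passed_by_model` maps a model name to whether the grader judged
    that model's response as passing (hypothetical input format).
    """
    failures = sum(1 for passed in passed_by_model.values() if not passed)
    return failures >= min_failures


# Illustrative, made-up results for one task (placeholder model names).
example_results = {
    "model_a": False,
    "model_b": False,
    "model_c": True,
    "model_d": False,
    "model_e": False,
}

print(is_frontier(example_results))
```

Here four of the five placeholder models fail, so the task meets the Frontier bar; if a second model passed, it would not.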
