Terminal-Bench Science: Contribute your scientific workflows as tasks for AI Agents

May 27, 2026

The Terminal-Bench team is extending Terminal-Bench to complex scientific workflow tasks in the natural sciences.


What is Terminal-Bench Science?

Terminal-Bench Science is a benchmark for evaluating AI agents on real computational workflows from scientific research. It builds on Terminal-Bench, which has been adopted by frontier labs including Anthropic, OpenAI, and Google DeepMind and has helped drive progress in AI agents on software engineering tasks by defining what those labs measure and optimize for. Terminal-Bench Science brings the same approach to the natural sciences.

Why do we need Terminal-Bench Science?

Most existing “AI for Science” benchmarks test textbook knowledge, not real workflows. Terminal-Bench Science closes this gap with real computational workflow tasks from research labs, evaluated in containerized environments with programmatic verification. The goal is to give scientists a direct voice in shaping AI progress: domain experts contribute scientific workflows as benchmark tasks, frontier labs evaluate and improve their AI agents against them, and the improved AI agents with stronger scientific capabilities flow back as better tools for researchers.

Domain Coverage

Terminal-Bench Science is targeting 100+ benchmark tasks across the life sciences, physical sciences, and earth sciences, but is also open to tasks from the mathematical sciences and other domains with computational workflows.

Domain | Areas
Life Sciences | Biology, Medicine, Neuroscience
Physical Sciences | Physics, Chemistry, Astronomy, Materials Science
Earth Sciences | Atmospheric Science, Geoscience, Water Science
Mathematical Sciences | Applied Mathematics, Statistics, Autoformalization
Other | Interdisciplinary Sciences, Computational Sciences, Engineering Sciences, etc.

Why contribute?

  • Make AI better at your science: Frontier labs optimize for what benchmarks measure. Your tasks directly incentivize them to improve their AI systems on the scientific problems in your domain.
  • Gain experience in agentic evaluation: Get hands-on with evaluating frontier AI agents — learn how to design rigorous benchmarks and see firsthand where today’s best models succeed and fail on real scientific work.
  • Become a co-author: Contributors with merged tasks receive co-authorship on the Terminal-Bench Science paper.

What the Terminal-Bench team looks for

The Terminal-Bench team is looking for complex, real-world computational workflows from practicing scientists across the natural sciences. Tasks should meet three key criteria:

  1. Scientifically grounded. Tasks should reflect computational workflows from real research in the natural sciences, ideally drawn from active research projects or designed to replicate published results within a domain of expertise.
  2. Objectively verifiable. Solutions must be programmatically checkable through deterministic pytest-based evaluation. Open-ended tasks such as hypothesis generation or literature review are not suitable.
  3. Genuinely difficult. Tasks should be challenging for today’s most capable AI agents and expose meaningful capability gaps. The target difficulty level corresponds to an expected solve rate of approximately 10–20% at release.

Tasks follow the Harbor Task Format. Check out example tasks for reference.

How to contribute

The Terminal-Bench team follows a curated contribution process to maintain quality:

  1. Connect: Join the Discord, introduce yourself in #tb-science, and pitch your task idea in #tb-science-task-ideas for early feedback. Follow #tb-science-announcements for updates and weekly meetings (Mondays, 9am PT).
  2. Propose: When you’re ready, submit your idea via the Task Proposal Form. Proposals are posted to the Task Proposal Board and in #tb-science-task-proposals. An LLM judge evaluates each proposal against the Task Proposal Rubric, and human reviewers use that evaluation to decide on approval and guide you toward implementation.
  3. Build: Once your proposal is approved, build the task in the Harbor Task Format and submit a Pull Request following the Contributing Guide. Your implementation is evaluated against the Task Implementation Rubric, and human reviewers also assess difficulty, scientific quality, and overall fit. Review and iteration continue until the task is ready to merge.

Once merged, the Terminal-Bench team runs frontier AI agents against all merged tasks to calibrate difficulty. Tasks that pass are included in the official Terminal-Bench Science release on the Terminal-Bench Benchmarks and Terminal-Bench Leaderboards.
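
As a rough illustration of the arithmetic behind a solve rate, the sketch below computes the fraction of successful runs for one task and compares it with the 10–20% target mentioned above. The function names and the hard cutoffs are assumptions made for this example; the post does not specify how the team's calibration is implemented or how strictly the "approximately 10–20%" band is applied.

```python
def solve_rate(outcomes: list[bool]) -> float:
    """Fraction of agent runs that solved the task (True means solved)."""
    if not outcomes:
        raise ValueError("need at least one run to compute a solve rate")
    return sum(outcomes) / len(outcomes)


def in_target_band(rate: float, low: float = 0.10, high: float = 0.20) -> bool:
    """Check a solve rate against an assumed hard 10-20% band (inclusive)."""
    return low <= rate <= high


# Hypothetical example: 3 successes out of 20 runs.
runs = [True] * 3 + [False] * 17
rate = solve_rate(runs)
print(f"solve rate: {rate:.0%}, in target band: {in_target_band(rate)}")
```

A task that every run solves, or that no run solves, would fall outside this band under the assumed cutoffs.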

Deadline

Tasks must be submitted and merged by August 17, 2026. Starting early is highly recommended — most tasks require a few rounds of feedback and iteration before they’re ready to merge.

Resources

Join the Discord and reach out to @stevendi11 on Discord or by email at stevendi@stanford.edu to get involved. Key channels: #tb-science for general discussion, #tb-science-announcements for project updates, #tb-science-task-ideas for quick early feedback on ideas, and #tb-science-task-proposals for submitted proposals, automated reviews, and reviewer feedback. You can also join the weekly meeting every Monday at 9am PT.

Acknowledgements

Terminal-Bench Science is an open academic collaboration hosted by Stanford University and the Laude Institute. As part of the Terminal-Bench franchise, it is built by the Terminal-Bench & Harbor Framework team together with scientific contributors, and it is supported by Snorkel AI through the Open Benchmarks Grants program.

Frequently Asked Questions

What is Terminal-Bench Science?

Terminal-Bench Science is a benchmark that evaluates AI agents on real computational workflows from scientific research. It extends Terminal-Bench, the benchmark adopted by frontier labs including Anthropic, OpenAI, and Google DeepMind, to the natural sciences, with tasks run in containerized environments and checked through programmatic verification.

Who can contribute?

Practicing scientists across the natural sciences (life sciences, physical sciences, earth sciences, and beyond) are invited to contribute real computational workflows as benchmark tasks. You do not need to be affiliated with the core team; the project is an open academic collaboration.

What makes a strong task?

Strong tasks are scientifically grounded (drawn from real research workflows), objectively verifiable (checkable through deterministic pytest-based evaluation), and genuinely difficult, targeting roughly a 10–20% solve rate for today’s most capable AI agents at release.

Do contributors receive co-authorship?

Yes. Contributors whose tasks are merged receive co-authorship on the Terminal-Bench Science paper, along with hands-on experience evaluating frontier AI agents.

What is the submission deadline?

Tasks must be submitted and merged by August 17, 2026. Starting early is recommended, since most tasks need a few rounds of feedback before they are ready to merge.

How is Snorkel AI involved?

Snorkel AI supports Terminal-Bench Science through its Open Benchmarks Grants program, which funds open-source AI research and rigorous, reproducible evaluation.
