2026: The year of environments

December 10, 2025 • 4 min read • Snorkel Team

Our NeurIPS 2025 retrospective

We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger—and we couldn’t be more grateful to the research community for pushing these ideas forward.

The evolution we’ve witnessed

When we first brought Snorkel AI research to NeurIPS back in 2019, data-centric AI barely registered as a topic. Fast forward to 2025, and there’s an entire section of the conference floor dedicated to it. That kind of shift doesn’t happen by accident. It’s the result of countless researchers recognizing the central role that high-quality data plays in getting the best outcomes from AI.

What stood out this year

A few themes dominated the conversations we had and the talks we attended.

2026 will be the year of environments. Through talks like Aksel Joonas Reedi’s presentation on OpenEnv, Mike Merrill’s discussion of Terminal-Bench 2.0, and Grégoire Mialon’s talk on ARE, we saw that the community is getting serious about building diverse, scalable environments for evaluations and RL. The insight that environments provide a natural curriculum for scaling complexity looks likely to shape a lot of work in 2026. Noteworthy papers include:

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers—https://arxiv.org/abs/2508.20453
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks— https://arxiv.org/abs/2506.11791
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks— https://arxiv.org/abs/2412.14161
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments— https://arxiv.org/pdf/2506.00739

Data still needs human expertise. Tools and techniques remain vital, but the trend that stood out is a growing recognition that data quality can make or break results, and that working with human experts is still the best way to produce top-quality data. We found some very interesting datasets among the accepted papers this year:

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages—https://arxiv.org/abs/2505.11475
Sheetpedia: A 300K-Spreadsheet Corpus for Spreadsheet Intelligence and LLM Fine-Tuning—https://openreview.net/pdf?id=4vLYwlA3X5

Rubrics are getting more principled. We saw exciting work on more systematic factorization of evaluation criteria, new human-in-the-loop paradigms for data development, and frameworks for continual learning. In Liangchen Luo’s talk, How to Develop in the Agentic Era, the emphasis on building evals before training strongly reinforces the notion that well-written rubrics and evaluation criteria are of utmost importance. Two papers of note here:

Measuring what Matters: Construct Validity in Large Language Model Benchmarks— https://arxiv.org/abs/2511.04703
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents—https://arxiv.org/abs/2505.20411

Our events

Snorkel Social

Snorkel AI cofounder and CEO Alex Ratner, cofounder and Chief Scientist Fred Sala, and the broader Snorkel research team hosted an intimate evening of whiskey, small bites, and research-driven conversation at The Whiskey House in San Diego. We’re so grateful for everyone who joined us!

SEA Workshop (sponsorship)

We want to thank the SEA (Scaling Environments for Agents) workshop organizers for an excellent day, with highly engaging invited talks and poster sessions that drew a great deal of interest. We were pleased to sponsor this event alongside fellow Diamond sponsor Inclusion AI and Platinum sponsors Vmax and Sonic Jobs.

Award winners

Outstanding papers:

GEM: A Gym for Agentic LLMs—https://arxiv.org/abs/2510.01051
RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines—https://arxiv.org/abs/2502.00595

Outstanding posters:

Go-Browse: Training Web Agents with Structured Exploration—https://arxiv.org/abs/2506.03533
Scaling Open-Ended Reasoning to Predict the Future (Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping)

Thank you

To everyone who shared their work, challenged our thinking, and stopped by to chat—thank you. The progress in this field happens because researchers are willing to publish their failures alongside their successes, and build on each other’s ideas.

We’re heading into 2026 energized by what we saw. If the trends at NeurIPS are any indication, it’s going to be a big year for environments, evaluation, and data-centric approaches to AI development. See you at the next one. And in the meantime, if you’re interested in collaborating with us on building impactful environments or need expert-verified data developed in agent environments, come talk to us!
