Vincent Sunn Chen

OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWORLD 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWORLD 1.0. OSWORLD 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in...

Research Paper

Jun 26, 2026

XLANG Lab, with contributions from Snorkel AI’s Zhengyang Qi, Vincent Sunn Chen, and Frederic Sala

Blog

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026

Vincent Sunn Chen

Agents’ Last Exam

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub...

Research Paper

Jun 08, 2026

Yiyou Sun, Dawn Song, et al. (UC Berkeley RDI) with contributions from Snorkel AI’s Amanda Dsouza and Vincent Sunn Chen

Blog

Benchtalks #2: The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. Highlights More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com. More from John Yang: Publications and writing at john-b-yang.github.io. Snorkel…

Jun 03, 2026

Vincent Sunn Chen

Blog

Benchmarks should shape the frontier, not just measure it

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks drive how the field allocates research effort, the bar for benchmarks has risen as well. Here, we share what’s now table stakes for useful benchmarks, and what separates the ones…

Apr 07, 2026

Vincent Sunn Chen

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

To kick off our inaugural Benchtalks, a series dedicated to the researchers building these measurement toolkits, Snorkel AI co-founder Vincent Sunn Chen sat down with Alex Shaw, Founding MTS at Laude Institute and co-creator of Terminal-Bench and Harbor. Highlights More on Terminal-Bench: See the leaderboard and the catalog of tasks at tbench.ai. Explore Harbor: Learn how to scale your agent…

Mar 31, 2026

Vincent Sunn Chen

Blog

Closing the Evaluation Gap in Agentic AI

Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real — backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will find hesitation. The reason: our ability to…

Feb 11, 2026

Vincent Sunn Chen
