Resource library

Explore our complete library of resources including blogs, benchmarks, research papers, and more.

Blog

Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

Announcing a $3M commitment to launch Open Benchmarks Grants

Jacob Fleisig

July 8, 2026

Blog

Why coding agents need better data, evals, and environments

Justin Bauer

May 11, 2026

Blog

Closing the Evaluation Gap in Agentic AI

Vincent Sunn Chen

February 11, 2026

Blog

Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

Kobie Crawford

September 30, 2025

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

Bhavishya Pohani

March 30, 2026

Blog

The science of rubric design

Charles Dickens

September 11, 2025

Blog

Benchtalks #3: We taught AI everything except how to learn

Featuring Parth Asawa (Continual Learning Bench)

Vincent Sunn Chen

June 25, 2026

Blog

Claude Opus 5: Performance and Error Analysis on Frontier Coding Tasks

Anthropic’s Claude Opus 5 recently debuted in second place overall on the current Senior SWE-Bench leaderboard, behind Fable 5. It also achieves the highest score of any evaluated model in the benchmark’s Bug & Performance Investigation category, reinforcing the rapid progress frontier coding models continue to make on increasingly realistic software engineering tasks. Just as notable, Opus 5 reaches…

Jul 27, 2026 • Ankit Aich

Blog

Senior SWE-Bench: Evaluating Coding Agents Like Senior Engineers

At our latest Snorkel AI Reading Group, Henry Ehrenberg presented Senior SWE-Bench, an open-source, Harbor-compatible benchmark for evaluating coding agents on realistic, senior-level software engineering work. Its 100 tasks, with 50 public and 50 kept private to mitigate contamination, are sourced from real pull requests across 12 production repositories and cover complex features, migrations, bugs, and performance issues. Senior SWE-Bench…

Jul 16, 2026 • Snorkel Team

Blog

Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

We evaluated Grok 4.5 on GDPval+, Snorkel’s expert-created dataset of professional workplace reasoning tasks from across the economy. To compare performance against other frontier models, we ran the evaluation alongside GPT 5.5 and Claude Opus 4.8. Overall, Grok 4.5 demonstrated the strongest performance. GDPval+ is part of the Snorkel Data Series (SDS), Snorkel’s portfolio of expert-curated…

Jul 08, 2026 • Jacob Fleisig

Case study

From hours to seconds on CLO contract review with 94% end user acceptance

A top 10 US bank manages CLO portfolios totaling billions in assets, each governed by contracts up to 500 pages.

Jul 01, 2026 • Snorkel Team

Case study

Conversational, decision-grade responses in 15 seconds

A global media intelligence firm analyzes hundreds of millions of sources daily – from public news, social, and broadcast to proprietary analyst-curated databases – to help large enterprise clients manage communications, reputation, and strategic decision-making. Their competitive advantage is the layer on top of publicly available data: in-house human editorial teams, proprietary scoring and analytics frameworks, and years of analyst judgment refined into decision-grade intelligence. When a crisis signal is building or a competitor’s narrative is gaining traction, speed and accuracy matter enormously. Historically, getting an answer meant waiting for a human analyst to manually aggregate across those sources: a process measured in hours, not seconds.

Jul 01, 2026 • Snorkel Team

Blog

Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 30, 2026 • Snorkel Team

Blog

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026 • Chris Glaze

Research Paper

OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in...

Jun 26, 2026 • XLANG Lab, with contributions from Snorkel AI’s Zhengyang Qi, Vincent Sunn Chen, and Frederic Sala

Blog

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026 • Vincent Sunn Chen
