Evaluation Archives

Evaluation

AI evaluation systematically measures a model’s performance on tasks. Classically, this meant applying metrics such as accuracy or precision to clear, discrete numerical or categorical targets. Modern evaluation also assesses the output of generative models to ensure the content they create stays within an organization’s standards and guidelines.
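To make the classical case concrete, here is a minimal sketch of accuracy and precision computed over categorical targets. The function names, the `positive` parameter, and the sample labels are illustrative assumptions, not taken from any particular evaluation library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the target label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the items predicted as `positive`, the fraction that truly are."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep the true labels of every item the model flagged as positive.
    flagged = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not flagged:
        # Convention assumed here: no positive predictions gives 0.0.
        return 0.0
    true_positives = sum(t == positive for t in flagged)
    return true_positives / len(flagged)


# Hypothetical binary labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)    # 4 of 6 predictions match
prec = precision(y_true, y_pred)  # 2 of 3 positive predictions are correct
```

Metrics like these depend on a single correct answer per example, which is why they suit discrete targets. Free-form generative output usually has no such single answer, so it tends to be assessed against standards and guidelines instead.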