Benchmarks for what frontier AI hasn't solved
Agents’ Last Exam
Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks out of a corpus of 1,500+ tasks, sourced and validated by 300+ industry experts.
Continual Learning Bench
Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.
SlopCode Bench
Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.
Open Benchmarks Grants
Featured Collaborations
Computer tasks
Natural sciences
Legal agents
Harvey’s Long Horizon Legal Agent Benchmark
Built to evaluate and improve agent capabilities for supporting legal work.
Evaluation methods
JudgmentBench
Compares rubric-based and preference-based evaluation for judging output quality.
Three core dimensions where today's benchmarks fall short



