Reference-specific unlearning metrics can hide the truth: A reality check
Abstract
Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially because existing metrics often rely on specific reference outputs. The widely used forget quality metric from the TOFU benchmark compares likelihoods over paraphrased answers but is highly sensitive to the choice of reference answers, potentially obscuring whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence: how closely an unlearned model aligns functionally with the retain-only model. To this end, we propose Functional Alignment for Distributional Equivalence (FADE), a novel distribution-level metric that compares distributions of textual outputs. FADE offers a more robust and principled approach to evaluating unlearning by comparing model behavior rather than isolated responses.
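To make the distribution-level perspective concrete, the sketch below compares an unlearned model against a retain-only model by averaging a symmetric KL divergence between their next-token distributions on shared prompts, rather than scoring against fixed reference answers. This is an illustrative assumption only, not the actual FADE metric: the checkpoint paths, the prompt set, and the mean_symmetric_kl helper are hypothetical, and FADE's own definition differs from this simple divergence.

```python
# Illustrative sketch only: a generic distribution-level comparison between an
# unlearned model and a retain-only model. NOT the FADE metric itself.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_symmetric_kl(model_a, model_b, tokenizer, prompts, device="cpu"):
    """Average symmetric KL between next-token distributions of two models."""
    model_a.eval()
    model_b.eval()
    divergences = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            logits_a = model_a(**inputs).logits  # (1, seq_len, vocab)
            logits_b = model_b(**inputs).logits
        log_p = F.log_softmax(logits_a, dim=-1)
        log_q = F.log_softmax(logits_b, dim=-1)
        # Symmetric KL, summed over the vocabulary and averaged over positions.
        kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
        kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)
        divergences.append(0.5 * (kl_pq + kl_qp).mean().item())
    return sum(divergences) / len(divergences)


# Hypothetical usage: lower divergence indicates closer functional alignment
# between the unlearned model and the retain-only model on the forget prompts.
# unlearned = AutoModelForCausalLM.from_pretrained("path/to/unlearned")
# retain_only = AutoModelForCausalLM.from_pretrained("path/to/retain_only")
# tokenizer = AutoTokenizer.from_pretrained("path/to/retain_only")
# score = mean_symmetric_kl(unlearned, retain_only, tokenizer, forget_prompts)
```

In contrast to a reference-specific score, a comparison of this kind depends only on how the two models behave, which is the property the abstract argues an unlearning metric should capture.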