Diving Into SliceLine – Machine Learning Whiteboard (MLW) Open-source Series
Earlier this year, we started our machine learning whiteboard (MLW) series, an open-invite space to brainstorm ideas and discuss the latest papers, techniques, and workflows in the AI space. We emphasize an informal and open environment to everyone interested in learning about machine learning.In this episode, Kaushik Shivakumar dives into “SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging,” author by Svetlana Sagadeeva and Matthias Boehm, presented at SIGMOD 2021, receiving a Best Paper Award for Data Science.This episode is part of the #MLwhiteboard video series hosted by Snorkel AI. Check out the episode here:
Abstract:
Slice finding—a recent work on debugging machine learning (ML) models—aims to find the top-K data slices (e.g., conjunctions of predicates such as gender female and degree Ph.D.), where a trained model performs significantly worse than on the entire training/test data. These slices may be used to acquire more data for the problematic subset, add rules, or otherwise improve the model. In contrast to decision trees, the general slice finding problem allows for overlapping slices. The resulting search space is huge as it covers all subsets of features and their distinct values. Hence, existing work primarily relies on heuristics and focuses on small datasets that fit in the memory of a single node. In this paper, we address these scalability limitations of slice finding in a holistic manner from both algorithmic and system perspectives. We leverage monotonicity properties of slice sizes, errors, and resulting scores to facilitate effective pruning. Additionally, we present an elegant linear-algebra-based enumeration algorithm, which allows for fast enumeration and automatic parallelization on top of existing ML systems. Experiments with different real-world regression and classification datasets show that effective pruning and efficient sparse linear algebra renders exact enumeration feasible, even for datasets with many features, correlations, and data sizes beyond single node memory.
If you are interested in learning with us, consider joining us at our biweekly ML whiteboard.If you’re interested in staying in touch with Snorkel AI, follow us on Twitter, LinkedIn, Facebook, Youtube, or Instagram, and if you’re interested in joining the Snorkel team, we’re hiring! Please apply on our careers page.