Developing and Managing Systems to Extract Structured Data
Machine Learning Whiteboard (MLW) Open-source Series
Earlier this year, we started our machine learning whiteboard (MLW) series, an open-invite space to brainstorm ideas and discuss the latest papers, techniques, and workflows in the AI space. We keep the sessions informal and open to everyone interested in learning about machine learning.
In this episode, Manan Shah dives into two papers. The first is “Glean: Structured Extractions from Templatic Documents” by Sandeep Tata, Navneet Potti, James B. Wendt, Lauro Beltrao Costa, Marc Najork, and Beliz Gunel, presented at VLDB 2021. The second is “Representation Learning for Information Extraction from Form-like Documents” by Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork, presented at ACL 2020.
This episode is part of the #MLwhiteboard video series hosted by Snorkel AI. Check out the episode here:
Glean: Structured Extractions from Templatic Documents
Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones.
We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges:
- Managing the quality of ground truth data.
- Generating training data for the machine learning model using labeled documents.
- Building tools that help a developer rapidly build and improve a model for a given document type.
Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
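To make the "F1 points" in that claim concrete, here is a minimal sketch of how extraction quality could be scored. It assumes exact-match scoring over (document, field, value) triples and micro-averaging across fields; the paper may score matches differently, and the `field_f1` helper, the field names, and the toy data are all hypothetical. On a 0 to 1 scale, a gain of 5 F1 points corresponds to a difference of 0.05.

```python
def field_f1(predicted, gold):
    """Micro-averaged F1 over (doc_id, field, value) triples.

    Assumption: a prediction counts as correct only on an exact match
    with a gold triple. This is an illustrative scoring rule, not
    necessarily the one used in the Glean paper.
    """
    pred = set(predicted)
    true = set(gold)
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy ground truth for two invoices (hypothetical field names and values).
gold = [
    ("inv-1", "invoice_date", "2021-03-01"),
    ("inv-1", "total_amount", "120.00"),
    ("inv-2", "invoice_date", "2021-04-15"),
    ("inv-2", "total_amount", "89.50"),
]

# Model output with one wrong amount on the second invoice.
predicted = [
    ("inv-1", "invoice_date", "2021-03-01"),
    ("inv-1", "total_amount", "120.00"),
    ("inv-2", "invoice_date", "2021-04-15"),
    ("inv-2", "total_amount", "98.50"),
]

score = field_f1(predicted, gold)
print(f"F1 = {score:.2f} ({score * 100:.0f} F1 points)")
```

With three of four predictions correct, precision and recall are both 0.75, so F1 is 0.75, or 75 F1 points.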
Representation Learning for Information Extraction from Form-like Documents
We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains but are also interpretable, as we show using loss cases.
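The candidate-generation idea in that abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the regex type detectors, the `SCHEMA` mapping, the `Token` and `Candidate` classes, and the nearest-k neighbor rule are all assumptions made for the example. The neural network that scores each candidate from its neighbors is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical type detectors; a real system would use far more robust ones.
TYPE_PATTERNS = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "amount": re.compile(r"^\$?\d+(?:,\d{3})*\.\d{2}$"),
}

# Hypothetical target schema: field name -> field type.
SCHEMA = {"invoice_date": "date", "due_date": "date", "total_amount": "amount"}


@dataclass
class Token:
    text: str
    x: float  # normalized horizontal position on the page
    y: float  # normalized vertical position on the page


@dataclass
class Candidate:
    field: str
    token: Token
    neighbors: list  # (text, dx, dy) tuples relative to the candidate


def generate_candidates(tokens, schema=SCHEMA, k=3):
    """Emit one candidate per (field, token) pair whose type matches.

    Each candidate carries its k nearest tokens with relative offsets,
    which is the kind of input a learned scorer could embed.
    """
    candidates = []
    for tok in tokens:
        for field, field_type in schema.items():
            if not TYPE_PATTERNS[field_type].match(tok.text):
                continue
            others = [t for t in tokens if t is not tok]
            others.sort(key=lambda t: (t.x - tok.x) ** 2 + (t.y - tok.y) ** 2)
            neighbors = [(t.text, t.x - tok.x, t.y - tok.y) for t in others[:k]]
            candidates.append(Candidate(field, tok, neighbors))
    return candidates


tokens = [
    Token("Invoice", 0.10, 0.10),
    Token("Date:", 0.20, 0.10),
    Token("2021-03-01", 0.32, 0.10),
    Token("Total", 0.10, 0.80),
    Token("Due:", 0.18, 0.80),
    Token("$1,250.00", 0.30, 0.80),
]

candidates = generate_candidates(tokens, k=2)
for c in candidates:
    print(c.field, c.token.text, [text for text, _, _ in c.neighbors])
```

Note that the single date on the page becomes a candidate for both `invoice_date` and `due_date`. Type information alone cannot tell them apart, which is why a scorer that looks at neighboring words such as "Date:" is needed to pick the right field.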
If you are interested in learning with us, please consider joining us at our biweekly ML whiteboard.
To stay in touch with Snorkel AI, follow us on Twitter, LinkedIn, Facebook, YouTube, or Instagram. If you’re interested in joining the Snorkel team, we’re hiring! Please apply on our careers page.