Extracting information from 10-K documents for KYC
Financial institutions are obligated by government policies to carry out customer due diligence as part of customer onboarding. For example, the U.S. Department of the Treasury Financial Crimes Enforcement Network (FinCEN) requires covered financial institutions to identify and verify the identity of beneficial owners of legal entity customers as one of the measures to comply with Anti-Money Laundering (AML) and Anti-Terrorist Financing (ATF) laws and regulations.
Challenge
Analysts and investment managers at a global custodial bank spend over 10,000 hours manually reviewing and transcribing 10-Ks as part of their thorough Know Your Customer (KYC) process. Buried within these PDFs is critical customer information that enables the bank to verify a company’s identity, establish a risk profile, and inform multiple business processes. Because 10-Ks come in various formats, the bank has hundreds of analysts who spend 30-90 minutes per document and process over 10,000 documents each year. If a critical piece of information is missing or incorrect, analysts have to hunt it down, which lengthens customer onboarding and gives competitors an opening.
The bank initially attempted to solve the problem with a rule-based system, but it was rigid: it could identify only a narrow scope of information, and only for certain document formats and layouts. Constant changes in regulations across several regions made the system overly complex, and it required frequent updates that took months to implement. The team saw an opportunity to extract the data with deep learning models, but manually labeling a high-quality training dataset would be slow and arduous. The labeling required internal subject matter expertise and couldn’t be outsourced, so the data science team needed to collaborate closely with subject matter experts and analysts to extract the information accurately.
That collaboration also had to span a wide variety of document formats. But because the ML development lifecycle was siloed into separate data labeling and model training phases, it was hard to improve data and model quality systematically and at scale.
The key challenges were:
- Lack of adaptability to inevitable changes in objectives or production data.
- Manually labeling training data was a bottleneck to building AI that could automate this effort.
- Poor collaboration between domain experts and data scientists made it difficult to resolve ambiguous labels.
Goal
Train an adaptable AI application to extract information from industry documents faster by reducing the time and manual effort to label a high-quality dataset.
Solution
Using Snorkel Flow, the bank’s data scientists, data engineers, ML engineers, and SMEs came together to build a high-quality AI-based solution that saved the team 10,000+ hours and hundreds of thousands of dollars in costs associated with hand-labeling data. The team attained a macro F1 score of 86+ for risk profile with just 25 hours of SME time. In a few short weeks, using Snorkel Flow’s data-centric AI platform, the team created an AI application that takes PDFs of 10-Ks and extracts 50+ different attributes, such as nature of business, location, and key senior managers, from tables, raw text, and multi-page PDF documents.
Next, the application classifies the extracted entities and performs document-level aggregation before outputting all data in a structured tabular format for advanced analytics downstream.
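To make the document-level aggregation step concrete, here is a minimal Python sketch. It is not the bank's or Snorkel Flow's implementation: the span tuples, attribute names, and the simple most-frequent-value rule are all assumptions chosen for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical span-level extractions: (doc_id, attribute, value).
# A real pipeline would produce these from the extraction model.
SPANS = [
    ("acme-10k", "location", "Delaware"),
    ("acme-10k", "location", "Delaware"),
    ("acme-10k", "location", "New York"),
    ("acme-10k", "nature_of_business", "Industrial equipment"),
    ("globex-10k", "location", "Texas"),
]


def aggregate(spans):
    """Collapse span-level values to one value per (document, attribute).

    Assumption: the most frequently extracted value wins.
    """
    votes = defaultdict(Counter)
    for doc_id, attribute, value in spans:
        votes[(doc_id, attribute)][value] += 1

    rows = defaultdict(dict)
    for (doc_id, attribute), counter in votes.items():
        rows[doc_id][attribute] = counter.most_common(1)[0][0]
    return dict(rows)


def to_csv(rows, columns):
    """Render the aggregated rows as a table, one line per document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["doc_id", *columns], restval="")
    writer.writeheader()
    for doc_id in sorted(rows):
        writer.writerow({"doc_id": doc_id, **rows[doc_id]})
    return buf.getvalue()


if __name__ == "__main__":
    table = aggregate(SPANS)
    print(to_csv(table, ["location", "nature_of_business"]))
```

Attributes that were never extracted for a document are left as empty cells, so downstream analytics can tell "missing" apart from a wrong value.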
Using Snorkel Flow, the team was able to overcome the challenges listed above by:
- Ensuring adaptability with rapid code edits to labeling functions instead of wholesale manual relabeling.
- Scaling their ability to label complex, domain-specific text as training data through programmatic labeling.
- Improving collaboration between domain experts and data scientists across labeling, troubleshooting, and iteration.
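The idea behind labeling functions can be sketched in a few lines of plain Python. This is an illustration only and does not use Snorkel Flow's API: the keyword rules, label names, and function names are hypothetical, and a simple majority vote stands in for the learned label model a real programmatic-labeling system would use.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function abstains when its rule does not apply


# Hypothetical labeling functions: each encodes one SME heuristic as code.
def lf_sanctions_language(text):
    pattern = r"\b(sanction|money laundering)"
    return "HIGH_RISK" if re.search(pattern, text, re.I) else ABSTAIN


def lf_going_concern(text):
    return "HIGH_RISK" if "going concern" in text.lower() else ABSTAIN


def lf_unqualified_opinion(text):
    return "LOW_RISK" if "unqualified opinion" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [
    lf_sanctions_language,
    lf_going_concern,
    lf_unqualified_opinion,
]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on ties."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # conflicting heuristics, leave for SME review
    return top


if __name__ == "__main__":
    print(weak_label("Substantial doubt about going concern; sanctions apply."))
```

When a regulation or objective changes, the fix is an edit to one function followed by relabeling the whole dataset in code, which is what makes this approach more adaptable than relabeling by hand.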
The team was ultimately able to speed up the delivery of their KYC solution by streamlining the process of creating high-quality training datasets. Data scientists and subject matter experts (SMEs) also benefited from collaborating in a point-and-click graphical user interface (GUI) environment.
Instead of extracting and labeling data by hand, the bank’s experts can devote more time to getting clients onboarded sooner. Snorkel Flow also gives the bank the flexibility it needs to stay compliant with the latest regulations and avoid expensive fines stemming from errors in the data.
10,000 hours saved for investment managers
1-3 seconds vs. 30-90 minutes to detect 50+ custom attributes