Building Industrial-Strength NLP Applications with Ines Montani
In this episode of Science Talks, Explosion AI’s Ines Montani sat down with Snorkel AI’s Braden Hancock to discuss her path into machine learning, key design decisions behind the popular spaCy library for industrial-strength NLP, the importance of bringing together different stakeholders in the ML development process, and more.
This episode is part of the #ScienceTalks video series hosted by the Snorkel AI team.
Below are highlights from the conversation, lightly edited for clarity:
How did you get into machine learning?
Ines: I have always been into computers, and I spent time making websites as a teenager. However, I did not choose to go into computer programming at first. Instead, I studied media science and linguistics, and then worked in media for a while.
It took me a while to find an area that lets me combine all the things I’m good at. I discovered that in Natural Language Processing (NLP), which combines my passion for software development and languages. That started when I met my co-founder Matt Honnibal, also the original author of spaCy.
I think machine learning is a good field for people without a stereotypical computer science background. They come from another field with specific problems that they want to solve, and then they teach themselves machine learning to solve those problems.
In what ways are models for real-world problems not one-size-fits-all?
Ines: The most valuable and interesting things you can do with ML are specific. Companies have a lot of text data and want to find specific insights to improve their processes, make them more money, and solve their problems. Often, you can download models off the Internet, which is okay but more of a commodity. The things that make a difference are those specific to the business activities and problems: your task-specific training sets.
Real-world problems do not often neatly fit into an end-to-end prediction system. Instead, they consist of multiple parts. The really hard part, the one we see people struggling with the most, is taking the problem and breaking it down into solvable ML components that you can train, evaluate, interpret, debug, and stitch together.
For example, you have a bunch of text documents and an internal database. You want to populate that database based on all text documents collected over the past 20 years. Based on that database, you can assess whether you should do X or Y. This problem is not something that can be solved with a huge language model with a ton of parameters.
“You want to combine different heuristics and sources of supervision into a single problem to get the best results.”
Maybe you can extract relevant document clusters, maybe you can use some manual processes here and there, or maybe you can add regular expressions written 10 years ago that still work well. Those are the types of work that we see currently as most valuable for our users.
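As a rough illustration of combining several heuristics into one decision, here is a minimal sketch that lets a few rules vote on a document label. Everything in it is an assumption made for the example: the labels, the rule functions, the regular expressions, and the simple majority vote are hypothetical and are not taken from the conversation or from any particular library.

```python
import re
from collections import Counter

# Hypothetical labels for illustration; None means "this rule abstains".
INVOICE, CONTRACT, ABSTAIN = "invoice", "contract", None


def legacy_regex_rule(text):
    """Stand-in for an old regular expression that still works well."""
    pattern = r"\binvoice\s+(no\.?|number)\s*\d+"
    return INVOICE if re.search(pattern, text, re.I) else ABSTAIN


def keyword_rule(text):
    """Simple keyword heuristic for contract-like language."""
    lowered = text.lower()
    if "hereinafter" in lowered or "the parties agree" in lowered:
        return CONTRACT
    return ABSTAIN


def amount_due_rule(text):
    """Second, independent signal for invoices."""
    return INVOICE if re.search(r"\bamount due\b", text, re.I) else ABSTAIN


RULES = [legacy_regex_rule, keyword_rule, amount_due_rule]


def combine(text, rules=RULES):
    """Majority vote over the rules that did not abstain.

    Ties go to the label that was voted for first, i.e. by the earlier rule.
    """
    votes = Counter(
        label for rule in rules if (label := rule(text)) is not ABSTAIN
    )
    if not votes:
        return ABSTAIN
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for doc in [
        "Invoice No. 4521, amount due in 30 days",
        "The parties agree to the terms, hereinafter the Agreement",
        "Lunch menu for Friday",
    ]:
        print(repr(doc), "->", combine(doc))
```

Each rule stays small enough to evaluate and debug on its own, and a document that no rule recognizes is left unlabeled rather than guessed at, so it can be routed to a manual process instead.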
How do you think about bringing different user profiles into the ML development process?
Ines: I think you can separate the development part from the business logic part. Of course, it won’t work if the ML engineer just builds the whole thing without knowing anything about the business. On the other hand, ML is still an abstract concept that requires a fair amount of expertise and fundamental understanding.
“The ultimate solution is to bring the subject matter experts and the developers to build the product together. I think that’s the right formula for success.”
That’s also where the “Let Them Write Code” idea comes up. I often find it very offensive when people divide the world into coders and non-coders. Fundamentally, if you look at people who don’t program in their jobs, they still use complex work tooling. If you look at what an architect, a lawyer, or an accountant does, it’s not like they just sit there and press a button. People do want good tooling. That means programming for some and using tools built by someone else for others.
I don’t think we need to oversimplify things or put weird abstractions on top of the tools. We just need to find a way to expose the capabilities of machine learning sensibly to everyone.
How do you prioritize feature requests for spaCy?
Ines: We have always tried to keep a very tight scope on the library. That helps us prioritize what features to develop. spaCy has always been designed to be used in production for real-world use cases. Thus, we are opinionated about having a small number of ways to accomplish things. It is a library for information extraction from text, so we wouldn’t put other NLP capabilities in the core library, such as text generation.
Most software products and libraries are built by very small teams, even if you look at projects at very large organizations. I think the same could be applied to open-source projects.
How did the “Let Them Write Code” design philosophy come about?
Ines: It echoed how people used the library and how we wanted to use the software ourselves. The idea is that not everyone needs to write code; instead, they can make smart API choices that simplify things.
I often feel like people mistake this idea for just putting an abstraction (like a click of a button or a function) on top of something complex. The main difference here is whether that abstraction makes sense and lets you program correctly. Don’t wrap a complex method up just to say that it’s a one-liner. If it requires more steps to do right, let it remain as multiple steps.
What makes interactive visualizations so effective?
Ines: When we started building these visualizations, we did have developers in mind. There’s always a difference between printing a bunch of stuff and actually looking at a visualization of the model outputs. These visualizations are also helpful to break down abstract things and make them more accessible.
More importantly, these demos should give you an idea of what is going on under the hood so that you don’t treat AI like magic. For a developer tool to be used in machine learning applications, it’s crucial to show the users how decisions are made along the way.
Explosion has several libraries that touch the ML pipeline—spaCy, Prodigy, and Thinc. How do they fit together?
Ines: Prodigy was our first commercial product built on top of spaCy. When we started the company, we decided not to raise money and did a few months of consulting instead to get the company off the ground. Annotating and creating training data came up literally in every conversation, as most people did not take this seriously. They experimented with Mechanical Turk and were surprised that their models weren’t doing well at all.
Thinc is our own ML library that powers spaCy. Early on, when we tried out PyTorch and TensorFlow, they weren’t stable yet. We didn’t want to depend on a large library that kept changing frequently, so we built something lightweight ourselves. Obviously, the landscape looks very different now. We built the recent version of Thinc as a meta library for model composition so that it’s easy for spaCy to support models written in PyTorch, TensorFlow, MXNet, vanilla NumPy, CuPy, and more via a shared API.
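To make the idea of composing models through a shared API more concrete, here is a toy sketch in plain Python. It is not Thinc's code or API: the `Layer` class, `wrap_function`, and this `chain` helper are hypothetical names written for the example, and the lambdas merely stand in for models that might come from different frameworks.

```python
class Layer:
    """Hypothetical shared interface: every layer exposes predict()."""

    def __init__(self, name, forward):
        self.name = name
        self._forward = forward

    def predict(self, X):
        return self._forward(X)


def wrap_function(name, fn):
    """Adapt any callable (standing in for a framework-specific model)."""
    return Layer(name, fn)


def chain(*layers):
    """Compose layers so the output of one feeds the next."""

    def forward(X):
        for layer in layers:
            X = layer.predict(X)
        return X

    return Layer(">>".join(layer.name for layer in layers), forward)


# Two stand-in "models" that only share the predict() interface.
scale = wrap_function("scale", lambda xs: [2.0 * x for x in xs])
relu = wrap_function("relu", lambda xs: [max(0.0, x) for x in xs])
model = chain(scale, relu)

if __name__ == "__main__":
    print(model.name, model.predict([-1.0, 0.5, 3.0]))
    # prints: scale>>relu [0.0, 1.0, 6.0]
```

Because the composed model is itself a `Layer`, the calling code never needs to know which framework each piece came from, which is the property the shared API is meant to provide.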