Ahmad Khan, head of artificial intelligence and machine learning strategy at Snowflake, gave a presentation entitled “Scalable SQL + Python ML Pipelines in the Cloud” about his company’s Snowpark service at Snorkel AI’s Future of Data-Centric AI virtual conference in August 2022. The following is a transcript of his presentation, edited lightly for readability.
Welcome everybody. So we’re going to be talking about scalable SQL + Python ML pipelines. This is a key area that enables organizations to take their ML experiments into production. And so in this talk today, I’m going to be talking about operational challenges in ML.
Everybody can train a model. The difficult part is what comes before training a model and then after. And in this talk, we’re going to be focusing on the before part. What’s really important in the before part is having production-grade machine learning data pipelines that can feed your model training and inference processes. And that’s really key for taking data science experiments into production.
So we’ll talk about that, then we’ll move over and talk about how we at Snowflake view data preparation for machine learning, feature engineering, and data engineering, and how our newly released capability of Snowpark lets you define all this logic in Python rather than SQL.
So we’ll learn a little bit about that, and then we’ll discuss an example of how you can leverage Scikit-Learn within Snowflake and Snowpark to implement some of these feature engineering techniques and also do machine learning model training and inference.
And finally, you’ll see that in action today. I don’t have a lot of time, so we’ll jump into it. It won’t be a long demo, it’ll be a very quick demo of what you can do and how you can operationalize stuff in Snowflake.
So when we talk to organizations, the ones that are really reaping the benefits of their investments in AI and machine learning are the ones that are doing this at scale—that have perhaps thousands of models in production. Almost all their business decisions are being aided by decisions made by AI and ML systems, be it automated decisions or hybrid that are in conjunction with humans. But all of them that are really reaping the benefits are running AI at scale.
However, what we see is most organizations are not reaping the benefits, and they’re running into challenges. And one of the biggest challenges that we see is taking an idea, an experiment, or an ML experiment that data scientists might be running in their notebooks and putting that into production. And so that’s the key challenge that we see in a typical ML workflow: you’ll see these two iterative cycles of experimentation and production.
The data scientists will start with experimentation, and then once they find some insights and the experiment is successful, then they hand over the baton to data engineers and ML engineers that help them put these models into production. And for most organizations that are struggling with this, what ends up happening is that these two cycles are happening in two different environments. And it might be that these are two totally separate data environments and a lot of times they’re separate for compute processing as well. And so data scientists might be leveraging one compute service and might be leveraging an extracted CSV for their experimentation. And then the production teams might be leveraging a totally different single source of truth or data warehouse or data lake and totally different compute infrastructure for deploying models into production.
And this is where organizations get stuck; this is why this handoff from experimentation to production is so challenging because you have different environments that the teams are working in; not only are the teams different, but they’re working in different environments as well.
And this is not just us saying it. You can see, this is a study that was done by Forrester back in 2020, and the key piece there is 14%. And I don’t think it’s changed that much. I would still venture to guess that fewer than 20% of organizations have a repeatable and defined process of going from machine learning experiments to model deployment in production that can then serve a downstream machine learning-enabled application.
So how are we thinking about this problem at Snowflake? This is the key question at Snowflake because we believe data is the most important ingredient for AI and ML workloads. And we view Snowflake as a solid data foundation to enable mature data science machine learning practices. And how we do that is by letting our customers develop a single source of truth for their data in Snowflake. And so that’s where we got started as a cloud data warehouse.
But now with the evolution of Snowflake into a data platform, it’s not just a data repository, it’s not just running reports on your data, but it’s also letting you bring in all sorts of data—structured, semi-structured, unstructured data—and not only having your first-party data but very easily subscribing to third-party datasets through our marketplace operations.
And then once you have access to this data, you’ll be able to process that data, whether it’s for data preparation, feature engineering, or data engineering, but also for model training and defining inference pipelines in Snowflake as well. And doing all of that while giving you the best-in-class experience in terms of security and governance. Giving the right access to the right people in a timely fashion is another one of those challenges with machine learning that’s not talked about. So think of us as a solid data foundation and a processing platform that can really enable your machine learning practice.
Our vision of how we aid our customers with machine learning is by providing a singular platform, a singular environment on which they can do experimentation—so data scientists can leverage their favorite developer tools, be it IDEs, Jupyter, any kind of notebook environment—but also work with datasets that are cataloged, that are best of breed, high-quality datasets in Snowflake for their experimentation.
And then once they’re done with that, it’s very easy, and you’ll see that in the demo today, for me as a data scientist or even as a data engineer to package up that code and put it into production within Snowflake, all within one environment.
Let’s go and talk about machine learning pipelining. The most important thing is feature engineering, and there are a lot of terms here that overlap.
At a very high level, how we see feature engineering is, we tend to group them in two different buckets. They are derived and calculated features where you’re doing things like grouping, sums, maxes, and averages. An example could be the number of items bought by a customer in the last 30 days or mean time between transactions per user. So there’s a lot, and they tend to be compute-intensive. And because they are compute-intensive, we want our customers to persist them so they can be reused.
And the second bucket is what we call representational transforms. These are things like one-hot encoding where you’re going from a categorical variable to a one-hot encoded variable. And these are not really compute-intensive for most structured ML problems. And there, instead of materializing them in your database, you can just compute them on the fly.
A quick example here: I have a users table, I calculate the time since the last purchase, and from there I can calculate the average time between transactions for a particular zip code. And I have this feature that I’m then using in downstream machine learning applications and perhaps making a catalog or making it part of my feature store. So it makes sense for me—because I’m spending compute cycles—to materialize this and make it available so I’m not calculating it over and over again.
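As a rough illustration of the derived feature described above, here is a minimal sketch in plain Python. The transaction rows, the zip codes, and the function name are all hypothetical, and in practice this logic would be expressed with SQL or the Snowpark data frame API against tables in Snowflake rather than in local Python.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical transactions: (user_id, zip_code, purchase_date).
transactions = [
    ("u1", "94105", date(2022, 8, 1)),
    ("u1", "94105", date(2022, 8, 5)),
    ("u1", "94105", date(2022, 8, 11)),
    ("u2", "94105", date(2022, 8, 2)),
    ("u2", "94105", date(2022, 8, 4)),
    ("u3", "10001", date(2022, 8, 3)),
    ("u3", "10001", date(2022, 8, 13)),
]


def avg_days_between_purchases_by_zip(rows):
    """Return {zip_code: mean gap in days between a user's consecutive purchases}."""
    # Step 1: collect each user's purchase dates, keyed by (zip_code, user_id).
    dates_by_user = defaultdict(list)
    for user_id, zip_code, purchased_on in rows:
        dates_by_user[(zip_code, user_id)].append(purchased_on)

    # Step 2: compute gaps between consecutive purchases and pool them per zip code.
    gaps_by_zip = defaultdict(list)
    for (zip_code, _user_id), dates in dates_by_user.items():
        dates.sort()
        for earlier, later in zip(dates, dates[1:]):
            gaps_by_zip[zip_code].append((later - earlier).days)

    # Step 3: aggregate to one feature value per zip code.
    return {zip_code: mean(gaps) for zip_code, gaps in gaps_by_zip.items()}


# The result plays the role of a persisted feature table that downstream models reuse.
feature_table = avg_days_between_purchases_by_zip(transactions)
```

Because the grouping and aggregation touch every transaction, this is the kind of feature worth computing once on a schedule and storing, rather than recomputing for each model.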
And so another example of representational transforms—again, one-hot encoding is a very common one where I have a categorical variable and I’m going from categorical to a one-hot encoded, and this is one of those things where you can just calculate this on the fly.
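A representational transform of this kind can be sketched with Scikit-Learn's `OneHotEncoder`. The category values below are made up for illustration; the point is only that the encoder is cheap to fit and can be applied on the fly at training or inference time.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, one value per row.
train_rows = [["NEAR BAY"], ["INLAND"], ["NEAR OCEAN"], ["INLAND"]]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_rows)

# Categories learned during fit, in sorted order.
categories = list(encoder.categories_[0])

# Transform new rows on the fly; "ISLAND" was not seen during fit.
encoded = encoder.transform([["INLAND"], ["ISLAND"]]).toarray()
```

Each output row has one column per learned category, so nothing needs to be materialized in the database ahead of time.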
And how we view this—how it should be computed for derived and calculated features—we encourage our customers to use our SQL and Python APIs that are made available by the Snowpark library that can process this for large-scale data.
And then, you can persist those out to these feature tables that can then be used as feature stores. And you have one pipeline that’s calculating the feature and it’s running on a certain cadence. And that feature could be used by multiple machine learning workloads on the other side.
And then for representational transforms, we encourage our customers to really leverage the machine learning tools of their choice. A lot of them for structured data tend to use Scikit-Learn—so I’ll use that as an example in the demo today—to do things like encoding and binning.
And so there’s a lot of gray area. You can do a lot of this stuff in Snowpark as well, but for one-hot encoding and other encodings, use the ML library. For calculated features, you can use Snowpark. So now let me talk about Snowpark a little bit, and give you an introduction to how we are enabling additional languages in Snowflake and how you can define these pipelines using data frames, UDFs, and stored procedures.
Essentially, Snowpark is a combination of data frames, UDFs, and stored PROCs that let you define all your data engineering, data science, and feature engineering logic in Python against data in Snowflake. And so previously, prior to Snowpark, if you wanted to do Python programming, you had to run the Python program outside of Snowflake hosted somewhere else. And it would connect using our Snowflake connector to data in Snowflake, pull the data out of Snowflake, and then you’d process that outside of Snowflake. With Snowpark, I can run native Python, and with the use of our data frames, APIs, UDFs, and stored PROCs, I can run it all inside of Snowflake. I don’t need an external compute service to run this.
Going a little bit deeper, and I won’t spend too much time here, this is really a client/server model. On the client side, you have the Snowpark library, a Python library with the data frame API that you pip install into your development environment. The other portion is your ability to run UDFs (user-defined functions) and stored procedures that are written in Python inside of the Snowflake compute service.
And the way this works is I can create a UDF, let’s say, called “predict.” Let’s say I have a pre-trained model and I want to take that and use that inside of a SQL query or data frame operation. And what the Snowpark client will do is serialize this UDF and push it into the Snowflake compute environment. And once it does that, it becomes usable in a Snowflake query or a data frame operation. For any kind of data frames that you write using Snowpark, the Snowpark client will eventually compile that down to a SQL query, and it is the SQL query that ends up running in a distributed fashion on Snowflake.
So here’s a quick example. I have all these cells and I have a UDF, and I can take that UDF and push it down into Snowflake. And then I can use that “predict” UDF in my data frame operation and it will lazily evaluate that and produce a SQL statement in the end that runs.
But as a data scientist, you’re doing this processing, and you’re writing all this logic in Python, and you don’t have to write it in SQL at all. And so I can write native Python functions, push them down, or write these data frame operations that get pushed down and get executed at scale in Snowflake.
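To make the idea of lazy evaluation concrete, here is a toy sketch of a data frame that only records operations and compiles them to a SQL string on request. This is not the Snowpark API or its implementation: the `LazyFrame` class, the `HOUSING` table, the column names, and the `PREDICT` function are all hypothetical and exist only to show the pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a lazily evaluated data frame (NOT the Snowpark API)."""

    table: str
    columns: tuple = ("*",)
    predicates: tuple = ()

    def select(self, *cols):
        # Record the projection; nothing is executed yet.
        return LazyFrame(self.table, tuple(cols), self.predicates)

    def filter(self, predicate):
        # Record the filter; nothing is executed yet.
        return LazyFrame(self.table, self.columns, self.predicates + (predicate,))

    def to_sql(self):
        # Compile the recorded operations into a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


# Build the frame step by step, including a call to a hypothetical PREDICT UDF.
df = (
    LazyFrame("HOUSING")
    .filter("MEDIAN_INCOME > 3")
    .select("ID", "PREDICT(MEDIAN_INCOME, HOUSING_MEDIAN_AGE) AS PREDICTION")
)

# Only at this point is a query produced; a real client would send it to the server.
query = df.to_sql()
```

The design choice worth noticing is that the Python code never touches the data. It describes the work, and the database engine decides how to run the resulting query in a distributed fashion.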
And so then what I can do is automate these as pipelines. And so we provide these tasks that can execute on schedule. As new data’s coming in, your data will automatically get processed into the right kind of features. So it’s made available for training as well as inference pipelines downstream that you can also define in Snowflake.
I’ll come back to this, but let me quickly talk about Scikit-Learn here and what I’m going to be using in the demo to implement this. So you’ve got these transformer objects that can transform the data (for example, one-hot encoding), and I can train an estimator, which abstracts the machine learning algorithm.
Then once I train these transformers and estimators, I can actually package them up in a singular pipeline object on which I can input unseen data and I can get predictions back. And this is something that I’m going to be using from Scikit-Learn and packaging it up as a UDF and stored PROC and pushing it down into Snowflake and running it at scale, all within Snowflake.
The demo is actually very simple. I’ve got a training dataset. I’m gonna be using a stored PROC to run large-scale training here. And then I can save the trained pipeline to a stage. And for inference on a large dataset, I can read the saved pipeline into a UDF, run the UDF at scale across a Snowflake cluster, and then write those predictions into a destination table in Snowflake as well.
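The train, save, load, and predict flow can be sketched locally as follows. This is only an analogy under stated assumptions: a temporary local folder stands in for a Snowflake stage, a plain Python function stands in for the UDF, and a small linear model replaces the real pipeline. None of it uses Snowpark.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LinearRegression

# Training side (the role of the stored procedure): fit a tiny model on y = 2x + 1.
model = LinearRegression().fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])

# Save the fitted model. The temporary folder is a local stand-in for a stage.
stage_dir = Path(tempfile.mkdtemp())
model_path = stage_dir / "model.pkl"
model_path.write_bytes(pickle.dumps(model))

# Inference side (the role of the UDF): load the saved model once, then reuse it.
loaded_model = pickle.loads(model_path.read_bytes())


def predict(rows):
    """Score a batch of feature rows with the previously saved model."""
    return loaded_model.predict(rows).tolist()


predictions = predict([[4.0], [10.0]])
```

Loading the model once outside the scoring function, rather than on every call, matters when the function is invoked across many batches of rows.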
Setting up an environment: for this use case, I’m gonna be using VS Code, not Jupyter, but you can set this up in Jupyter very easily as well. Then I set up a Conda environment, create the Conda environment, and simply select that Conda environment within VS Code.
It’s a very simple machine learning example. And you’ve seen this where we’re trying to predict the median house value. So you’ve seen this probably hundreds of times before, but essentially we’ve got numerical features and a categorical feature, and there are nulls in there, so we want to do imputations to handle nulls. We want to do some scaling of the numerical features, and then we want to do one-hot encoding for the categorical feature. And then we want to input all of that into a random forest model and package that up as a pipeline and run this all in Snowflake.
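The pipeline described here can be sketched with Scikit-Learn as follows. The data is synthetic and the column names are assumptions chosen to resemble a housing dataset; this is not the demo's actual code, only one plausible way to wire imputation, scaling, one-hot encoding, and a random forest into a single pipeline object.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.uniform(1, 50, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
target = 50_000 * data["median_income"] + rng.normal(0, 5_000, n)
data.loc[::10, "median_income"] = np.nan  # inject nulls for the imputer to handle

numeric_features = ["median_income", "housing_median_age"]
categorical_features = ["ocean_proximity"]

# Numeric columns: fill nulls with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical column: fill nulls with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", categorical_steps, categorical_features),
])

# Preprocessing and the model are packaged as one object.
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
full_pipeline.fit(data, target)

# Unseen rows may contain nulls and categories not present during training.
unseen = pd.DataFrame({
    "median_income": [np.nan, 8.0],
    "housing_median_age": [20.0, 35.0],
    "ocean_proximity": ["INLAND", "ISLAND"],
})
predictions = full_pipeline.predict(unseen)
```

Packaging everything into one pipeline means the same imputation, scaling, and encoding fitted on the training data are applied automatically at inference time.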
So let’s quickly go and move over to the demo here. In the interest of time, I’m not going to run this, but I’ll show you the key parts of the code here. And so, in addition to my Scikit-Learn, I’m bringing the Snowpark library, which lets me define data frame operations. And then I make some stages to save my models.
And this is really where I develop and train my model. And so it’s essentially just plain Python code, and I train my model using Scikit-Learn. I have a full pipeline where I have a pre-processor and model. I fit that and then I write that out to a stage in Snowflake.
And all I have to do to push this down and put it into production is call this stored PROC utility method in Snowpark that pushes this train model function down in Snowflake. And then now I have a stored procedure that I can just schedule inside of Snowflake or schedule using Airflow or my favorite orchestrator tool out there.
And then once I have this, I can run this on a cadence. And as new data comes in, I’ve defined a UDF that can take that new data and run it against the model. And so here I’m doing the predictions in Snowflake where I can see the prediction and the actual value.
So I can run that. I can not only retrain the model on a cadence but also, as new data becomes available, I can run the machine learning pipeline in Snowflake as well. And so going back to our presentation over here, what I’m doing is, rather than having to learn, as a data scientist, ML infrastructure and cloud infrastructure concepts such as how to use containers, I’m packaging up my code as UDFs and stored PROCs and just leveraging those building blocks in Snowflake to deploy that.
And all the third-party libraries are made available through our partnership with Anaconda. And everything is happening where the data lives. Rather than taking the data out of Snowflake and processing and doing machine learning training somewhere else, you’re running everything in one system. All of this can be automated to the point where, as your new data is arriving, not only are you able to train and retrain your models, but also run the inference pipelines all on a schedule as well.
So, about the code that I shared with you: I know I didn’t have a lot of time to run it, but if you want to take a picture of this, there’s a whole demo out on GitHub that you can follow along with. All you need is a Snowflake account and VS Code to run this. And now I know I’m running short on time, so I’ll open it up for any kind of Q&A.
Question and answer session
Priyal Aggarwal: Thank you so much. You talked about the trends that we have seen throughout the day in other talks as well, and really across the industry that it’s just not about two separate cycles of say, labeling and model development, but the harmony between the two processes. We have a few questions that I would like to now take. The first question is from Andrew. And they say, “How does Snowpark handle versioning of these code fragments and artifacts?”
Ahmad Khan: Yeah, that’s a great question. So right now the way Snowpark is structured, it gives you the processing building blocks in Snowflake and a method to leverage the compute within Snowflake. The version control system remains the same. So if you’re using GitHub, or your favorite CI/CD tools out there, like Jenkins and others, you can still use those tools. And the beauty of this pattern is that if you have existing processes for CI/CD and version control in your organization, you can just keep that as is. And even with the development environment, we didn’t want to go in and say, “you need to use our hosted notebooks.” You can use whatever development environment that you’re used to: VS Code, IntelliJ, PyCharm, Jupyter Notebooks, everything is supported. And you can use your existing CI/CD and version control processes to keep track of this.
PA: Got it. Yeah, that makes a lot of sense. Thank you. The next question is from Roshni and she says, “How does Snowpark enable data-centric ML workflows where experimentation and production can be in the same environment?”
AK: Yeah, absolutely. So the core of this is the fact that your data is in one place. Snowflake becomes a single source of truth. So it’s not like you’re taking extracts out of the datasets. And so we have a lot of these data platform tools where you can create copies of your production data. Your data scientists could be working and experimenting on those zero-copy clones of datasets, and once they’re satisfied, they can run their tests and move that over to a production environment. The big benefit comes from the fact that it’s a shared dataset or data environment. And on top of that, your compute environment is shared as well. And it’s the same compute environment that you’re running for both experimentation and production. And so it’s not like you’re running on a totally separate compute environment where certain libraries will not work on the production environment. It’s the same environment both for data as a repository and for processing.
PA: That makes a lot of sense. And it takes so much of the overhead of these library management processes away from the users, thank you. And probably this will be the last question. It’s from Carl and they say, “Does Snowpark interface with Spark, or is its distributed compute leveraged through conversion to SQL code?”
AK: It is the latter, so it is converting to SQL. It does not leverage Spark at all—this is built entirely from scratch at Snowflake. We have been working on this for the last two and a half years. And so all the data frame operations that you define in Snowpark, yes, they get converted to SQL statements. However, it also lets you define these functions as UDFs and stored PROCs that you can write in native Python, and we have a separate Python runtime in the Snowflake compute environment. And that gets parallelized as well. So it’s not as simple as taking everything and converting it to SQL. There are bits like UDFs and stored PROCs that are running native Python at scale in a distributed fashion within the Snowflake cluster.
PA: Got it. That makes so much sense. Yeah. Thank you so much for sharing Ahmad. We could not get to all the questions today, so what would be the best way to connect with you in case audience members have further questions?
AK: Yeah, absolutely. Connect with me on LinkedIn. I think my speaker profile should have that. You can reach out over Twitter as well.