4 practical examples showcasing different ways to use SQL and Python in tandem.
TL;DR: You can combine SQL and Python for advanced data manipulation in your day-to-day workflows. This post covers four examples of this in practice: ETL processes using dbt and Snowpark, A/B test analysis, customer churn prediction, and building interactive dashboards. We also discuss how combining SQL's data retrieval capabilities with Python's advanced processing and visualization tools can significantly improve efficiency and insights in data-driven tasks, while also introducing Polars as a high-performance alternative to Pandas for handling large datasets.
In a previous post, we talked about why working with SQL and Python in tandem is crucial for exploratory data analysis (EDA) and ad hoc analysis. As data practitioners, combining these two powerful tools can unlock a new level of efficiency and insight in our workflows.
In this post, we’ll walk through four practical examples of how SQL and Python can be used together to tackle complex data tasks:
- Leveraging Python during the transformation using dbt and Snowpark
- Analyzing statistical significance from A/B test results
- Predicting customer churn
- Building interactive dashboards w/ Plotly
We’ll also touch on Polars DataFrames, a new and emerging technology that’s quickly gaining traction in the data community. Polars offers a significant performance boost over traditional Pandas DataFrames, making it a great option for handling large datasets and performing complex transformations. If you haven’t explored Polars yet, this is a perfect opportunity to see how it can fit into your SQL and Python workflow.
Common situations where Python and SQL integrate together
Before diving into the examples at length, let’s briefly discuss the most common scenarios where SQL and Python work best together, especially in the context of EDA and ad hoc analysis:
- Advanced data integration and ETL (Extract, Transform, Load): Combining SQL’s data extraction capabilities with Python’s transformation and loading functions allows for a seamless ETL process that can handle large volumes of data efficiently. We’re going to use an example using a hybrid SQL and Python approach and Snowpark.
- Data-driven decision-making and A/B Testing: SQL can quickly retrieve experimental data, and Python can be used to perform statistical tests and visualize the results, helping teams make informed decisions faster.
- Advanced machine learning and feature engineering: SQL is often used to retrieve large datasets from databases, which can then be fed into Python’s machine learning models, which are great for forming predictions. Python also excels in feature engineering, a critical step in building effective models.
- Custom reporting and dashboarding: SQL can pull the necessary data, while Python can be used to create dynamic and customizable reports or dashboards, offering a more interactive and user-friendly experience.
This isn’t an exhaustive list, but the four scenarios above represent ripe opportunities to uplevel our analysis work by using SQL and Python in tandem. Let's take a look at some specific examples within each.
Python and SQL in practice: Step-by-step examples
For the example in this post, we’re going to use a fictional company “Superdope” that sells cool widgets online. They have 200 employees and data responsibilities are shared between a data engineer and a data scientist.
1. Advanced data integration and ETL (Extract, Transform, Load) w/ dbt and Snowpark
In modern data workflows, integrating data from multiple sources, cleaning it, and loading it into a data warehouse is essential.
Traditionally, SQL has been the go-to tool for these tasks, but now, with advancements like dbt and Snowpark from Snowflake, you can directly integrate Python into your ETL processes. This approach lets you do two things:
- Hybrid transformations: Snowpark lets you write Python code that runs directly within Snowflake so you can perform complex data manipulations and apply machine learning models without leaving your data warehouse environment.
- Seamless integration: dbt typically handles SQL-based transformations, but, with Snowpark, you can embed Python into your dbt models and create a powerful hybrid transformation layer that leverages the strengths of both SQL and Python.
So let’s take a look at Superdope’s use case. They have customer reviews from their online store, and they want to score the sentiment of each customer on a regular basis for better target marketing.
First we want to retrieve the most recent reviews since we don’t want to process the entire dataset every time:
Then we want to apply the sentiment analysis:
And then of course we want to then run the model
You can see from this example how we were able to do some analysis using Python that you would not have been able to do just with SQL and bake that directly into your data pipeline.
2. Data-driven decision-making for A/B testing
It’s hard to make data-driven decisions without using A/B testing at some point. Often, we’re using SQL to retrieve experimental data, and Python to perform statistical analysis and visualize the results.
ℹ️ Note for Fabi.ai users: In a lot of cases, A/B testing data may come in the form of a file. Even if your data does not live in a data warehouse, you can upload it to Fabi.ai and store it in a temporary database to accomplish the following example.
For this example, let’s say that our friends at Superdope are testing out a new checkout page to increase conversion.
First we want to retrieve the A/B test data and group it by week so that we can analyze the trend:
Then we want to calculate the statistical significance (note: we’re not grouping by week here, but doing so would be helpful to plot the trends of the two tests over time as the experiment is running):
If your A/B test is just an export from your marketing platform, you can obviously skip the SQL step, but if you do store these results in a database or data warehouse, you may find this handy. You can see how we were able to quickly filter on the relevant experiment and measure the significance in just a few lines of Python.
If you’re on a RevOps or MarketingOps team and wondering why you might not do this in dbt and store the results back in your data warehouse it’s because A/B tests tend to be more ephemeral. You may not want to go through the process of baking this into your model, which can also require advanced data engineering skills which you may not have readily available.
3. Advanced machine learning for churn prediction
Machine learning is an extremely powerful tool for predicting future outcomes and is something you can use to do custom lead scoring, sales forecasting, or even churn prediction. For the example below, we’ll focus on the churn prediction use case.
When building machine learning models, we can use SQL to efficiently retrieve large datasets from databases, which we can then feed into Python’s machine learning libraries like scikit-learn to complete the prediction. Additionally, Python’s flexibility allows for advanced feature engineering, which you may need, depending on the quality of your data and the types of machine learning models you decide to use. For example, if you’re doing a regression analysis, there’s a good chance you’ll have to find some way to handle null values or cardinal values.
ℹ️ A note on Polars here: They don’t integrate well with a lot of machine learning libraries like scikit-learn. They’re constantly building more support for common libraries, but as of the time of writing this post, in its current state, it's easier to work with more traditional pandas dataframes.
OK, now let’s look at the example for Superdope. They want to use the customer sentiment score we calculated earlier to try and predict high risk customers.
First, they’ll need to retrieve customer information and create some of the baseline features they want to use for our model. In this simple example, we’ll use average order value (AOV), number of purchases and average sentiment score:
Then, we’ll build a composite “health_score,” which we can use to label customers as “high risk” or not “high risk”. Once we’ve done that, we can build our prediction model using a simple random forecast classifier (or insert whatever model you would rather use here).
Once you’ve built and tested this model, you can either do a one-off analysis on customers that aren’t yet scored, or you can consider inserting this prediction into your data model (see example above about doing this in dbt).
4. Custom reporting & dashboarding w/ Plotly
Custom reporting and dashboarding are essential for delivering insights to stakeholders. Often, data practitioners deliver this reporting via a business intelligence (BI) tool.
Unfortunately, most BI solutions don’t offer integrated Python kernels. So if you have to do some advanced data analysis that’s best done (or can only be done) using Python, the analysis has to happen outside of BI, written back to your data warehouse then uploaded to BI. This is a lot of heavy lifting, especially if you’re still in the exploratory phase.
If we’re looking to customize our reporting beyond the capabilities of common BI tools, we can use SQL to pull the necessary data, and Python can create dynamic and customizable reports or dashboards. This approach makes it possible to integrate machine learning in our models or create richer interactions with charts and tables in the user interface.
Let’s go back to Superdope for this last example. They built their customer risk scoring model, they were able to write the “high_risk” field back to their data warehouse, and now their stakeholders want an internal tool to pull this data. We’ll keep this example very simple, but with data apps and Python, you can start building all sorts of interesting tools to, for example, let your users upload files and move away from relying on data being in the warehouse.
In the first piece of this script, we write a function to retrieve the data from the data warehouse and store it as a DataFrame.
The second piece of this script is the main part where we build the Streamlit app and build the UX components for the end-user, including the filters, table and export functionality.
In this example we used Streamlit, but you can accomplish this with Jupyter notebook plugins like Voilà or using Plotly Dash. These are all powerful solutions, the only two important pieces to keep in mind is that in order to share these reports (or data apps) out with your stakeholders, you’ll need to host these and manage the service along with things like the authentication method. These solutions also quickly run into scalability which may impact how much data you can process and/or your user’s experience.
Another more scalable and convenient option is to use Fabi.ai, which lets you publish interactive Python dashboards in just a few clicks.
What’s next: You’re ready to start jamming with SQL and Python
When you combine SQL and Python together in your data workflows you can seriously upgrade your capabilities as a practitioner. And, with the addition of Polars, you can handle larger datasets and perform more complex transformations faster than ever before. Whether you’re working on ETL processes, real-time data analysis, A/B testing, churn prediction, sales forecasting, or custom reporting, SQL and Python together offer a versatile and efficient solution.
If you’re interested in exploring these use cases, but want to cut down on some of the manual work that goes into advanced analysis (and sharing it with stakeholders), you can check out Fabi.ai for free and get up and running in less than five minutes.