SQL in Pandas with Pandasql – KDnuggets

Want to query your pandas dataframes using SQL? Learn how to do so using the Python library Pandasql.

Image by Author

If you can add only one skill—and inarguably the most important—to your data science toolbox, it is SQL. In the Python data analysis ecosystem, however, pandas is a powerful and popular library.
But, if you are new to pandas, learning your way around pandas functions—for grouping, aggregation, joins, and more—can be overwhelming. It would be much easier to query your dataframes with SQL instead. The pandasql library lets you do just that!
So let’s learn how to use the pandasql library to run SQL queries on a pandas dataframe, using a sample dataset.

Before we go any further, let’s set up our working environment.

If you’re using Google Colab, you can install pandasql by running `!pip install pandasql` in a notebook cell and code along.

If you’re using Python on your local machine, ensure that you have pandas and Seaborn installed in a dedicated virtual environment for this project. You can use the built-in venv package to create and manage virtual environments.
I’m running Python 3.11 on Ubuntu 22.04 LTS, so the following instructions are for Ubuntu (they should also work on a Mac). If you’re on a Windows machine, follow these instructions to create and activate virtual environments.
To create a virtual environment (v1 here), run `python3 -m venv v1` in your project directory.

Then activate the virtual environment by running `source v1/bin/activate`.

Now install pandas, seaborn, and pandasql by running `pip3 install pandas seaborn pandasql`.

Note: If you don’t already have `pip` installed, you can update the system packages and then install it by running `apt install python3-pip`.

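To run SQL queries on a pandas dataframe, you can import sqldf from pandasql and call it as `sqldf(query, globals())`.

Here, `query` is a string containing the SQL query you want to run, and `globals()` supplies the namespace in which the dataframes named in the query are looked up.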

Let’s start by importing the required packages, pandas (`import pandas as pd`) and seaborn (`import seaborn as sns`), along with the sqldf function (`from pandasql import sqldf`).

Because we’ll run several queries on the dataframe, we can define a small helper function, `run_query(query)`, that returns `sqldf(query, globals())`, so we can call it by passing in the query as the argument.

For all the examples that follow, we’ll run the run_query function (that uses sqldf() under the hood) to execute the SQL query on the tips_df dataframe. We’ll then print out the returned result.

For this tutorial, we’ll use the “tips” dataset built into the Seaborn library. The “tips” dataset contains information about restaurant tips, including the total bill, tip amount, gender of the payer, day of the week, and more.
Let’s load the “tips” dataset into the dataframe tips_df by running `tips_df = sns.load_dataset("tips")`.

Here’s our first query—a simple SELECT statement:

As seen, this query selects all the columns from the tips_df dataframe, and limits the output to the first 10 rows using the `LIMIT` keyword. It is equivalent to performing tips_df.head(10) in pandas:
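The query itself isn’t reproduced here, but from the description it might look like the string below. As a minimal sketch, the snippet pairs that assumed query with its pandas equivalent. To keep it runnable without downloading the dataset, tips_df here is a small made-up stand-in that borrows a few column names from the “tips” dataset; the values are not real data.

```python
import pandas as pd

# Hypothetical stand-in for the "tips" dataset: made-up values, 12 rows
tips_df = pd.DataFrame({
    "total_bill": [10.0 + i for i in range(12)],
    "tip": [1.0 + 0.5 * i for i in range(12)],
    "day": ["Thur", "Fri", "Sat", "Sun"] * 3,
})

# Assumed form of the first query, based on the description above
query_1 = "SELECT * FROM tips_df LIMIT 10;"

# pandas equivalent: all columns, first 10 rows
first_ten = tips_df.head(10)
print(first_ten)
```

With pandasql set up as described earlier, passing query_1 to run_query should return the same ten rows.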

Output of query_1

Next, let’s write a query to filter the results based on conditions:

This query filters the tips_df dataframe based on the condition specified in the WHERE clause. It selects all columns from the tips_df dataframe where the ‘total_bill’ is greater than 30 and the ‘tip’ amount is greater than 5.
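As a rough sketch of the filter described above, here is one way the query could be written, next to its pandas equivalent using boolean masks. The exact query text is an assumption, and tips_df is again a small made-up stand-in rather than the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values, same column names as "tips")
tips_df = pd.DataFrame({
    "total_bill": [12.5, 35.0, 48.0, 20.0, 31.5, 15.0],
    "tip": [2.0, 6.0, 9.0, 3.0, 4.5, 2.5],
    "day": ["Thur", "Sat", "Sun", "Fri", "Sat", "Sun"],
})

# Assumed form of the filtering query
query_2 = "SELECT * FROM tips_df WHERE total_bill > 30 AND tip > 5;"

# pandas equivalent: combine both conditions with & (each in parentheses)
filtered = tips_df[(tips_df["total_bill"] > 30) & (tips_df["tip"] > 5)]
print(filtered)
```

Note that the row with a bill of 31.5 is dropped because its tip (4.5) fails the second condition, just as `AND` requires both conditions to hold.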
Running query_2 gives the following result:

Output of query_2

Let’s run the following query to get the average bill amount grouped by the day:
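A query along these lines, together with its pandas groupby equivalent, might look like the sketch below. The alias avg_bill is an assumption made for illustration, and tips_df is a small made-up stand-in, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values, same column names as "tips")
tips_df = pd.DataFrame({
    "total_bill": [12.5, 35.0, 48.0, 20.0, 31.5, 15.0],
    "tip": [2.0, 6.0, 9.0, 3.0, 4.5, 2.5],
    "day": ["Thur", "Sat", "Sun", "Fri", "Sat", "Sun"],
})

# Assumed form of the query; the alias avg_bill is illustrative
query_3 = "SELECT day, AVG(total_bill) AS avg_bill FROM tips_df GROUP BY day;"

# pandas equivalent: group by day, then average the total_bill column
avg_by_day = (
    tips_df.groupby("day", as_index=False)["total_bill"]
    .mean()
    .rename(columns={"total_bill": "avg_bill"})
)
print(avg_by_day)
```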

Here’s the output:

Output of query_3

We see that the average bill amount on weekends is marginally higher.
Let’s take another example for grouping and aggregations. Consider the following query:

The query query_4 groups the data in the tips_df dataframe by the ‘day’ column and calculates the following aggregate functions for each group:
As seen, we get the above quantities grouped by the day:
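The specific aggregates aren’t listed here, so the sketch below assumes three common ones purely for illustration: the number of rows, the average bill, and the maximum tip for each day. The aliases are hypothetical, and tips_df is a made-up stand-in. The pandas side uses named aggregation to produce the same columns.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values, same column names as "tips")
tips_df = pd.DataFrame({
    "total_bill": [12.5, 35.0, 48.0, 20.0, 31.5, 15.0],
    "tip": [2.0, 6.0, 9.0, 3.0, 4.5, 2.5],
    "day": ["Thur", "Sat", "Sun", "Fri", "Sat", "Sun"],
})

# Assumed query: the choice of aggregates and aliases is illustrative
query_4 = """
SELECT day,
       COUNT(*) AS num_transactions,
       AVG(total_bill) AS avg_bill,
       MAX(tip) AS max_tip
FROM tips_df
GROUP BY day;
"""

# pandas equivalent using named aggregation.
# "count" skips missing values, so it matches COUNT(*) only when
# total_bill has no missing values (true for this stand-in).
summary = (
    tips_df.groupby("day")
    .agg(
        num_transactions=("total_bill", "count"),
        avg_bill=("total_bill", "mean"),
        max_tip=("tip", "max"),
    )
    .reset_index()
)
print(summary)
```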

Output of query_4

Let’s add an example query that uses a subquery:
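The query text isn’t shown here, so as an illustration assume a common subquery pattern: the inner query computes the average total_bill, and the outer query keeps only the rows above that average. The sketch below shows that assumed query and a pandas equivalent on a made-up stand-in for tips_df.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values, same column names as "tips")
tips_df = pd.DataFrame({
    "total_bill": [12.5, 35.0, 48.0, 20.0, 31.5, 15.0],
    "tip": [2.0, 6.0, 9.0, 3.0, 4.5, 2.5],
    "day": ["Thur", "Sat", "Sun", "Fri", "Sat", "Sun"],
})

# Assumed query: the subquery returns a single value, the average bill
query_5 = """
SELECT *
FROM tips_df
WHERE total_bill > (SELECT AVG(total_bill) FROM tips_df);
"""

# pandas equivalent: compute the average first, then filter on it
average_bill = tips_df["total_bill"].mean()
above_average = tips_df[tips_df["total_bill"] > average_bill]
print(above_average)
```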

Here,
Running query_5 gives the following:

Output of query_5

We only have one dataframe. To perform a simple join, let’s create another dataframe like so:

The other_data dataframe associates each day with a special event.
Let’s now perform a LEFT JOIN between the tips_df and the other_data dataframes on the common ‘day’ column:
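As a minimal sketch of this step, the snippet below builds a hypothetical other_data dataframe and shows an assumed LEFT JOIN query alongside the pandas merge equivalent. The column name special_event, the event names, and the rows of tips_df are all made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values, same column names as "tips")
tips_df = pd.DataFrame({
    "total_bill": [12.5, 35.0, 48.0, 20.0, 31.5, 15.0],
    "tip": [2.0, 6.0, 9.0, 3.0, 4.5, 2.5],
    "day": ["Thur", "Sat", "Sun", "Fri", "Sat", "Sun"],
})

# Hypothetical lookup table: one special event per day
other_data = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun"],
    "special_event": ["Trivia night", "Live music", "Happy hour", "Brunch buffet"],
})

# Assumed query: keep every row of tips_df and attach the matching event
query_6 = """
SELECT t.*, o.special_event
FROM tips_df AS t
LEFT JOIN other_data AS o
  ON t.day = o.day;
"""

# pandas equivalent: a left merge on the shared "day" column
joined = tips_df.merge(other_data, on="day", how="left")
print(joined)
```

Because this is a left join, every row of tips_df is kept. If a day had no match in other_data, its special_event would be missing (NULL in SQL, NaN in pandas).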

Here’s the result of the join operation:

Output of query_6

In this tutorial, we went over how to run SQL queries on pandas dataframes using pandasql. Though pandasql makes querying dataframes with SQL super simple, there are some limitations.
The key limitation is that pandasql can be several orders of magnitude slower than native pandas. So what should you do? If you need to perform data analysis with pandas, you can use pandasql to query dataframes while you are still learning pandas and want to ramp up quickly. You can then switch to native pandas, or another library like Polars, once you’re comfortable.
To take the first steps in this direction, try writing and running the pandas equivalents of the SQL queries that we’ve run so far. All the code examples used in this tutorial are on GitHub. Keep coding!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.

