How to Run SQL Queries from Jupyter
While learning the fundamentals of data science, most of the tabular data you encounter is presented as comma-separated value (.csv) files. This can lead data science students to believe that .csv files are representative of real-world scenarios, which is not the case.
By the end of this tutorial, you will be able to run any SQL query from a Jupyter Notebook.
In the real world, data is more likely than not stored in databases. Therefore, at some point in their data science journey, students get introduced to databases.
There are different kinds and classifications of databases, which we will not be able to cover here. In this article, we will focus on SQL databases (as opposed to NoSQL databases), with illustrations using a subset of the CIA Factbook stored as an SQLite database.
Jupyter is an award-winning set of web applications for interactive computing that lets you keep live code, documentation and visualizations together in a single file. What this means in practice is that you can run parts of your code over and over as you work on your data science project. You can learn more on the Project Jupyter website.
This interactivity is one major reason one may be interested in running SQL queries from the Jupyter IDE instead of a traditional SQL IDE. You can run a query in each cell and compare the outputs of different cells in the same notebook instead of viewing just one output at a time.
Another reason is the ability to pass the result of a SQL query, as an object, to other languages such as Python or R.
Let’s get set up.
Setting up
First of all, you’ll need to install ipython-sql to run SQL queries directly in the Notebook. You can do so by typing the following in a Jupyter cell:
!pip install ipython-sql
If you are installing from the command line interface, remove the exclamation mark in front of the command:
pip install ipython-sql
Second, you need to install SQLAlchemy to take full advantage of SQL in Python. The command for this is:
!pip install SQLAlchemy
Depending on the specific database you’re using, you may need to install one of the following libraries by typing the following command into any Jupyter cell:
- SQLite
No installation needed: the sqlite3 module ships with the Python standard library.
- Oracle database
!pip install cx_Oracle
- SQL server
!pip install pyodbc
- MySQL
!pip install pymysql
- PostgreSQL
!pip install psycopg2
Commands for installing the corresponding modules for other databases can be found with a simple search.
Next, enter the following code into a cell to connect the Jupyter notebook to the database file:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
N.B.: The database linked in the cell above is in the same directory as the Jupyter notebook; hence the relative path /factbook.db on the third line. If the database were in another directory on the local machine, or hosted online, we would replace /factbook.db with the path to the .db file or the URL of the database.
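As a sketch, connection strings for a few common setups follow SQLAlchemy's URL format; the paths, usernames and database names below are placeholders, not values from this tutorial:

```python
%sql sqlite:////home/user/data/factbook.db            # SQLite file at an absolute path
%sql postgresql://user:password@localhost/mydb        # PostgreSQL via psycopg2
%sql mysql+pymysql://user:password@localhost/mydb     # MySQL via pymysql
%sql oracle+cx_oracle://user:password@localhost/mydb  # Oracle via cx_Oracle
```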
Running Queries
Now that we have connected our notebook to the database, we can run any standard query in any of our notebook cells.
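Under the hood, ipython-sql hands our queries to SQLAlchemy and the database driver. As a plain-Python sketch of the same idea, here is the standard-library sqlite3 module running a query against an in-memory database (the table and rows are made up for illustration):

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("Albania", 3029278), ("Algeria", 39542166), ("Brazil", 204259812)],
)

# Any standard SQL query can be run against the connection.
rows = conn.execute(
    "SELECT name FROM facts WHERE name LIKE 'A%' ORDER BY name"
).fetchall()
print(rows)  # [('Albania',), ('Algeria',)]
```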
For a single-line query, prefix the query with %sql.
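For instance, assuming the Factbook database has a table named facts, a one-line count query in a notebook cell might look like this (the table name here is illustrative):

```python
%sql SELECT COUNT(*) FROM facts;
```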
However, most SQL queries are written as multi-line code blocks, for aesthetic and practical reasons. To mark an entire cell as a SQL code block, start the cell with %%sql. To list all the tables in an SQLite database, we can query the built-in sqlite_master table; executing such a cell prints the result as a table in the notebook.
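A sketch of a table-listing cell, using SQLite's built-in sqlite_master catalog:

```python
%%sql
SELECT name, type
  FROM sqlite_master
 WHERE type = 'table';
```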
Storing the Output of a Query in a Variable
One advantage of ipython-sql, and hence of Jupyter Notebooks, is that you can store the result of a query in a Python variable.
To do that, assign a single-line query to the variable in a Jupyter cell. The result is stored in the variable and can be manipulated like any other Python object. For example, querying this database for the number of countries whose names start with the letter A returns 19.
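As a sketch, assuming a facts table with a name column, the assignment might look like this:

```python
# The result of the single-line query is captured in a Python variable.
countries_a = %sql SELECT COUNT(*) FROM facts WHERE name LIKE 'A%';
print(countries_a)
```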
Converting the Output of a SQL Query into a Pandas DataFrame
To convert the result of a query to a Pandas DataFrame, call the .DataFrame() method on the query result. And voilà, you have the result of your query as a DataFrame, opening the result up to further analysis. Calling type() on the converted result confirms that it is a Pandas DataFrame.
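A minimal sketch, again assuming a facts table with name and population columns:

```python
result = %sql SELECT name, population FROM facts;
df = result.DataFrame()  # convert the query result to a Pandas DataFrame
print(type(df))          # pandas.core.frame.DataFrame
```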
Applying the Pandas sort_values() method then reorders the rows of the DataFrame.
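Once the result is a DataFrame, sorting works as with any other Pandas data; the figures below are stand-ins rather than actual Factbook values:

```python
import pandas as pd

# Stand-in rows; in practice these would come from the query result.
df = pd.DataFrame({
    "name": ["Albania", "Algeria", "Brazil"],
    "population": [3029278, 39542166, 204259812],
})

# Largest population first.
ranked = df.sort_values("population", ascending=False)
print(ranked["name"].tolist())  # ['Brazil', 'Algeria', 'Albania']
```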
Quick Plotting
You can quickly plot the result of a query, without first converting it to a Pandas DataFrame, if you have matplotlib installed. Methods available on a query result include .plot(), .pie() and .bar().
For example, .bar() can be called directly on a SQL query result to produce a bar chart of the result set.
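A sketch of such a cell, assuming the facts table used above and matplotlib installed:

```python
result = %sql SELECT name, population FROM facts LIMIT 5;
result.bar()  # bar chart of population by country name
```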
Conclusion
I hope I have been able to show you how to use Jupyter Notebooks to write SQL queries and some possibilities that open up when you use Jupyter Notebooks as a SQL IDE.
Let me know if you have any questions in the comments.
Thanks for reading.