How to Run SQL Queries from Jupyter
While learning the fundamentals of data science, most of the tabular data you encounter is presented as comma-separated value (.csv) files. This can lead data science students to believe that .csv files are representative of real-world scenarios, which is not the case.
By the end of this tutorial, you will be able to run any SQL query from a Jupyter Notebook.
In the real world, data is more likely than not stored in databases. Therefore, at some point in their data science journey, students get introduced to databases.
There are different kinds and classifications of databases, which we will not be able to cover here. In this article, we will focus on SQL databases (as opposed to NoSQL databases), with illustrations using a subset of the CIA Factbook stored as an SQLite database.
Jupyter is an award-winning set of web applications for interactive computing that lets you keep live code, documentation and visualizations together in a single file. What this means in practice is that you can run parts of your code over and over as you work on your data science project. You can learn more on the Project Jupyter website.
This interactivity is one major reason one may be interested in running SQL queries from the Jupyter IDE instead of a traditional SQL IDE. You can run a query in each cell and compare the outputs of different cells in the same notebook instead of viewing just one output at a time.
Another reason is the ability to pass the result of a SQL query, as an object, to other languages such as Python or R.
Let’s get set up.
Setting up
First of all, you’ll need to install ipython-sql to run SQL queries directly in the Notebook. You can do so by typing the following in a Jupyter cell:
!pip install ipython-sql
If you are installing from the command line interface, remove the exclamation mark in front of the command:
pip install ipython-sql
Second, you need to install SQLAlchemy to take full advantage of SQL in Python. The command for this is:
!pip install SQLAlchemy
Depending on the specific database you’re using, you may need to install one of the following libraries by typing the following command into any Jupyter cell:
- SQLite
No installation needed: the sqlite3 module ships with the Python standard library.
- Oracle database
!pip install cx_Oracle
- SQL server
!pip install pyodbc
- MySQL
!pip install pymysql
- PostgreSQL
!pip install psycopg2
Commands for installing the corresponding modules for other databases can be found with a simple search.
Next, enter the following code into a cell to connect the Jupyter notebook to the database file:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
N.B.: The database linked in the cell above is in the same directory as the Jupyter notebook; hence the relative path /factbook.db on the third line. If the database were in another directory on the local machine, or hosted online, we would replace /factbook.db with the path to the .db file or the URL of the database.
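As a sketch, connection strings for a few common setups follow SQLAlchemy's URL format; the paths, usernames and database names below are placeholders, not values from this tutorial:

```python
%sql sqlite:////home/user/data/factbook.db            # SQLite file at an absolute path
%sql postgresql://user:password@localhost/mydb        # PostgreSQL via psycopg2
%sql mysql+pymysql://user:password@localhost/mydb     # MySQL via pymysql
%sql oracle+cx_oracle://user:password@localhost/mydb  # Oracle via cx_Oracle
```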
Running Queries
Now that we have connected our notebook to the database, we can run any standard query in any of our notebook cells.
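Under the hood, ipython-sql hands our queries to SQLAlchemy and the database driver. As a plain-Python sketch of the same idea, here is the standard-library sqlite3 module running a query against an in-memory database (the table and rows are made up for illustration):

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("Albania", 3029278), ("Algeria", 39542166), ("Brazil", 204259812)],
)

# Any standard SQL query can be run against the connection.
rows = conn.execute(
    "SELECT name FROM facts WHERE name LIKE 'A%' ORDER BY name"
).fetchall()
print(rows)  # [('Albania',), ('Algeria',)]
```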
For a single-line query, prefix the query with %sql.
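For instance, assuming the Factbook database has a table named facts, a one-line count query in a notebook cell might look like this (the table name here is illustrative):

```python
%sql SELECT COUNT(*) FROM facts;
```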
However, most SQL queries are written as multi-line code blocks, for aesthetic and practical reasons. To mark an entire cell as a SQL code block, start the cell with %%sql. To list all the tables in an SQLite database, we can query the built-in sqlite_master table; executing such a cell prints the result as a table in the notebook.
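A sketch of a table-listing cell, using SQLite's built-in sqlite_master catalog:

```python
%%sql
SELECT name, type
  FROM sqlite_master
 WHERE type = 'table';
```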
Storing the Output of a Query in a Variable
One advantage of ipython-sql, and hence of Jupyter Notebooks, is that you can store the result of a query in a Python variable.
To do that, assign a single-line query to the variable in a Jupyter cell. The result is stored in the variable and can be manipulated like any other Python object. For example, querying this database for the number of countries whose names start with the letter A returns 19.
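As a sketch, assuming a facts table with a name column, the assignment might look like this:

```python
# The result of the single-line query is captured in a Python variable.
countries_a = %sql SELECT COUNT(*) FROM facts WHERE name LIKE 'A%';
print(countries_a)
```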
Converting the Output of a SQL Query into a Pandas DataFrame
To convert the result of a query to a Pandas DataFrame, call the .DataFrame() method on the query result. And voilà, you have the result of your query as a DataFrame, opening the result up to further analysis. Calling type() on the converted result confirms that it is a Pandas DataFrame.
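A minimal sketch, again assuming a facts table with name and population columns:

```python
result = %sql SELECT name, population FROM facts;
df = result.DataFrame()  # convert the query result to a Pandas DataFrame
print(type(df))          # pandas.core.frame.DataFrame
```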
Applying the Pandas sort_values() method then reorders the rows of the DataFrame.
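Once the result is a DataFrame, sorting works as with any other Pandas data; the figures below are stand-ins rather than actual Factbook values:

```python
import pandas as pd

# Stand-in rows; in practice these would come from the query result.
df = pd.DataFrame({
    "name": ["Albania", "Algeria", "Brazil"],
    "population": [3029278, 39542166, 204259812],
})

# Largest population first.
ranked = df.sort_values("population", ascending=False)
print(ranked["name"].tolist())  # ['Brazil', 'Algeria', 'Albania']
```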
Quick Plotting
You can quickly plot the result of a query, without first converting it to a Pandas DataFrame, if you have matplotlib installed. Methods available on a query result include .plot(), .pie() and .bar().
For example, .bar() can be called directly on a SQL query result to produce a bar chart of the result set.
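A sketch of such a cell, assuming the facts table used above and matplotlib installed:

```python
result = %sql SELECT name, population FROM facts LIMIT 5;
result.bar()  # bar chart of population by country name
```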
Conclusion
I hope I have been able to show you how to use Jupyter Notebooks to write SQL queries and some possibilities that open up when you use Jupyter Notebooks as a SQL IDE.
Let me know if you have any questions in the comments.
Thanks for reading.