import pandas as pd
import sqlite3
So far we've only talked about reading data from CSV files. That's a pretty common way to store data, but there are many others! Pandas can read from HTML, JSON, SQL, Excel (!!!), HDF5, Stata, and a few other things. In this chapter we'll talk about reading data from SQL databases.
You can read data from a SQL database using the pd.read_sql function. read_sql will automatically convert SQL column names to DataFrame column names.

read_sql takes 2 arguments: a SELECT statement, and a database connection object. This is great because it means you can read from any kind of SQL database -- it doesn't matter if it's MySQL, SQLite, PostgreSQL, or something else.
This example reads from a SQLite database, but any other database would work the same way.
con = sqlite3.connect("./data/weather_2012.sqlite")
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con)
df
|   | id | date_time | temp |
|---|---|---|---|
| 0 | 1 | 2012-01-01 00:00:00 | -1.8 |
| 1 | 2 | 2012-01-01 01:00:00 | -1.8 |
| 2 | 3 | 2012-01-01 02:00:00 | -1.8 |
read_sql doesn't automatically set the primary key (id) to be the index of the dataframe. You can make it do that by adding an index_col argument to read_sql.
If you've used read_csv a lot, you may have seen that it has an index_col argument as well. This one behaves the same way.
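For example, here's a quick sketch of the same idea with read_csv (the Date/Time column really is in our weather CSV; the variable name is just for illustration):

# read the weather CSV and use its Date/Time column as the index
weather_by_time = pd.read_csv('./data/weather_2012.csv', index_col='Date/Time')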
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='id')
df
| id | date_time | temp |
|---|---|---|
| 1 | 2012-01-01 00:00:00 | -1.8 |
| 2 | 2012-01-01 01:00:00 | -1.8 |
| 3 | 2012-01-01 02:00:00 | -1.8 |
If you want your dataframe to be indexed by more than one column, you can give a list of columns to index_col:
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con,
index_col=['id', 'date_time'])
df
| id | date_time | temp |
|---|---|---|
| 1 | 2012-01-01 00:00:00 | -1.8 |
| 2 | 2012-01-01 01:00:00 | -1.8 |
| 3 | 2012-01-01 02:00:00 | -1.8 |
Dataframes have a to_sql method which creates a database table from a dataframe. It works with a plain sqlite3 connection like the one we have here, or with a SQLAlchemy connection for other databases. Let's use it to move our 2012 weather data into SQL.
Under the hood, this uses the functions in pd.io.sql. There are a ton of useful functions for reading and writing various kinds of data in pd.io, and it's worth spending some time exploring them. (See the documentation!)
weather_df = pd.read_csv('./data/weather_2012.csv')
con = sqlite3.connect("./data/test_db.sqlite")
con.execute("DROP TABLE IF EXISTS weather_2012")
weather_df.to_sql("weather_2012", con)
/Users/erik/.virtualenvs/cookbook/lib/python3.8/site-packages/pandas/core/generic.py:2602: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores. sql.to_sql(
We can now read from the weather_2012 table in test_db.sqlite, and we see that we get the same data back:
con = sqlite3.connect("./data/test_db.sqlite")
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con)
df
|   | index | Date/Time | Temp (C) | Dew Point Temp (C) | Rel Hum (%) | Wind Spd (km/h) | Visibility (km) | Stn Press (kPa) | Weather |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2012-01-01 00:00:00 | -1.8 | -3.9 | 86 | 4 | 8.0 | 101.24 | Fog |
| 1 | 1 | 2012-01-01 01:00:00 | -1.8 | -3.7 | 87 | 4 | 8.0 | 101.24 | Fog |
| 2 | 2 | 2012-01-01 02:00:00 | -1.8 | -3.4 | 89 | 7 | 4.0 | 101.26 | Freezing Drizzle,Fog |
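Notice that to_sql also stored the dataframe's index as an extra index column. If you don't want that, to_sql takes an index argument. Here's a quick sketch (the weather_2012_no_index table name is made up for illustration):

# hypothetical: write a copy of the table without storing the dataframe index as a column
weather_df.to_sql("weather_2012_no_index", con, index=False)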
The nice thing about having your data in a database is that you can do arbitrary SQL queries. This is especially cool if you're more familiar with SQL than with pandas. Here's an example of sorting by the Weather column:
con = sqlite3.connect("./data/test_db.sqlite")
df = pd.read_sql("SELECT * from weather_2012 ORDER BY Weather LIMIT 3", con)
df
|   | index | Date/Time | Temp (C) | Dew Point Temp (C) | Rel Hum (%) | Wind Spd (km/h) | Visibility (km) | Stn Press (kPa) | Weather |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 2012-01-03 19:00:00 | -16.9 | -24.8 | 50 | 24 | 25.0 | 101.74 | Clear |
| 1 | 114 | 2012-01-05 18:00:00 | -7.1 | -14.4 | 56 | 11 | 25.0 | 100.71 | Clear |
| 2 | 115 | 2012-01-05 19:00:00 | -9.2 | -15.4 | 61 | 7 | 25.0 | 100.80 | Clear |
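Since it's just SQL, you can also push aggregations into the database. Here's a sketch (reusing the same connection) that counts how many hours each kind of weather shows up:

# count the hours for each Weather description, most common first
weather_counts = pd.read_sql(
    "SELECT Weather, COUNT(*) AS hours FROM weather_2012 GROUP BY Weather ORDER BY hours DESC LIMIT 5",
    con)
weather_counts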
If you have a PostgreSQL database or MySQL database, reading from it works exactly the same way as reading from a SQLite database. You create a connection using psycopg2.connect() or MySQLdb.connect(), and then use pd.read_sql("SELECT whatever from your_table", con).
Note: for these examples to work, you will need a running MySQL / PostgreSQL database, with the correct host, database name, username, and so on.

To connect to a MySQL database:
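A minimal sketch, assuming the MySQLdb driver is installed; the host, user, password, and database name are placeholders to replace with your own:

import MySQLdb

# placeholder connection details -- fill in your own
con = MySQLdb.connect(host="localhost", user="root", passwd="password", db="test")
df = pd.read_sql("SELECT whatever from your_table", con)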
To connect to a PostgreSQL database:
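And a similar sketch for PostgreSQL, assuming psycopg2 is installed; again, the connection details are placeholders:

import psycopg2

# placeholder connection details -- fill in your own
con = psycopg2.connect(host="localhost", dbname="test", user="postgres", password="password")
df = pd.read_sql("SELECT whatever from your_table", con)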