Python: Pandas → MySQL using SQLAlchemy. A.k.a SQLAlchemy for Pandas users who don’t know SQL (the brave and the foolhardy)

Ok, so figuring out SQL (i.e. being able to connect anything I’m doing in Python to an SQL database) has been high on my list of priorities for a while. Thankfully, I finally found a need to figure this out recently, which drove me to learn it quickly. Below is a ‘silly’ example I’ve used in an attempt to make this knowledge as accessible as possible for those who have experience with Pandas and DataFrames and understand the concept of merging/joining datasets, but who have limited or no exposure to relational databases.

I saw my first glimpse of SQLAlchemy at PyGotham in August. Mike Bayer gave a great presentation on it. One of the points he explicitly mentioned was that he did not recommend learning SQLAlchemy if you didn’t already know SQL. Ok, well that makes a lot of sense, especially since a lot of the syntax of SQLAlchemy is SQL-like. Nonetheless, I decided to completely ignore his advice and do both at the same time.

This post covers the process of creating an SQL database from Python using SQLAlchemy, but doesn’t cover any information on using the database after it has been created (from Python or otherwise).

A ‘silly’ example

  1. Creating a MySQL database
  2. Some silly data (in a DataFrame)
  3. Designing the structure of the relational database (still using DataFrames)
  4. Define the structure of the database using SQLAlchemy
  5. Pandas → MySQL

1. Creating a MySQL database

I’m using the Homebrew-installed version of MySQL. Homebrew is a package manager for macOS that installs packages/libraries/etc. alongside Apple’s installers. If typing brew into your terminal doesn’t return an error, you’re set. Otherwise, run this command:

$ ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

If you receive an SSL certificate error:

$ ruby -e "$(curl --insecure -fsSL https://raw.github.com/mxcl/homebrew/go)"

Then:

$ brew install mysql

Now MySQL should be ready to use. Start MySQL:

$ mysql.server start

Starting MySQL
. SUCCESS!

Next, reset the root password:

$ mysqladmin -u root password 'NEW_PASSWORD'

Next open a connection to the server (I’m connecting to a local server on my machine). You will be prompted for the new password you just created:

$ mysql -h localhost -u root -p

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.6.19 Homebrew

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

Now you can create an empty database named parrot_db using this command:

mysql> CREATE DATABASE parrot_db;

Query OK, 1 row affected (0.00 sec)

Success! The database now exists. We will structure the database and add data to it from Python using SQLAlchemy. Exit MySQL (note that the server will still be running so that we can access the database from Python):

mysql> quit;

2. Some silly data (in a DataFrame)

Suppose you had the following simple dataset, containing data from the Dead Parrot Sketch (stored in a pandas.DataFrame naturally):


>>> print parrot_df
   Featured in dead parrot sketch      Actor Name
0                               1  Graham Chapman
1                               0       Eric Idle
2                               0   Terry Gilliam
3                               1     Terry Jones
4                               1     John Cleese
5                               1   Michael Palin
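
The post doesn’t show how parrot_df was built, but for reference, here’s one way such a DataFrame could be constructed (a sketch; the column names are taken from the printed output above):

```python
import pandas as pd

# Rebuild the silly dataset by hand; the column order matches the printout above.
parrot_df = pd.DataFrame({
    'Featured in dead parrot sketch': [1, 0, 0, 1, 1, 1],
    'Actor Name': ['Graham Chapman', 'Eric Idle', 'Terry Gilliam',
                   'Terry Jones', 'John Cleese', 'Michael Palin'],
}, columns=['Featured in dead parrot sketch', 'Actor Name'])
```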

Presently we have a dataset that contains a unique entry for every actor in Monty Python, meaning that each value in the DataFrame index uniquely identifies an actor. We could then think of the index as the actor_id, so let’s name it that:


>>> parrot_df.index.name = 'actor_id'
>>> print parrot_df
          Featured in dead parrot sketch      Actor Name
actor_id
0                                      1  Graham Chapman
1                                      0       Eric Idle
2                                      0   Terry Gilliam
3                                      1     Terry Jones
4                                      1     John Cleese
5                                      1   Michael Palin

Now suppose we wanted to add two more columns to the dataset, the number of characters played by each actor in the sketch and the name of each character. Then we might end up with a dataset like this (depending on your opinion of the transformative properties of the moustache):


          num characters played in sketch      Actor Name                        Character  Featured in dead parrot sketch
actor_id
0                                       1  Graham Chapman              no-nonsense colonel                               1
1                                       0       Eric Idle                             None                               0
2                                       0   Terry Gilliam                             None                               0
3                                       1     Terry Jones  Station attendant/Brain Surgeon                               1
4                                       1     John Cleese              Mr Praline/Customer                               1
5                                       2   Michael Palin              Original Shopkeeper                               1
5                                       2   Michael Palin     Shopkeeper in Bolton/Ipswich                               1

3. Structuring the relational database

Now we have all the data we need in the dataset, but the index no longer identifies a row; it still identifies each actor, as it did before the new data was added. Actors playing more than one character appear more than once, meaning we now have repeated information in the table. We can remove the redundancy by transforming the dataset into a relational database: we create two tables, the first with information pertaining to each actor and the second with information pertaining to each character:


>>> print parrot_actor_df
              Actor Name  Featured in dead parrot sketch  num characters played in sketch
actor_id
0         Graham Chapman                               1                                1
1              Eric Idle                               0                                0
2          Terry Gilliam                               0                                0
3            Terry Jones                               1                                1
4            John Cleese                               1                                1
5          Michael Palin                               1                                2

>>> print parrot_char_df
              Actor Name                        Character
char_id
0         Graham Chapman              no-nonsense colonel
1            Terry Jones  Station attendant/Brain Surgeon
2            John Cleese              Mr Praline/Customer
3          Michael Palin              Original Shopkeeper
4          Michael Palin     Shopkeeper in Bolton/Ipswich

Now each index of the two DataFrames uniquely defines an actor and a character respectively. We’ve moved from 35 entries in the original dataset to 28 in the two DataFrames. If at any time we wanted to merge the data from the two tables, we could do so using the Actor Name, but string variables don’t make for very robust merge keys. So instead of using the Actor Name variable to connect the tables, we can use the actor_id variable we created that uniquely identifies each actor:

>>> print parrot_actor_df
              Actor Name  Featured in dead parrot sketch  num characters played in sketch
actor_id
0         Graham Chapman                               1                                1
1              Eric Idle                               0                                0
2          Terry Gilliam                               0                                0
3            Terry Jones                               1                                1
4            John Cleese                               1                                1
5          Michael Palin                               1                                2

>>> print parrot_char_df
         actor_id                        Character
char_id
0               0              no-nonsense colonel
1               3  Station attendant/Brain Surgeon
2               4              Mr Praline/Customer
3               5              Original Shopkeeper
4               5     Shopkeeper in Bolton/Ipswich
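
The split above can be sketched in pandas like so. This is only one way to do it: the combined DataFrame is rebuilt inline here so the snippet stands alone, and drop_duplicates/dropna stand in for however you’d actually derive the tables from your own data:

```python
import pandas as pd

# The combined dataset from Section 2 (rebuilt by hand for this sketch).
combined = pd.DataFrame({
    'actor_id': [0, 1, 2, 3, 4, 5, 5],
    'Actor Name': ['Graham Chapman', 'Eric Idle', 'Terry Gilliam',
                   'Terry Jones', 'John Cleese', 'Michael Palin', 'Michael Palin'],
    'Featured in dead parrot sketch': [1, 0, 0, 1, 1, 1, 1],
    'num characters played in sketch': [1, 0, 0, 1, 1, 2, 2],
    'Character': ['no-nonsense colonel', None, None,
                  'Station attendant/Brain Surgeon', 'Mr Praline/Customer',
                  'Original Shopkeeper', 'Shopkeeper in Bolton/Ipswich'],
})

# One row per actor: drop the per-character column, de-duplicate, index by actor_id.
parrot_actor_df = (combined
                   .drop(columns=['Character'])
                   .drop_duplicates()
                   .set_index('actor_id'))

# One row per character: keep actors who played someone, with a fresh char_id index.
parrot_char_df = (combined[['actor_id', 'Character']]
                  .dropna(subset=['Character'])
                  .reset_index(drop=True))
parrot_char_df.index.name = 'char_id'
```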

We have now essentially created a relational database; we just need to move the data into a MySQL database. But first, some essential terminology. The actor_id is known as the primary key of the actor table and a foreign key of the character table. The primary key, parrot_actor_df.actor_id, must always uniquely identify the rows of the actor table, while the foreign key, parrot_char_df.actor_id, may have duplicate entries in the character table and can be used to merge the two tables together.
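
To see the foreign key doing its job, here’s a small sketch of merging the two tables back together on actor_id (the frames are re-created inline, trimmed to the relevant columns, so the snippet is self-contained):

```python
import pandas as pd

# Actor table, indexed by its primary key (only actors with characters, for brevity).
parrot_actor_df = pd.DataFrame(
    {'Actor Name': ['Graham Chapman', 'Terry Jones', 'John Cleese', 'Michael Palin']},
    index=pd.Index([0, 3, 4, 5], name='actor_id'))

# Character table, carrying actor_id as a foreign key (duplicates allowed).
parrot_char_df = pd.DataFrame({
    'actor_id': [0, 3, 4, 5, 5],
    'Character': ['no-nonsense colonel', 'Station attendant/Brain Surgeon',
                  'Mr Praline/Customer', 'Original Shopkeeper',
                  'Shopkeeper in Bolton/Ipswich']})

# Merge on the key: each character row picks up its actor's name.
merged = parrot_char_df.merge(parrot_actor_df, left_on='actor_id', right_index=True)
```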

NOTE: For larger problems, figuring out exactly how to structure the data in the relational tables and actually manipulating the original dataset can be a lot of work. Moreover, it’s very challenging to alter the structure of an SQL database after it has been created and put into use, so this step is worth taking your time over.

4. Define the structure of your database using SQLAlchemy

It’s probably a good idea to read up on SQLAlchemy a little before starting, which you can do here. I’m going to dive right in. Let’s start with the Engine, here’s what SQLAlchemy says:

Engine Configuration

The Engine is the starting point for any SQLAlchemy application. It’s “home base” for the actual database and its DBAPI, delivered to the SQLAlchemy application through a connection pool and a Dialect, which describes how to talk to a specific kind of database/DBAPI combination.

The general structure can be illustrated as follows:

[diagram: an Engine referencing both a Dialect and a Pool, sitting on top of the DBAPI and the database]

Where above, an Engine references both a Dialect and a Pool, which together interpret the DBAPI’s module functions as well as the behavior of the database.

Creating an engine is just a matter of issuing a single call, create_engine():

from sqlalchemy import create_engine
engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase') 

Put simply, calling create_engine() returns an Engine instance that will allow us to interact with our database from Python. To create an engine instance that is connected to a local MySQL database named parrot_db:

from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:"+'NEW_PASSWORD'+"@localhost/parrot_db")

Recall we set ‘NEW_PASSWORD’ in Section 1.

Next, we need to design the MetaData (data about the containers of data) environment of our database. Create a MetaData object that is linked to our engine using the following command:

from sqlalchemy import MetaData

meta = MetaData(bind=engine)

We know we want the database to have two tables actors and characters. We need to explicitly define the datatype of each variable in each table, as well as identify any primary or foreign keys (SQLAlchemy gives us some easy to use datatype objects TEXT and Integer which make this pretty easy):

from sqlalchemy import create_engine, MetaData, TEXT, Integer, Table, Column, ForeignKey 

## ACTOR PARROT TABLE ###
table_actors = Table('actors', meta,
    Column('actor_id', Integer, primary_key=True, autoincrement=False),
    Column('Actor Name', TEXT, nullable=True),
    Column('Featured in dead parrot sketch', TEXT, nullable=True),
    Column('num characters played in sketch', TEXT, nullable=True)
)

## CHARACTER PARROT TABLE ###
table_characters = Table('characters', meta,
    Column('Character', TEXT, nullable=True),
    Column('actor_id', Integer, ForeignKey('actors.actor_id'))
)
  • The variable names defined here must exactly match the variable names in the pandas.DataFrames created in Section 3.
  • primary_key=True marks the column as a primary key.
  • A foreign key is specified using the ForeignKey object, which lets you point at the location of the corresponding primary key.
  • nullable tells the database whether the column may accept null values; here I’m being as general as possible and allowing everything to be nullable.
  • Lastly, autoincrement is set to False since we explicitly define the primary key actor_id in the DataFrame. If we hadn’t done so, the autoincrement option could be used to create a default index. NOTE: The autoincrement option will create an index beginning at 1, rather than indexing from 0 as in Python. I don’t think it’s an issue to index from 0 in a MySQL database so I’m doing it, but I think the standard is to begin at 1.

Finally send the table data you have created to the MySQL database:

meta.create_all(engine)
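
For a self-contained check of the definitions above, the same Table/MetaData code can be run against a throwaway in-memory SQLite engine (substitute the MySQL engine from earlier to build parrot_db itself; the column lists are trimmed here just to keep the sketch short):

```python
from sqlalchemy import (MetaData, Table, Column, Integer, TEXT,
                        ForeignKey, create_engine, inspect)

# In-memory SQLite stands in for the MySQL engine so this sketch runs anywhere.
engine = create_engine('sqlite://')
meta = MetaData()

# Trimmed versions of the tables defined above.
table_actors = Table('actors', meta,
    Column('actor_id', Integer, primary_key=True, autoincrement=False),
    Column('Actor Name', TEXT, nullable=True))

table_characters = Table('characters', meta,
    Column('Character', TEXT, nullable=True),
    Column('actor_id', Integer, ForeignKey('actors.actor_id')))

# Emit the CREATE TABLE statements, then confirm both tables exist.
meta.create_all(engine)
tables = inspect(engine).get_table_names()
```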

5. Pandas → MySQL

Marvelous, our database is ready for data! This bit is pretty easy now, since pandas.DataFrame has a method called to_sql():

parrot_actor_df.to_sql('actors',engine,flavor='mysql', if_exists='append',index=True)
parrot_char_df.to_sql('characters',engine,flavor='mysql', if_exists='append',index=False)
  • Specify the name of the table to send the data to, the engine being used, and flavor='mysql' since we are working with a MySQL database.
  • The if_exists option gives you three possible ways to treat an existing table: ‘fail’, ‘replace’, or ‘append’:
    • fail: If table exists, do nothing.
    • replace: If table exists, drop it, recreate it, and insert data.
    • append: If table exists, insert data. Create if does not exist.

    It is important that you ALWAYS specify append here. The tables will ALWAYS already exist, since we created them in Section 4. We don’t want to replace the tables because we would lose the MetaData we previously specified.

  • The index option lets you specify whether the DataFrame index should be included in the MySQL table. For the actor table we need the index: it is explicitly defined as the primary key of the table, which is why we named the index actor_id in the DataFrame. For characters we do not use the DataFrame index. We could of course have used it if we wanted to: we could have named it char_id and used it as the primary key of the characters table, to connect to another table.
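
If you want a quick sanity check that the rows actually landed, the same to_sql call can be run against a throwaway in-memory SQLite engine (a sketch: newer versions of pandas dropped the flavor argument, so it is omitted here, and only a two-row frame is used):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL engine so this sketch runs anywhere.
engine = create_engine('sqlite://')

parrot_actor_df = pd.DataFrame(
    {'Actor Name': ['Graham Chapman', 'Eric Idle']},
    index=pd.Index([0, 1], name='actor_id'))

# index=True carries the named actor_id index into the table as a column.
parrot_actor_df.to_sql('actors', engine, if_exists='append', index=True)

# Read the rows straight back to confirm the insert worked.
check = pd.read_sql('SELECT * FROM actors', engine)
```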

That’s it. All the code used to create this example is saved here. Now you can access your database from MySQL and run SQL queries to your heart’s content, but how to go about doing that is another story for another time; you’ve had your code nourishment for the day.

The NEED – The SxSW PanelPicker

A friend of mine asked me to help him scrape and analyze some data from the South By Southwest (SxSW) panel picker. The panel picker is an online process that allows the SxSW community to have a voice in which sessions are scheduled at future SxSW Festivals. The community uploads proposals for panels, solo presentations, workshops, etc., and then votes on which event proposals they think are the best. Here’s an example entry. The end goal was to observe:

  • The distribution of panels across the four categories: Film, Music, Interactive and Education
  • The most frequent Speakers and companies proposing panels.
  • The most common ngrams in proposal titles and descriptions, found using Natural Language Processing (NLP), to infer the most common content of each panel suggestion.

After completing the analysis, I was able to connect the data directly to a MySQL database as illustrated in the example above and have the data be interactively queried in this beautiful front end. All the code used to scrape the data, do the analysis, and create and connect the MySQL database can be found here.

[screenshot: the SxSW PanelPicker analysis front end]
