Got data? Python and SQLite Can Help

Note: This post continues on with context established in a previous post, which can be found here. The TL;DR result of that post is that I am now able to scrape real-time transit data from Calgary Transit’s website.

Working in an academic setting, I am exposed to a large amount of data. More to the point, however, is that most of this time this data is poorly organized, sometimes with thousands of folders and text files that are ambiguously labelled with names like “Run 1” or “Day 4”. While this organization has meaning to those involved in its creation, I feel like a large amount of useful data is never seen again because it’s hard to understand or decipher it’s meaning.

This is why I am a huge fan of database design, even for relatively simple projects. With a little bit of practice, databases can be set up very easily and quickly, and save you a lot of time in the long run. If you can create a well structured database for your data, you have in fact inserted extra information. The structure of a database contains clues as to the meaning of the data itself, and as an object oriented programmer, structured databases represent real world objects. A secondary advantage of this is that you can easily translate that data structure into back-end code.

Database Design

Before I get into the details, I’m going to give you my basic philosophy on database design. This is based purely of of experience creating and working with all kinds of databases, and does not come from any official literature (though the books might say the same thing, I don’t know). Take my advice with a grain of salt, though I would argue that having some sort of organizational philosophy is already a great improvement over starting from scratch. With all that said, here are a few of my “tenets of database design”:

  1. Data represents real world events and objects. Therefore, you database should reflect an understanding of where one object ends and another begins.
  2. Never store a piece of information more than once.
  3. Make all information expandable.

The first in the list is perhaps the most difficult one to understand, unless you are already familiar with the concepts of objects in a software design sense, so let’s motivate it with an example. Say you are collecting data about the operation of a transit system (for simplicity let’s say a bus network). Take a moment and think about the different “objects” that might exist on such a network, what their characteristics are, and how they might be related. Here’s the list I came up with (I’m sure there are many more – sometimes there’s data just not worth collecting):

  • Objects (and some characteristics): Route (name, length, direction), Stop (number, location, type), Trip (start time)
  • Relationships: Routes have many stops, stops may serve many routes. Trips happen on one route, and serve many stops.

The language introduced is important: key words like “many” and “one” are important in developing relationships, and a good database will understand those relationships and even force them to exist.

Once you have thought that far, you’ve come a long way. You can start making tables, with columns that represent attributes and links. Go-between tables will help with many-to-one and many-to-many relationships. In fear of running on too long, I am going to finish with a brief explanation of the other two.

The second point is pretty straightforward. If you store a piece of information in one table, don’t store it in another. You might think you have to, but you don’t. If a piece of information seems to fit in two tables, then those two objects are related, and as long as you have a relationship you can access the the data across those tables with the magic of SQL. The last point is something that I find makes things a little more work sometimes, but can have huge payoffs. If you have an object with a fixed list of characteristics (a car with a list of colours, for example), make a separate table for that list and create a link for it. That way, you have consistent naming of the colours, you can add colours if you need, and you can add additional attributes (maybe a hex code for the colour) later without having to insert a column into a much larger table (a table of all the cars produced in a factory for example). In that way, you’ve made your information expandable.

Python and Sqlite3

Okay – now on to the real stuff. We have our transit data that we have gathered from scraping a website (outlined in this post), and we want to compare it with static data provided by Calgary Transit. This data is actually a dump from an SQLite database, so this provides a good way of reverse engineering a database from the text files it outputs. I won’t go into detail, but you’ll need to have a SQLite database design program (just try Google), many of which let you import tables from CSV files directly. You can also get the SQLite database file I constructed from my Bitbucket repository. From the website, I was able to get a “trip id”, which I now need to look up in the database:

SELECT departure_time, trip_id FROM stop_time
WHERE trip_id LIKE "28668930%"
AND stop_id = "5472"

The “LIKE” command allows me to match the first part of a string (since the trip ID provided in the static data contains more information about the type of trip – an good example of storing data twice when it’s not needed). This information produces a unique result from which I’ve extracted the departure time and the trip id. The essential Python that surrounds this is:

import sqlite3
trip_id = "28668930"
stop_id = "5472"
with sqlite3.connect("transit_info.db") as conn:
    c = conn.cursor()
    c.execute('''SELECT departure_time, trip_id FROM stop_time
    WHERE trip_id LIKE ?
    AND stop_id = ?''', [trip_id+"%", stop_id])
    result = c.fetchone()
    print("Departure time: {}".format(result[0]))

The cursor’s execute function takes two parameters: The SQL code, and a list of variables to parse into the code in place of the question marks. The “fetchone” function grabs the first result (fetchall grabs a list) and you can see each column by accessing the correct index of the result. The rest is up to you!

Leave a Reply

Your email address will not be published. Required fields are marked *