Simple Transit Reliability Tracking with Python

A Rainy Day Project

Inspired by this video on learning Python via public-data hacking, I decided to create a simple data-access program to periodically check up on the reliability of a transit system.

It just so happened that around the same time, Calgary Transit released their new website, which provides GPS-powered real-time updates on bus arrivals. What better way to learn than to dive into a brand-new website and try to pick it apart?

The ultimate goal I started with was this: figure out a way to extract live GPS data for buses arriving at a stop, and compare the real-time data against the schedule.

There are two major components to this project:

  1. Access the web content and extract some sort of meaning from it
  2. Store and manage the extracted data alongside the static schedule data

Lucky for us, Python provides all the tools necessary to make this happen. To make things more interesting, I decided to manage all of the data in SQLite databases. This would let me play around with Python’s sqlite3 package, something I’ve done before but would like more practice with.
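As a quick taste of what that looks like, here is a minimal sqlite3 sketch; the table name and columns here are placeholders I made up for illustration, not the actual schema used in the repository:

import sqlite3

# Minimal sqlite3 sketch. The 'arrivals' table and its columns are
# illustrative placeholders, not the real schema.
conn = sqlite3.connect('results.db')
conn.execute('''CREATE TABLE IF NOT EXISTS arrivals (
                    stop_id INTEGER,
                    trip_id TEXT,
                    arrival_time TEXT)''')
conn.execute('INSERT INTO arrivals VALUES (?, ?, ?)',
             (4782, '56789', '2015-06-01 14:32:00'))
conn.commit()
conn.close()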

Code Repository

Instead of building the entire program in front of you, I am going to show snippets of the tools I used (mostly because I think there are better ways to go about some of the things I did) and focus on the important parts. You can find the entire code repository here if you like, complete with the data I have collected from my scraping passes so far. The key files are transit_info.db, a nicely organized database built from the raw schedule files found on the Calgary Transit developer page, and webscrape.py, which contains the meat of the processing software. The “results” database holds the data collected so far, while results.py contains the classes I used for creating entries in it.

Okay! Let’s get started.

Understanding the Website

The first step was figuring out how to access web pages containing information for different stops. After looking around the website on my phone, I noticed that I could get all the information I needed by providing a simple query string: “stop_id=X”. For example,

http://www.calgarytransit.com/nextride?stop_id=4782

provides information about stop 4782. The static data mentioned above includes a list of stop numbers, so all I had to do was check every stop number via that query string and I would have a consistent set of web pages.

Python’s urllib is great for pulling website data. I’m using Python 3 here (where the request machinery lives in urllib.request), so yours might look a bit different.

import urllib.request

stop_id = 4782
url = 'http://www.calgarytransit.com/nextride?stop_id={}'.format(stop_id)
with urllib.request.urlopen(url) as u:
    data = u.read()   # raw HTML bytes for the stop page
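Since the plan is to hit every stop number, some requests will inevitably fail or time out. Here is a sketch of how the fetch could be wrapped defensively; fetch_stop_page is a hypothetical helper name, not something from the repository:

import urllib.error
import urllib.request

def fetch_stop_page(stop_id):
    # Hypothetical helper: return the raw HTML for a stop page,
    # or None if the request fails.
    url = 'http://www.calgarytransit.com/nextride?stop_id={}'.format(stop_id)
    try:
        with urllib.request.urlopen(url, timeout=10) as u:
            return u.read()
    except OSError:   # urllib.error.URLError is a subclass of OSError
        return None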

The tricky part is parsing all that HTML code. Thankfully there’s Beautiful Soup, which does all the parsing for us:

from bs4 import BeautifulSoup

# Naming the parser explicitly avoids a warning in newer versions.
soup = BeautifulSoup(data, 'html.parser')

After inspecting the HTML for a while, I found a consistent set of tags, which Beautiful Soup can locate easily so the data can be extracted:

from datetime import datetime

trip_items = soup.body.findAll('div', attrs={'class': 'trip-item'})
trips = [
    [i['data-trip_id'],
     i['data-position'],
     datetime.strptime(i.find('span')['data-pretty-date'],
                       "%b %d %Y - %H:%M:%S")]
    for i in trip_items
]

Easy enough, right?
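Putting the pieces together, a single polling pass over a few stops might look something like the sketch below. The stop IDs are arbitrary examples, and fetch_stop_page is the hypothetical helper from earlier:

all_trips = []
for stop_id in [4782, 4783, 4784]:     # arbitrary example stop IDs
    data = fetch_stop_page(stop_id)    # hypothetical helper from above
    if data is None:
        continue                       # skip stops that failed to load
    soup = BeautifulSoup(data, 'html.parser')
    trip_items = soup.body.findAll('div', attrs={'class': 'trip-item'})
    for i in trip_items:
        all_trips.append([stop_id,
                          i['data-trip_id'],
                          i['data-position'],
                          datetime.strptime(i.find('span')['data-pretty-date'],
                                            "%b %d %Y - %H:%M:%S")])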

I think that’s enough for one post. The SQLite management is another topic entirely, and I will link to the conclusion of this project once I have it set up.

Cheers!

Edit: Here’s the continuation.
