All Things Open 2016 - Easy Time Series Analysis

This GitHub repository contains instructions and code to reproduce the work demonstrated in my All Things Open 2016 Presentation: Easy Time Series Analysis with NoSQL, Python, Pandas and Jupyter (https://allthingsopen.org/talk/easy-time-series-analysis-with-nosql-python-pandas-jupyter/).

To run the examples contained in this repo you will need to:

  1. Download and install Riak TS (http://docs.basho.com/riak/ts/);
  2. Have Python installed on your local computer;
  3. Install the Riak Python client using either Easy Install (easy_install riak) or Pip (pip install riak) - Note: Please use Version 2.5.5 or later of the Python client;
  4. Install Jupyter Notebook (http://jupyter.org/);
  5. Install Pandas (http://pandas.pydata.org/);
  6. Install matplotlib (http://matplotlib.org/);
  7. Clone this repo to your local machine;
  8. Download the Bay Area Bike Share Year 2 data (http://www.bayareabikeshare.com/open-data) used in the example code [1];

Once the prerequisites are installed and set up in your development environment, create the table that we will use to read and write our data:

  1. Start Riak TS (from the command line navigate to your Riak TS root directory and execute the following command: bin/riak start);
  2. Run the Create_Trip_Table.py script to create the table that will store our trip data (python Create_Trip_Table.py); a sketch of the CREATE TABLE call it issues appears after the schema output below;
  3. Launch the Riak TS shell from the command line: bin/riak-shell
  4. Run the following commands within riak-shell to list the tables in your Riak TS database and output the new table's schema:
riak-shell(1)>SHOW TABLES;
+---------------+
|     Table     |
+---------------+
|Bike_Share_Trip|
+---------------+


riak-shell(2)>DESCRIBE Bike_Share_Trip;
+--------------+---------+-------+-----------+---------+--------+----+
|    Column    |  Type   |Is Null|Primary Key|Local Key|Interval|Unit|
+--------------+---------+-------+-----------+---------+--------+----+
|   trip_id    | sint64  | false |           |         |        |    |
|   duration   | sint64  | false |           |         |        |    |
|  start_date  |timestamp| false |     1     |    1    |   7    | d  |
|start_station | varchar | false |           |         |        |    |
|start_terminal| sint64  | false |           |         |        |    |
|   end_date   |timestamp| false |           |         |        |    |
| end_station  | varchar | false |           |         |        |    |
| end_terminal | sint64  | false |           |         |        |    |
|   bike_no    | sint64  | false |           |    2    |        |    |
+--------------+---------+-------+-----------+---------+--------+----+
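For reference, the heart of a table-creation script like Create_Trip_Table.py is a single CREATE TABLE statement submitted through the Python client's ts_query method; the 7-day quantum on start_date matches the Interval/Unit columns in the DESCRIBE output above. The sketch below is illustrative rather than a copy of the repo's script, and assumes Riak TS is listening on localhost at the default protocol buffers port 8087:

# Illustrative sketch only; the repo's Create_Trip_Table.py may differ in detail.
from riak import RiakClient

client = RiakClient(host='127.0.0.1', pb_port=8087)

create_table = """
CREATE TABLE Bike_Share_Trip (
    trip_id        SINT64    NOT NULL,
    duration       SINT64    NOT NULL,
    start_date     TIMESTAMP NOT NULL,
    start_station  VARCHAR   NOT NULL,
    start_terminal SINT64    NOT NULL,
    end_date       TIMESTAMP NOT NULL,
    end_station    VARCHAR   NOT NULL,
    end_terminal   SINT64    NOT NULL,
    bike_no        SINT64    NOT NULL,
    PRIMARY KEY ((QUANTUM(start_date, 7, 'd')), start_date, bike_no)
)
"""

# DDL is submitted the same way as any other Riak TS query
client.ts_query('Bike_Share_Trip', create_table)
print('Bike_Share_Trip table created')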

In the next series of steps we will load the Bay Area Bike Share data into our newly created table:

  1. Place the 201508_trip_data.csv file in the directory where your Python scripts are located (or update the file's location on line 42 of Write_Trip_Data.py);
  2. Run Write_Trip_Data.py (python Write_Trip_Data.py) to import the file's records into your newly created table. When the script completes, the final line should read Total Records: 354151; on my MacBook Air the import took roughly 80 seconds, which works out to about 4,427 records per second. A sketch of the write path appears after the query output below;
  3. Using riak-shell you can verify that the records have been written by running the following SELECT statement:
riak-shell(3)>SELECT * FROM Bike_Share_Trip WHERE start_date > '2014-09-01 10:00:00' AND start_date < '2014-09-01 10:30:00';
+-------+--------+--------------------+--------------------------------+--------------+--------------------+------------------------------+------------+-------+
|trip_id|duration|     start_date     |         start_station          |start_terminal|      end_date      |         end_station          |end_terminal|bike_no|
+-------+--------+--------------------+--------------------------------+--------------+--------------------+------------------------------+------------+-------+
|433020 |  130   |2014-09-01T10:02:00Z|        Clay at Battery         |      41      |2014-09-01T10:04:00Z|       Davis at Jackson       |     42     |  109  |
|433022 |  461   |2014-09-01T10:04:00Z|         Market at 10th         |      67      |2014-09-01T10:12:00Z|        Market at 4th         |     76     |  438  |
|433021 |  461   |2014-09-01T10:04:00Z|         Market at 10th         |      67      |2014-09-01T10:12:00Z|        Market at 4th         |     76     |  498  |
|433024 |  708   |2014-09-01T10:05:00Z|      Golden Gate at Polk       |      59      |2014-09-01T10:17:00Z|      Steuart at Market       |     74     |  100  |
|433025 |  1631  |2014-09-01T10:17:00Z|Grant Avenue at Columbus Avenue |      73      |2014-09-01T10:44:00Z|    Embarcadero at Sansome    |     60     |  416  |
|433026 |  1631  |2014-09-01T10:17:00Z|Grant Avenue at Columbus Avenue |      73      |2014-09-01T10:44:00Z|    Embarcadero at Sansome    |     60     |  549  |
|433027 |  109   |2014-09-01T10:22:00Z|     Embarcadero at Bryant      |      54      |2014-09-01T10:23:00Z|       Spear at Folsom        |     49     |  468  |
|433029 |  333   |2014-09-01T10:29:00Z|Castro Street and El Camino Real|      32      |2014-09-01T10:35:00Z|Mountain View Caltrain Station|     28     |  17   |
+-------+--------+--------------------+--------------------------------+--------------+--------------------+------------------------------+------------+-------+
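Behind the scenes, the import script batches rows and writes them with the Python client's time series API (client.table(...).new(rows).store()). The following is only a sketch of that write path, assuming the first nine columns of the Year 2 trip CSV are trip id, duration, start date, start station, start terminal, end date, end station, end terminal, and bike number; the repo's Write_Trip_Data.py may parse the file differently:

# Sketch of the write path only; not a copy of Write_Trip_Data.py.
import csv
from datetime import datetime
from riak import RiakClient

client = RiakClient(host='127.0.0.1', pb_port=8087)
table = client.table('Bike_Share_Trip')

def parse_date(value):
    # Bay Area Bike Share dates look like "9/1/2014 10:02"; adjust if your file differs
    return datetime.strptime(value, '%m/%d/%Y %H:%M')

rows = []
with open('201508_trip_data.csv') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    for rec in reader:
        # column order must match the Bike_Share_Trip schema
        rows.append([int(rec[0]), int(rec[1]), parse_date(rec[2]), rec[3],
                     int(rec[4]), parse_date(rec[5]), rec[6], int(rec[7]),
                     int(rec[8])])
        if len(rows) == 100:          # write in batches of 100 rows
            table.new(rows).store()
            rows = []

if rows:                              # flush any remaining rows
    table.new(rows).store()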

Now that you have loaded the data (and verified that it is there), you can move on to using Jupyter Notebook, Python, Pandas, and Matplotlib to analyze it. To start Jupyter:

  1. Within your shell navigate to the ATO2016 directory (e.g. cd ~/git/ATO2016 if that is where you cloned/copied the files to on your machine)
  2. Type jupyter notebook and hit enter

When Jupyter has finished starting up, it will open a web browser tab/window listing the files in your ATO2016 directory, including the following notebooks that demonstrate data analysis with Python, Pandas, and Matplotlib:

Step 1. Basic Pandas and Matplotlib.ipynb

Demonstrates the basics of querying Riak TS with the Python Riak client, basic Pandas functionality for analyzing your data, and basic Matplotlib functionality for visualizing your data.
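The pattern the notebook follows is roughly: run a SELECT against Riak TS, wrap the returned rows in a Pandas DataFrame, and plot from there. A condensed sketch (not the notebook itself; the column names are taken from the table schema shown earlier):

# Condensed sketch of the query-to-DataFrame-to-plot flow; not the notebook itself.
import matplotlib.pyplot as plt
import pandas as pd
from riak import RiakClient

client = RiakClient(host='127.0.0.1', pb_port=8087)

query = """
SELECT * FROM Bike_Share_Trip
WHERE start_date > '2014-09-01 10:00:00'
  AND start_date < '2014-09-01 10:30:00'
"""

result = client.ts_query('Bike_Share_Trip', query)

# result.rows is a list of row tuples; column names follow the table schema
columns = ['trip_id', 'duration', 'start_date', 'start_station', 'start_terminal',
           'end_date', 'end_station', 'end_terminal', 'bike_no']
df = pd.DataFrame(result.rows, columns=columns)

print(df.describe())                # quick summary statistics (e.g. trip duration)
df['duration'].plot(kind='hist')    # simple Matplotlib histogram via Pandas
plt.show()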

Step 2. Rides By Date, Day, and Bike in September 2014.ipynb

Demonstrates a few more features of Pandas that are useful for data analysis including the date library, value_counts, and sort_values while showing patterns of rides by time of day, day of the week, and bike. We also demonstrate how you can accomplish similar results using Riak TS's Group By feature.
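As a flavor of what this notebook covers, one way to get ride counts by day of the week with Pandas is sketched below. It continues from a DataFrame like the one built in the previous sketch and assumes start_date comes back as epoch milliseconds (the client can also be configured to return datetime objects):

# Illustrative only; assumes df was built as in the sketch under Step 1 and that
# start_date is stored as epoch milliseconds in the DataFrame.
import pandas as pd

df['weekday'] = pd.to_datetime(df['start_date'], unit='ms').dt.strftime('%A')
rides_by_weekday = df['weekday'].value_counts()        # rides per day name
print(rides_by_weekday.sort_values(ascending=False))   # busiest days first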

Step 3. Rides by Station in September 2014.ipynb

This notebook shows how to analyze bike share station usage patterns using Riak TS's Group By with basic output formatting provided by Pandas for readability.
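For comparison, a station-level breakdown can also be pushed down to Riak TS itself with Group By (on a Riak TS release that supports it). The query below is illustrative, not necessarily the one used in the notebook, and covers a single week to stay well within the default limits on how many quanta one query may span:

# Illustrative Group By query pushed down to Riak TS; widen the date range as your
# cluster's query limits allow.
import pandas as pd
from riak import RiakClient

client = RiakClient(host='127.0.0.1', pb_port=8087)

group_query = """
SELECT start_station, COUNT(*) FROM Bike_Share_Trip
WHERE start_date >= '2014-09-01 00:00:00'
  AND start_date <  '2014-09-08 00:00:00'
GROUP BY start_station
"""

result = client.ts_query('Bike_Share_Trip', group_query)
stations = pd.DataFrame(result.rows, columns=['start_station', 'rides'])
print(stations.sort_values('rides', ascending=False).to_string(index=False))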

Notes

[1] It is always best to get data right from the source: http://www.bayareabikeshare.com/open-data
