Bikes! Part 0

Some months ago I discovered that there are a number of bike shares out there that make their data publicly available, and I’ve been meaning to download some of it and poke around. Well today I finally had some time. And though I’m not sure what I’ll be doing with this data set yet, I wanted to share a figure I made in my initial exploration.

The following figure shows the number of rides per day and the median ride distance for the Portland, OR bike share (data from BikeTown: https://www.biketownpdx.com/system-data). I threw the code in a new GitHub repo here so you can take a look if interested, but I’m not going to go into detail yet (the code just downloads their system data files, concatenates the monthly files and does some minimal processing and plotting). In any case, the figure:

The neat (and maybe unsurprising) feature is the strong seasonality to bike share usage, both in terms of just the number of rides per day (high in summer, low in winter) and the median distance of each ride (longer rides in summer). There is an interesting spike in total rides around May 2018 — maybe excitement for springtime? or additional bikes added to the program? A plot of bike usage percent (total rides / available bikes) might be more illustrative.
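For the curious, here’s a minimal sketch of that kind of processing with pandas (not the actual repo code; the data directory and the StartDate and Distance_Miles column names are placeholders, so check the headers in the downloaded CSVs):

import glob
import pandas as pd
import matplotlib.pyplot as plt

# rough sketch (not the repo's code): concatenate the monthly BikeTown files,
# then compute rides per day and median ride distance
# ('StartDate' and 'Distance_Miles' are placeholder column names -- check the CSV headers)
files = sorted(glob.glob('../data/biketown/*.csv'))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

df['date'] = pd.to_datetime(df['StartDate']).dt.date
rides_per_day = df.groupby('date').size()
median_dist = df.groupby('date')['Distance_Miles'].median()

# two-panel figure: total rides per day and median ride distance
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(rides_per_day.index, rides_per_day.values)
ax1.set_ylabel('rides per day')
ax2.plot(median_dist.index, median_dist.values)
ax2.set_ylabel('median distance [mi]')
plt.show()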

So that’s that for now. Hopefully it won’t be too long before I have time for some more in-depth analysis.


Shapely Polygons: Coloring Shapefile Polygons

In my previous two posts, I showed how to (1) read and plot shapefile geometries using the pyshp library and (2) plot polygons using shapely and descartes. So the obvious next step is to combine the two! And that’s what I’ll cover today, again using my learning_shapefiles GitHub repo along with the shapefile of state boundaries from census.gov.

The Final Map

In case you don’t care about the Python and are just curious about the end product, here’s the final map where the color of each state reflects its total land area:

shapefile_us_colored_by_area_sat

It’s kind of neat to see the gradient of state size from east to west, reflecting the historical westward expansion of the U.S. Other than that, there’s not much to the map, but it does serve as a simple case for learning to manipulate shapefiles.

The Code

There are two scripts in learning_shapefiles/src of relevance for today’s post: basic_readshp_plotpoly.py and read_shp_and_rcrd.py. The first script is a simple combination of basic_read_plot.py and simple_polygons.py (from my previous two posts), plotting the shapefile geometries using polygons instead of lines, so let’s start there.

basic_readshp_plotpoly.py

The code starts out the same as basic_read_plot.py, but now also imports Polygon and PolygonPatch from shapely and descartes, before reading in the shapefile:

import shapefile
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from descartes.patch import PolygonPatch

"""
 IMPORT THE SHAPEFILE 
"""
shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)

The next part of the code plots a single geometry from the shapefile. This is super easy because shapefile.Reader reads a shapefile geometry as a list of points, which is exactly what the Polygon function needs. So we can just give that list of points directly to the Polygon function:

plt.figure()
ax = plt.axes()
ax.set_aspect('equal')

shape_ex = sf.shape(5) # could break if selected shape has multiple polygons. 

# build the polygon from exterior points
polygon = Polygon(shape_ex.points)
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[0,0,0], alpha=0.7, zorder=2)
ax.add_patch(patch)

# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])
plt.ylim(shape_ex.bbox[1],shape_ex.bbox[3])

And we get Washington, now as a colored polygon rather than an outline:

shapefile_single

Woo!

And as before, we can now loop over each shape (and each part of each shape), construct a polygon and plot it:

""" PLOTS ALL SHAPES AND PARTS """
plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')

icolor = 1
for shape in list(sf.iterShapes()):

    # define polygon fill color (facecolor) RGB values:
    R = (float(icolor)-1.0)/52.0
    G = 0
    B = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
       polygon = Polygon(shape.points)
       patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
       ax.add_patch(patch)

    else: # loop over parts of each shape, plot separately
      for ip in range(nparts): # loop over parts, plot separately
          i0=shape.parts[ip]
          if ip < nparts-1:
             i1 = shape.parts[ip+1]-1
          else:
             i1 = len(shape.points)

          polygon = Polygon(shape.points[i0:i1+1])
          patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
          ax.add_patch(patch)

    icolor = icolor + 1

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

In order to distinguish each polygon, I set each shape’s color based on how many shapes have already been plotted:

R = (float(icolor)-1.0)/52.0

This scales the red value of the RGB tuple between 0 and 1 (since there are 52 shapes), and the result is used in the facecolor argument of PolygonPatch. The coloring is simply a function of the order in which the shapes are accessed:

shapefile_us

The goal, however, is to color each polygon by some sort of data so that we can actually learn something interesting, and that is exactly what read_shp_and_rcrd.py does.

read_shp_and_rcrd.py

Up to now, we’ve only considered the shape geometry, but that is only one part of a shapefile. Also included in most shapefiles are the records, or the data, associated with each shape. When a shapefile is imported,

shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)

The resulting shapefile object (sf in this case) contains the records associated with each shape. I wasn’t sure what fields were included for the State Boundary shapefile from census.gov, so I opened up a Python shell in the terminal, read in the shapefile, and typed

>>> sf.fields

to get a list of available fields:

[('DeletionFlag', 'C', 1, 0), ['STATEFP', 'C', 2, 0], ['STATENS', 'C', 8, 0], ['AFFGEOID', 'C', 11, 0], ['GEOID', 'C', 2, 0], ['STUSPS', 'C', 2, 0], ['NAME', 'C', 100, 0], ['LSAD', 'C', 2, 0], ['ALAND', 'N', 14, 0], ['AWATER', 'N', 14, 0]]

Down towards the end, there’s an interesting entry

['ALAND', 'N', 14, 0]

Though I couldn’t find any documentation on the included fields, I suspected ALAND stood for land area (especially since it was followed by AWATER). So in read_shp_and_rcrd.py, the first thing I do is extract the field names and find the index corresponding to the land area:

""" Find max/min of record of interest (for scaling the facecolor)"""

# get list of field names, pull out appropriate index
# fieldnames of interest: ALAND, AWATER are land and water area, respectively
fld = sf.fields[1:]
field_names = [field[0] for field in fld]
fld_name='ALAND'
fld_ndx=field_names.index(fld_name)

I found this post helpful for extracting the fieldnames of each record.

Next, I loop over the records using iterRecords() to find the minimum and maximum land area in order to scale the polygon colors:

# loop over records, track global min/max
maxrec=-9999
minrec=1e21
for rec in sf.iterRecords():
    if rec[4] != 'AK': # exclude alaska so the scale isn't skewed
       maxrec=np.max((maxrec,rec[fld_ndx]))
       minrec=np.min((minrec,rec[fld_ndx]))

maxrec=maxrec/1.0 # upper saturation limit

print fld_name,'min:',minrec,'max:',maxrec

I excluded Alaska (if rec[4] != ‘AK’:) so that the color scale wouldn’t be thrown off, and then I also scale the maximum (maxrec=maxrec/1.0) to adjust the color scale manually (more on this later).

Now that I know the max/min, I loop over each shape and (1) calculate the RGB value for each polygon using a linear scale between the max and min and then (2) plot a polygon for each shape (and all the parts of a shape) using that RGB value:

for shapeRec in sf.iterShapeRecords():
    # pull out shape geometry and records
    shape = shapeRec.shape
    rec = shapeRec.record

    # select polygon facecolor RGB vals based on record value
    if rec[4] != 'AK':
        R = 1
        G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
        G = G * (G<=1) + 1.0 * (G>1.0)
        B = 0
    else:
        R = 0
        B = 0
        G = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
        polygon = Polygon(shape.points)
        patch = PolygonPatch(polygon, facecolor=[R,G,B], edgecolor=[0,0,0], alpha=1.0, zorder=2)
        ax.add_patch(patch)
    else: # loop over parts of each shape, plot separately
        for ip in range(nparts):
            i0 = shape.parts[ip]
            if ip < nparts-1:
                i1 = shape.parts[ip+1]-1
            else:
                i1 = len(shape.points)

            # build the polygon and add it to plot
            polygon = Polygon(shape.points[i0:i1+1])
            patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
            ax.add_patch(patch)

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

One important thing not to miss is that on the first line, I loop over the iterShapeRecords iterable rather than using iterShapes. This is necessary so that I have access to both the shape geometry and the associated record, rather than just the shapes (iterShapes) or just the records (iterRecords).

Running the above code will produce the following map:

shapefile_us_colored_by_area

Because Texas is so much larger than the rest of the states, we don’t see much of a difference between the other states. But we can adjust this by decreasing the max value used in the scaling. So after finding the max/min value, I set

maxrec=maxrec/2.0 # upper saturation limit

and end up with the following map that brings out more of the variation in the states’ land area (same map as in the very beginning of this post):

shapefile_us_colored_by_area_sat

Note that because I decreased the max value used for scaling, I had to ensure that the RGB value did not exceed 1, which is why I included the following lines limiting the green value (G):

    if rec[4] != 'AK':
         R = 1
         G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
         G = G * (G<=1) + 1.0 * (G>1.0)

So that’s about it! That’s how you can read in a shapefile and plot polygons of each shape colored by some data (record) associated with each shape. There are plenty of more sophisticated ways to do this exercise, and I’ll be looking into some other shapefile Python libraries for upcoming posts.
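As one example, following up on the “could use MultiPolygon class of shapely?” comment in the code above, the per-part bookkeeping could be handled by building one shapely geometry per record. Here’s a rough sketch (not part of the repo; it assumes shape, ax and the R, G, B values from the plotting loop, and still plots each part separately since I’m not sure PolygonPatch accepts a MultiPolygon directly):

from shapely.geometry import MultiPolygon

# inside the shape loop: build one shapely geometry per shapefile record
# (assumes shape, ax and R, G, B are defined by the surrounding loop)
offsets = list(shape.parts) + [len(shape.points)]
polys = [Polygon(shape.points[offsets[i]:offsets[i+1]]) for i in range(len(shape.parts))]
geom = polys[0] if len(polys) == 1 else MultiPolygon(polys)

# plot each component polygon of the geometry
for poly in getattr(geom, 'geoms', [geom]):
    ax.add_patch(PolygonPatch(poly, facecolor=[R, G, B], alpha=1.0, zorder=2))

Wrapping the parts in a MultiPolygon keeps each state as a single geometry, which is handy if you later want shapely operations like area or bounds on the whole record.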

Taxi Tracers: Speed and Hysteresis of NYC Cabs

In my previous post, I outlined how I took the NYC yellow cab trip record data and made some pretty maps of Manhattan. Given how many variables are tracked, there are many possible avenues of further exploration. It turns out that the average speed of taxis holds some interesting trends, and in this post I’ll describe a few of them (code available at my GitHub page).

I started off by plotting average speed against time of day, hoping to see some trends related to rush hour. I took three months of yellow cab data (January, February and March 2016) and calculated the average speed of every ride using the pick-up time, drop-off time and distance traveled. I then calculated a 2D histogram of average speeds vs time of day:

speedhist2d

I also calculated the median speed (solid curve) and the first/third quartiles (dashed curves) at each time of day. As expected, rush hour is evident in the rapid drop in speed from 6 to 10 am. I didn’t expect such a large peak in the early hours (around 5 am), and I also found it interesting that after the work day, speeds only gradually increase back to the maximum.
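For reference, here’s a rough sketch of how that calculation might look (not the exact code from my repo; it assumes the 2016 yellow cab CSV naming and the tpep_pickup_datetime, tpep_dropoff_datetime and trip_distance columns, and uses coarse filters for unphysical rides):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# rough sketch (not the repo's code): per-ride average speed and a 2-D histogram
df = pd.read_csv('yellow_tripdata_2016-01.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

dur_hr = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds().values/3600.0
hour = df['tpep_pickup_datetime'].dt.hour.values + df['tpep_pickup_datetime'].dt.minute.values/60.0
dist = df['trip_distance'].values

# keep rides with sensible durations and speeds
good = dur_hr > 1.0/60.0            # rides longer than one minute
speed = dist[good]/dur_hr[good]     # average speed of each ride [mph]
hour = hour[good]
ok = speed < 60.0                   # drop unphysical speeds

# 2-D histogram of speed vs time of day, plus the median speed in each time bin
H, t_edges, v_edges = np.histogram2d(hour[ok], speed[ok], bins=[48, 50])
t_mid = 0.5*(t_edges[:-1] + t_edges[1:])
med = [np.median(speed[ok][(hour[ok] >= t0) & (hour[ok] < t1)])
       for t0, t1 in zip(t_edges[:-1], t_edges[1:])]

plt.pcolormesh(t_edges, v_edges, H.T)
plt.plot(t_mid, med, 'k')
plt.xlabel('time of day [hr]')
plt.ylabel('average speed [mph]')
plt.show()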

Next, I took the 2-D histogram, defined three representative time ranges and calculated 1-D histograms of the speed for those time ranges (left plot, below):

1dhistograms

These distributions are highly skewed, to varying degrees. The 5-6 time range (blue curve) sees a much wider distribution of speeds compared to the work day (9-18, green curve) and evening (20-24, red curve). Because of the skewness, I used the median rather than the mean in the 2-D histogram plot. Similarly, the standard deviation is not a good representation for skewed data, and so I plotted the first and third quartiles (the dashed lines in the 2-D plot).

The difference between the 1st and 3rd quartiles (the interquartile range, or midspread) is a good measure of how the distribution of average cab speeds changes (plotted above, right; if the data were normally distributed, this would just be a plot of the standard deviation). Not only are taxis going faster on average around 5-6 am, but there’s a much wider variation in taxi speeds. Traffic speeds during the work day are much more uniform.

After playing with these plots for a bit, I got to thinking about how the speed of a taxi is only indirectly related to time of day! A taxi’s speed should depend on how much traffic is on the road (as well as the speed limit, I suppose…); the time of day only reflects trends in when humans decide to get on the road. So if I instead plot the average speed vs the number of hired cabs, I get quite an interesting result:

speed_hyst_3_mo_a

Generally, the average speed decreases with increased number of cabs (increases with decreased cab count), but the changes are by no means linear. It’s easiest to read this plot by picking a starting time: let’s start at 6am, up at the top. At 6am, the average speed with about 1000 hired cabs on the road is about 23 mph. From 6am to 9am, the number of cabs increases, causing traffic to slow to about 12 mph. From 9am to 7pm, things get complicated (see the zoomed inset below), with the number of cabs and average speed bouncing around through the workday until 7pm, when the cab count drops and average speed rises in response. Eventually, the loop closes on itself, forming a sort of representative daily traffic cycle for Manhattan.

speed_hyst_3_mo_zoom

When I first looked at this, it immediately reminded me of hysteresis loops from all my thermodynamics classes. And so I started googling “Traffic Hysteresis” and it turns out there’s a whole field of study devoted to understanding hysteresis loops in traffic patterns in order to better design systems of roadways that can accommodate fluctuations in traffic volume (e.g., Zhang 1999, Geroliminis and Sun 2011, Saberi and Mahmassani 2012). One of the important differences between those studies and the brief exploration I’ve done here is that the above studies have counts of total traffic. The calculations here only consider the taxis floating in the larger stream of traffic. While these taxi tracers are still a good measure of the average flow of traffic, the number of hired cabs at a given time is only a proxy for total traffic. A lot of the complicated behavior during the workday (9am-7pm) likely arises from changes in traffic volume not accounted for here, like delivery trucks, commuters or buses. So in some ways it’s remarkable that I got a hysteresis loop at all!

The Code

I described most of the methodology for reading and manipulating the NYC yellow cab data in my previous post. The two main additions to the code are a new class in taxi_plotmod.py (binned_variable) and an algorithm for finding the number of hired taxis (find_N_unique_vs_t of taxi_plotmod.py). The binned_variable class returns a number of useful values including median, first quartile, third quartile and binned values. The method find_N_unique_vs_t loops over a given time range and counts the number of taxis that are hired during that time.
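To illustrate the counting step, here’s a simplified sketch of the idea (not the actual find_N_unique_vs_t implementation): at each time-of-day bin, count the rides that are in progress during that bin:

import numpy as np

def count_hired(pickup_hr, dropoff_hr, bin_edges):
    """simplified sketch: count rides in progress in each time-of-day bin

    pickup_hr, dropoff_hr: arrays of pick-up/drop-off times in fractional hours
    bin_edges: edges of the time-of-day bins, e.g. np.linspace(0, 24, 49)
    """
    counts = np.zeros(len(bin_edges) - 1)
    for i, (t0, t1) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # a cab is "hired" in this bin if its ride overlaps [t0, t1)
        counts[i] = np.sum((pickup_hr < t1) & (dropoff_hr >= t0))
    return counts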

A few useful tips learned during the process:

PyTip! Calculating the first and third quartiles. This was quite simple using numpy’s where function. The first quartile is defined as the median of all values lying below the median, so for an array of values (var2bin), the first quartile at a given time of day (designated by the i_bin index) can be calculated with:

firstquart[i_bin]=np.median(var2bin[np.where(var2bin<self.med[i_bin])]) 

The where command finds all the indices with values less than the median at a given time of day, so var2bin[np.where(var2bin<self.med[i_bin])] selects all values that are less than the median (self.med[i_bin]), from which a new median is calculated.
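As an aside, numpy’s percentile function can produce similar quartile estimates in one call. A quick sketch (assuming var2bin holds the values for the current time bin, and with thirdquart as a hypothetical array for the third quartile; note that the default interpolation differs slightly from the median-of-the-lower-half definition used above):

# alternative sketch: quartiles via np.percentile (interpolation differs slightly)
firstquart[i_bin] = np.percentile(var2bin, 25)
thirdquart[i_bin] = np.percentile(var2bin, 75)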

PyTip! Converting to datetime from numpy’s datetime64. So at one point, I found myself needing to know the number of days between two dates. I had been importing the date of each taxi ride as a numpy datetime64 type to make selecting given date ranges with conditional statements easier. Both datetime and numpy’s datetime64 allow you to simply subtract two dates. The following two commands will give the number of days between two dates (assuming numpy has been imported as np and datetime as dt):

a=dt.datetime(2016,4,3)-dt.datetime(2015,3,1)
b=np.datetime64('2016-04-03')-np.datetime64('2015-03-01')

Both methods will give 399 days as the result, but I needed an integer value. The datetime64 subtraction returned a weird type of class (a numpy timedelta64) that I couldn’t figure out how to convert to an integer value easily. The easiest solution I found was to use the .astype method available with the datetime64 class:

date_end=np.datetime64('2016-04-03')
date_start=np.datetime64('2015-03-01')
Ndates=date_end.astype(dt.datetime)-date_start.astype(dt.datetime)

So what’s happening here? Well, date_end and date_start are both numpy datetime64 variables. date_end.astype(dt.datetime) takes the end date and casts it as a normal datetime value. So I do that for both the start and end dates, take the difference, and I’m left with Ndates as a datetime timedelta, from which you can conveniently get an integer day count using

Ndates.days

So that’s that!

Stay tuned for either more taxi analysis or some fun with shapefiles!