A primer on map projections in Python
Ok, so after my post a little while ago about converting contours to shapefiles, I spent a day adapting the approach for use in a dash-plotly app… and in general it works, taking a ~800 MB file down to a ~14 MB shapefile. But rendering the shapefile was still very slooooooow, and due to limitations in how data callbacks update a figure’s data in Dash, I couldn’t just render the background map once. So, after some thinking, I realized that if I “simply” calculated my own map projections, I could use the standard x-y plotting routines (which include contours, as well as the faster scattergl method for plotting a large number of points).
But then I realized that I only vaguely know what map projections are: I knew that different projections vary in how they portray the 3D earth on a 2D surface, but when you start using all the mapping libraries out there, you run into a lot of jargon really fast. So here’s a super basic primer on calculating map projections in Python that explains some of that jargon!
Also, this is the first post I decided to write as a Python notebook. The notebook is in my github learning_shapefiles repo, but github has been having some issues displaying notebooks lately, so you can view the full notebook here.
PyProj
To start off, pyproj is a Python library built on the PROJ library (https://proj.org/), which provides a large number of algorithms related to map projections. So go install that (simple to do with pip or conda).
In order to use pyproj to project a lat/lon point onto a map projection, it’s helpful to know a few acronyms:
- CRS: coordinate reference system
- EPSG: a database of coordinate reference systems and transformations; different projections covering different regional, local, or global domains are specified by the number following “EPSG”. For example, EPSG:4326 is the global coordinate reference system used by modern-day GPS.
- WGS84: World Geodetic System, latest revision (a.k.a. WGS1984), the datum underlying EPSG:4326
- reference ellipse: an ellipse describing the surface of the Earth on which positions (like latitude and longitude) are defined. GPS points use a reference ellipse that approximates the Earth’s geoid (i.e., the equipotential surface that sea level would follow under gravity alone, see wiki).
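As a quick aside, newer versions of pyproj (2.0+) let you look up any of these EPSG codes directly; here’s a minimal sketch, assuming you have pyproj 2+ installed:

from pyproj import CRS

crs = CRS.from_epsg(4326)  # WGS84, the CRS used by GPS
print(crs.name)            # 'WGS 84'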
So now that that’s out of the way… to project a single lat/lon point with pyproj, first import pyproj, then initialize a projection (Proj) object:
from pyproj import Proj

p = Proj(proj='hammer', ellps='WGS84')
When initializing the Proj (projection) object, you give it the reference ellipse that your lat/lon is defined on, ellps='WGS84', and the projection you want to transform to, in this case a Hammer projection. Once you’ve initialized Proj, you can use it to move from lat/lon to cartesian x,y:
lon = -120.5
lat = 42.4
x, y = p(lon, lat)
This lon/lat point becomes (x,y) = (-9894366.0792, 5203184.81636). These values are in meters by default and give the cartesian x-y distance of your point from the map center.
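The same Proj object also works in reverse: pass inverse=True to go from projected x,y back to lon/lat, which makes for a handy sanity check:

# continuing from above: invert the projection as a sanity check
lon2, lat2 = p(x, y, inverse=True)  # recovers roughly (-120.5, 42.4)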
The Proj object is flexible enough to take lists (and numpy arrays!), so you can project many points at once. It can also take any of the parameters defined for the PROJ projections as keyword arguments. For example, the Hammer projection has an argument ‘+lon_0’ to set the central longitude of the view, and so in pyproj you can use ‘lon_0’ as a keyword argument when using the Hammer projection:
p = Proj(proj='hammer', ellps='WGS84', lon_0=-50)
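So, projecting a whole line of constant latitude in a single call looks something like this (a sketch using numpy arrays with the Proj object from above):

import numpy as np

lons = np.linspace(-180, 180, 361)  # longitudes every 1 degree
lats = np.full(lons.shape, 45.0)    # a line of constant latitude
x, y = p(lons, lats)                # all 361 points projected at once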
The rest of the notebook goes through how you could construct a grid of lat/lon lines and plot them for different projections. And at the end, I take a digital elevation model of the western US (which gives an elevation for lat/lon points), project the lat/lon points onto a hammer projection to get points of (elevation,x,y) and contour it using the standard matplotlib.pyplot.contourf to get the following:
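If you just want the skeleton of that last step without digging through the notebook, here’s a minimal sketch, with a synthetic elevation grid standing in for the real DEM:

import numpy as np
import matplotlib.pyplot as plt
from pyproj import Proj

p = Proj(proj='hammer', ellps='WGS84')

# synthetic stand-in for the DEM: elevation defined on a lat/lon grid
lons, lats = np.meshgrid(np.linspace(-125, -100, 200), np.linspace(30, 50, 160))
elev = np.sin(np.radians(lats) * 6) * np.cos(np.radians(lons) * 4)

x, y = p(lons, lats)  # project every grid point to cartesian x,y
plt.contourf(x, y, elev, 50)
plt.gca().set_aspect('equal')
plt.show()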
In my particular application, I’ll now be able to project my data to x,y and then use the cartesian contour functions of dash-plotly! The main drawback of this approach is having to write all the routines for handling rotations and plotting lines on maps… but ultimately I think it will work well with Dash, and it will be nice to be able to use scattergl with my map point data.
Bikes! Part 0
Some months ago I discovered that there are a number of bike shares out there that make their data publicly available, and I’ve been meaning to download some of it and poke around. Well today I finally had some time. And though I’m not sure what I’ll be doing with this data set yet, I wanted to share a figure I made in my initial exploration.
The following figure shows the number of rides per day and the median ride distance for the Portland OR bike share (data from BikeTown: https://www.biketownpdx.com/system-data). I threw the code in a new github repo here so you can take a look if interested, but I’m not going to go into detail yet (the code just downloads their system data files, concatenates the monthly files and does some minimal processing and plotting). In any case, the figure:
The neat (and maybe unsurprising) feature is the strong seasonality to bike share usage, both in terms of just the number of rides per day (high in summer, low in winter) and the median distance of each ride (longer rides in summer). There is an interesting spike in total rides around May 2018 — maybe excitement for springtime? or additional bikes added to the program? A plot of bike usage percent (total rides / available bikes) might be more illustrative.
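For reference, the daily aggregation behind a figure like this is only a few lines of pandas. A sketch, with hypothetical column names (start_time, distance_miles) standing in for whatever the BikeTown CSVs actually call them:

import pandas as pd

# hypothetical file and column names; the real system data files will differ
df = pd.read_csv('biketown_month.csv', parse_dates=['start_time'])
daily = df.groupby(df['start_time'].dt.date).agg(
    n_rides=('start_time', 'size'),         # rides per day
    med_dist=('distance_miles', 'median'),  # median ride distance
)
daily.plot(subplots=True)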
So that’s that for now. Hopefully it won’t be too long before I have time for some more in-depth analysis.
Taxi Tracers: Speed and Hysteresis of NYC Cabs
In my previous post, I outlined how I took the NYC yellow cab trip record data and made some pretty maps of Manhattan. Given how many variables are tracked, there are many possible avenues of further exploration. It turns out that the average speed of taxis holds some interesting trends and in this post I’ll describe some of these trends (code available at my GitHub page).
I started off with plotting average speed against time of day, hoping to see some trends related to rush hour. I took three months of yellow cab data (January, February and March 2016) and calculated the average speed of every ride using the pick up time, drop off time and distance traveled. I then calculated a 2D histogram of average speeds vs time of day:
I also calculated the median speed (solid curve) and the first/third quartiles (dashed curves) at each time of day. As expected, rush hour is evident in the rapid drop in speed from 6 to 10 am. I didn’t expect such a large peak in the early hours (around 5 am), and I also found it interesting that after the work day, speeds only gradually increase back to the maximum.
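Roughly, the calculation behind that figure looks like this (a sketch with synthetic stand-ins for the ride data; the real speeds come from distance divided by dropoff minus pickup time):

import numpy as np

# synthetic stand-ins: pickup time as fractional hour of day, average mph per ride
rng = np.random.default_rng(0)
hours = rng.uniform(0, 24, 100000)
speeds = rng.lognormal(mean=2.5, sigma=0.4, size=100000)

# 2D histogram of speed vs time of day (96 15-minute bins)
H, t_edges, v_edges = np.histogram2d(hours, speeds, bins=[96, 60])

# median speed within each time-of-day bin
med = [np.median(speeds[(hours >= lo) & (hours < hi)])
       for lo, hi in zip(t_edges[:-1], t_edges[1:])]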
Next, I took the 2-D histogram, defined three representative time ranges and calculated 1-D histograms of the speed for those time ranges (left plot, below):
These distributions are highly skewed, to varying degrees. The 5-6 time range (blue curve) sees a much wider distribution of speeds compared to the work day (9-18, green curve) and evening (20-24, red curve). Because of the skewness, I used the median rather than the mean in the 2-D histogram plot. Similarly, the standard deviation is not a good representation for skewed data, and so I plotted the first and third quartiles (the dashed lines in the 2-D plot).
The difference between the 1st and 3rd quartiles (the interquartile range, or midspread) is a good measure of how the distribution of average cab speeds changes (plotted above, right; if the data were normally distributed, this would just be a plot of the standard deviation). Not only are taxis going faster on average around 5-6 am, but there’s a much wider variation in taxi speeds. Traffic speeds during the work day are much more uniform.
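As an aside, numpy will hand you quartiles directly if you don’t mind the percentile definition (which can differ slightly from the median-of-the-halves definition I use below):

import numpy as np

speeds_in_bin = np.array([8., 10., 12., 15., 22., 30.])  # hypothetical speeds for one time bin
q1, q3 = np.percentile(speeds_in_bin, [25, 75])
midspread = q3 - q1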
After playing with these plots for a bit, I got to thinking about how the speed of a taxi is only indirectly related to time of day! A taxi’s speed should depend on how much traffic is on the road (as well as the speed limit I suppose…), the time of day only reflects trends of when humans decide to get on the road. So if I instead plot the average speed vs the number of hired cabs, I get quite an interesting result:
Generally, the average speed decreases as the number of hired cabs increases (and vice versa), but the relationship is by no means linear. It’s easiest to read this plot by picking a starting time: let’s start at 6am, up at the top. At 6am, the average speed with about 1000 hired cabs on the road is about 23 mph. From 6am to 9am, the number of cabs increases, causing traffic to slow to about 12 mph. From 9am to 7pm, things get complicated (see the zoomed inset below), with the number of cabs and average speed bouncing around through the workday until 7pm, when the cab count drops and the average speed rises in response. Eventually, the loop closes on itself, forming a sort of representative daily traffic cycle for Manhattan.
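Plotting a loop like this is simple once you have the binned values: just plot median speed against cab count in time order and label some of the points. A sketch with made-up arrays in place of the real binned data:

import numpy as np
import matplotlib.pyplot as plt

# made-up stand-ins for the per-time-bin values (96 15-minute bins)
t = np.linspace(0, 24, 96)
n_cabs = 4000 + 3000 * np.sin((t - 15) * np.pi / 12)
med_speed = 18 - 5 * np.sin((t - 16) * np.pi / 12)

plt.plot(n_cabs, med_speed, '-o', markersize=3)
for i in range(0, len(t), 8):  # label every two hours
    plt.annotate('%dh' % t[i], (n_cabs[i], med_speed[i]))
plt.xlabel('number of hired cabs')
plt.ylabel('median speed (mph)')
plt.show()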
When I first looked at this loop, it immediately reminded me of hysteresis loops from all my thermodynamics classes. And so I started googling “Traffic Hysteresis” and it turns out there’s a whole field of study devoted to understanding hysteresis loops in traffic patterns in order to better design systems of roadways that can accommodate fluctuations in traffic volume (e.g., Zhang 1999, Geroliminis and Sun 2011, Saberi and Mahmassani 2012). One of the important differences between those studies and the brief exploration I’ve done here is that the above studies have counts of total traffic. The calculations here only consider the taxis floating in the larger stream of traffic. While these taxi tracers are still a good measure of the average flow of traffic, the number of hired cabs at a given time is only a proxy for total traffic. A lot of the complicated behavior during the workday (9am-7pm) likely arises from changes in traffic volume not accounted for here, like delivery trucks, commuters or buses. So in some ways it’s remarkable that I got a hysteresis loop at all!
The Code
I described most of the methodology for reading and manipulating the NYC yellow cab data in my previous post. The two main additions to the code are a new class in taxi_plotmod.py (binned_variable) and an algorithm for finding the number of hired taxis (find_N_unique_vs_t of taxi_plotmod.py). The binned_variable class returns a number of useful values including median, first quartile, third quartile and binned values. The method find_N_unique_vs_t loops over a given time range and counts the number of taxis that are hired during that time.
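The core of that counting is just an interval check: at each time, how many rides have picked up but not yet dropped off? Something like this sketch (not the actual implementation in the repo):

import numpy as np

# hypothetical ride start/end times, in fractional hours of the day
rng = np.random.default_rng(1)
pickup = rng.uniform(0, 24, 10000)
dropoff = pickup + rng.uniform(0.05, 0.5, 10000)

times = np.linspace(0, 24, 96)
n_hired = [np.sum((pickup <= t) & (dropoff > t)) for t in times]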
A few useful tips learned during the process:
PyTip! Calculating the first and third quartiles. This was quite simple using numpy’s where function. The first quartile is defined as the median of all values lying below the median, so for an array of values (var2bin), the first quartile at a given time of day (designated by the i_bin index) can be calculated with:
firstquart[i_bin] = np.median(var2bin[np.where(var2bin < self.med[i_bin])])
The where command finds all the indices with values less than the median at a given time of day, so var2bin[np.where(var2bin < self.med[i_bin])] selects all values that are less than the median (self.med[i_bin]), from which a new median is calculated.
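And presumably the third quartile is just the mirror image, the median of everything above the median:

# my guess at the counterpart line, not copied from the repo:
thirdquart[i_bin] = np.median(var2bin[np.where(var2bin > self.med[i_bin])])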
PyTip! Converting to datetime from numpy’s datetime64. So at one point, I found myself needing to know the number of days between two dates. I had been importing the date of each taxi ride as a numpy datetime64 type to make selecting given date ranges with conditional statements easier. Both datetime and numpy’s datetime64 allow you to simply subtract two dates. The following two commands will give the number of days between two dates (assuming numpy has been imported as np and datetime as dt):
a = dt.datetime(2016, 4, 3) - dt.datetime(2015, 3, 1)
b = np.datetime64('2016-04-03') - np.datetime64('2015-03-01')
Both methods will give 399 days as the result, but I needed an integer value. The datetime64 subtraction returned a numpy timedelta64, which I couldn’t figure out how to convert to an integer value easily. The easiest solution I found was to use the .astype method available with the datetime64 class:
date_end = np.datetime64('2016-04-03')
date_start = np.datetime64('2015-03-01')
Ndates = date_end.astype(dt.datetime) - date_start.astype(dt.datetime)
So what’s happening here? Well, date_end and date_start are both numpy datetime64 variables. date_end.astype(dt.datetime) takes the end date and casts it as a normal datetime value. So I do that for both the start and end date, take the difference, and I’m left with Ndates as a datetime timedelta, from which you can conveniently get an integer day count using
Ndates.days
So that’s that!
Stay tuned for either more taxi analysis or some fun with shapefiles!