In my previous post, I outlined how I took the NYC yellow cab trip record data and made some pretty maps of Manhattan. Given how many variables are tracked, there are many possible avenues of further exploration. It turns out that the average speed of taxis holds some interesting trends and in this post I’ll describe some of these trends (code available at my GitHub page).
I started off with plotting average speed against time of day, hoping to see some trends related to rush hour. I took three months of yellow cab data (January, February and March 2016) and calculated the average speed of every ride using the pick up time, drop off time and distance traveled. I then calculated a 2D histogram of average speeds vs time of day:
I also calculated the median speed (solid curve) and the first/third quartiles (dashed curve) at each time of day. As expected, rush hour is evident in the rapid drop in speed from 6-10. I didn’t expect such a large peak in the early hours (around 5) and I also found it interesting that after the work day, speeds only gradually increase back to the maximum.
Next, I took the 2-D histogram, defined three representative time ranges and calculated 1-D histograms of the speed for those time ranges (left plot, below):
These distributions are highly skewed, to varying degrees. The 5-6 time range (blue curve) sees a much wider distribution of speeds compared to the work day (9-18, green curve) and evening (20-24, red curve). Because of the skewness, I used the median rather than the mean in the 2-D histogram plot. Similarly, the standard deviation is not a good representation for skewed data, and so I plotted the first and third quartiles (the dashed lines in the 2-D plot).
The difference between the 1st and 3rd quartile (the interquartile range, or midspread), is a good measure of how the distribution of average cab speeds changes (plotted above, right, if the data were normally distributed, this plot would just be a plot of the standard deviation). Not only are taxis going faster on average around 5-6 am, but there’s a much wider variation in taxi speeds. Traffic speeds during the work day are much more uniform.
After playing with these plots for a bit, I got to thinking about how the speed of a taxi is only indirectly related to time of day! A taxi’s speed should depend on how much traffic is on the road (as well as the speed limit I suppose…), the time of day only reflects trends of when humans decide to get on the road. So if I instead plot the average speed vs the number of hired cabs, I get quite an interesting result:
Generally, the average speed decreases with increased number of cabs (increases with decreased cab count), but the changes are by no means linear. It’s easiest to read this plot by picking a starting time: let’s start at 6am, up at the top. At 6am, the average speed with about 1000 hired cabs on the road is about 23 mph. From 6am to 9am, the number of cabs increases, causing traffic to slow to about 12 mph. From 9am to 7pm, things get complicated (see the zoomed inset below), with the number of cabs and average speed bouncing around through the workday until 7pm, when the cab count drops and average speed rises in response. Eventually, the loop closes on itself, forming a sort of representative daily traffic cycle for Manhattan.
When I first looked at this, it immediately reminded me of hysteresis loops from all my thermodynamics classes. And so I started googling “Traffic Hysteresis” and it turns out there’s a whole field of study devoted to understanding hysteresis loops in traffic patterns in order to better design systems of roadways that can accommodate fluctuations in traffic volume (e.g., Zhang 1999, Geroliminis and Sun 2011, Saberi and Mahmassani 2012 ). One of important differences between those studies and the brief exploration I’ve done here is that the above studies have counts of total traffic. The calculations here only consider the taxis floating in the larger stream of traffic. While these taxi tracers are still a good measure of average flow of traffic, the number of hired cabs at a given time is only a proxy for total traffic. A lot of the complicated behavior during the workday (9am-7pm) likely arises from changes in traffic volume not accounted for like delivery trucks, commuters or buses. So in some ways it’s remarkable that I got a hysteresis loop at all!
I described most of the methodology for reading and manipulating the NYC yellow cab data in my previous post. The two main additions to the code are a new class in taxi_plotmod.py (binned_variable) and an algorithm for finding the number of hired taxis (find_N_unique_vs_t of taxi_plotmod.py). The binned_variable class returns a number of useful values including median, first quartile, third quartile and binned values. The method find_N_unique_vs_t loops over a given time range and counts the number of taxis that are hired during that time.
A few useful tips learned during the process:
PyTip! Calculating the first and third quartiles. This was quite simple using the numpy’s where method. The first quartile is defined as the median of all values lying below the median, so for an array of values (var2bin), the first quartile at a given time of day (designated by the i_bin index) can be calculated with:
The where command finds all the indeces with values less than the median at a given time of day, so var2bin[np.where(var2bin<self.med[i_bin])] selects all values that are less than the median (self.med[i_bin]) from which a new median is calculated.
PyTip! Converting to datetime from numpy’s datetime64. So at one point, I found myself needing to know the number of days between two dates. I had been importing the date of each taxi ride as a numpy datetime64 type to make selecting given date ranges with conditional statements easier. Both datetime and numpy’s datetime64 allow you to simply subtract two dates. The following two commands will give the number of days between two dates (assuming numpy has been imported as np and datetime as dt):
Both methods will give 399 days as the result, but I needed an integer value. The datetime64 method returned a weird I-don’t-know-what type of class that I couldn’t figure out how to convert to an integer value easily. The easiest solution I found was to use the .astype method available with the datetime64 class:
date_end=np.datetime64('2016-04-03') date_start=np.datetime('2015-03-01') Ndates=date_end.astype(dt.datetime)-date_start.astype(dt.datetime)
Sp what’s happening here? Well date_end and date_start are both numpy datetime64 variables. date_end.astype(dt.datetime) takes the end date and casts it as a normal datetime value. So I do that for both the start and end date, take the difference and I’m left with Ndates as a datetime timedelta, which you can conveniently get an integer date range using
So that’s that!
Stay tuned for either more taxi analysis or some fun with shapefiles!