Today I found myself needing to plot the physiographic boundary of the Colorado Plateau in Python. It’s been a while since I’ve touched on shapefiles (or anything on the blog) so I figured I’d write a quick blurb on reading and plotting this particular shapefile.
What you’ll learn: reading shapefiles, merging polygon shapes in Python with shapely
The Data
The first challenge was finding the actual lat/lon coordinates defining the edge of the Colorado Plateau… it’s amazing how many papers in geology/geophysics plot the boundary but don’t actually reference where the heck they got their coordinates from. After much digging I FINALLY found a paper that actually cited their source: Hopper and Fischer 2018 [2] reference a 1946 publication by Fenneman and Johnson [1] titled “Physiographic divisions of the conterminous U. S.” and after a quick search I found the digitized data from that publication online at water.usgs.gov.
The dataset contains a large number of physiographic regions and the Colorado Plateau is subdivided into multiple regions, so the code below pulls out the regions within the Colorado Plateau and joins them into a single shape defining the full boundary. To run the code below, unpack physio_shp.zip wherever you downloaded it to and rename the folder to physio (to match expectations for the pyshp shapefile reader).
The XML metadata for the shapefile defines a province code for each region, and the Colorado Plateau sub-regions all have a value of 21. So the code (1) reads the shapefile, (2) finds the shapes with a province code of 21 and (3) combines them.
Step 1: imports, reading arguments, reading the shapefile.
shapefile is the module name for the pyshp library; otherwise this is pretty self explanatory:
import shapefile, os,sys
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from shapely.ops import cascaded_union
# read the arguments
fname=sys.argv[1] # path to physio.shp
outfolder=None
if len(sys.argv)>2:
    outfolder=sys.argv[2] # folder to store output
# read the shapefile
sf = shapefile.Reader(fname)
Step 2: Find the Colorado Plateau shapes.
The shapes are described in the records list of the shapefile object:
sf.records()
records() is a list of attributes for each shape and a single record looks like
The final value is the province code, so we just need to save off the indices for which that value is 21. It turns out the 3rd value in the record list is actually a cross-reference to a shape ID, but for some reason the indexing is offset by 2 when reading this shapefile with python. So the shape data for this shape would be accessed with:
sf.shapes()[218]
rather than 220. Not sure why it’s off by 2 (would expect it to be off by 1 due to python indexing), but in any case, my code simply records the list index as python sees it:
# find the record indices for colorado plateau (province ID = 21)
i_rec=0
recs_to_plot=[]
for rec in sf.records():
    if rec[-1]==21:
        print(rec)
        print(i_rec)
        recs_to_plot.append(i_rec)
    i_rec=i_rec+1
# plot the individual records
plt.subplot(1,2,1)
for rec in recs_to_plot:
    pts=sf.shapes()[rec].points
    lons=[pt[0] for pt in pts]
    lats=[pt[1] for pt in pts]
    plt.plot(lons,lats,'.k')
As seen above — the coordinates for the shape boundaries for a record are in
sf.shapes()[rec].points
which is a list of longitude and latitude points (which the code unpacks for plotting). This section of code will generate the following outline of the Colorado Plateau regions:
Step 3: merging shapes
This is the fun bit! What we want is just the outer boundary of the union of all the shapes. The python library shapely lets us do this very easily by creating a list of shapely Polygon objects then combining them with the cascaded_union function (renamed unary_union in newer versions of shapely):
# create a single shape for Colorado Plateau from union of sub-shapes
polies=[]
for rec in recs_to_plot:
    polies.append(Polygon(sf.shapes()[rec].points))
CP_bound=cascaded_union(polies)
# plot the exterior shape
lon,lat = CP_bound.exterior.xy
plt.subplot(1,2,2)
plt.plot(lon,lat,'.k')
and the resulting plot of just the exterior boundary:
Step 4: output the processed data
The code also exports the lat/lon points defining that exterior boundary with:
# export the final shape as a CSV of boundary points
if outfolder is not None:
    f=open(os.path.join(outfolder,'ColoradoPlateauBoundary.csv'),'w')
    f.write("lon,lat\n")
    for i,j in zip(lon,lat):
        f.write(str(i)+","+str(j)+"\n")
    f.close()
I could have written some code to save the data in a shapefile format, but for such a small amount of data I find it easier to save a CSV and just create a Polygon from the list of points as I need it. I’m actually planning to create a Polygon that will be combined with GeoPandas to find sets of points falling within the plateau (GeoPandas lets you do database joins on geospatial data, it’s awesome!).
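For example, a rough sketch of that future workflow might look like the following (this isn't part of the script above; the points dataframe is a stand-in, and older GeoPandas versions use op= rather than predicate= in sjoin):

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon, Point

# rebuild the boundary polygon from the exported CSV
bound = pd.read_csv('ColoradoPlateauBoundary.csv')
CP_poly = Polygon(list(zip(bound['lon'], bound['lat'])))
CP_gdf = gpd.GeoDataFrame(geometry=[CP_poly], crs='EPSG:4326')

# df is a stand-in for whatever point data you want to test
df = pd.DataFrame({'lon': [-110.0, -90.0], 'lat': [37.0, 40.0]})
pts = gpd.GeoDataFrame(df, geometry=[Point(xy) for xy in zip(df['lon'], df['lat'])], crs='EPSG:4326')

# spatial join: keep only the points falling within the plateau boundary
inside = gpd.sjoin(pts, CP_gdf, predicate='within')
print(inside)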
As for running the script: the first argument is the path to the downloaded and unpacked shapefile and the second argument is the location to save the CSV file (this argument is optional; no data will be saved if it's not included).
[2] Hopper, E., & Fischer, K. M. (2018), The changing face of the lithosphere-asthenosphere boundary: Imaging continental scale patterns in upper mantle structure across the contiguous U.S. with Sp converted waves. Geochemistry, Geophysics, Geosystems, 19 , 2 593 – 2 614 . https://doi.org/10. 1029/2018GC007476
In my recent coding exploits, I've downloaded lots of different shapefiles. Most shapefiles were accompanied by nice .xml documentation with information about the data and how it's stored or labeled, but a few had hardly any information at all. I knew the general content based on the description from the website where I downloaded the shapefile, but I didn't know what they had used for the record labels and I didn't know what the record values were exactly. So the past couple days I sat down and wrote a bit of code to help in unraveling a mysterious shapefile…
The program is fairly straightforward. It traverses the records of a shapefile, recording the record label (or “field names” as I refer to them in the source) and information about each record. One of the program’s methods uses the Python XML API called ElementTree to produce an xml file that you can load in a browser. Here’s a screen shot from using Firefox to view the xml file produced when running the program on the Open Street Map shapefile that I extracted via MapZen for my previous post.
In a browser, you can shrink or expand the xml attributes to get some basic information about each record: the name or label of the records, the data type and some sort of sample of the data. If the record data is an integer or float, then the sample will be the min/max values of the record while if it’s a string, it will either be a list of the unique strings in the records or just a sample of some of the strings. The OpenStreetMap shapefile contained some record values that were keywords, like the “highway” attribute in the screen shot above. While other records were strings with unique values for each shape, like the “name” attribute below:
In addition to generating an xml file, the program allows you to interactively explore a field.
When you run the program from command line (type in python inspect_shapefile.py in the src directory), it’ll ask for your input. It first asks if you want to give it a shapefile, here I said no and used the shapefile hardwired into __main__ of inspect_shapefile.py:
Do you want to enter the path to a shapefile? (Y/N) N
Using shapefile specified in __main__ :
directory: ../../learning_shapefiles/shapefiles/denver_maps/grouped_by_geometry_type/
filename: ex_QMDJXT8DzmqNh6eFiNkAuESyDNCX_osm_line
Loading shapefile ...
... shapefile loaded!
It then pulls out all the fields in the shapefile records, displays them and asks what you want to do. This is what it looks like using the OpenStreetMaps shapefile:
Shapefile has the following field names
['osm_id', 'access', 'aerialway', 'aeroway', 'amenity', 'area', 'barrier', 'bicycle',
'brand', 'bridge', 'boundary', 'building', 'covered', 'culvert', 'cutting', 'disused',
'embankment', 'foot', 'harbour', 'highway', 'historic', 'horse', 'junction', 'landuse',
'layer', 'leisure', 'lock', 'man_made', 'military', 'motorcar', 'name', 'natural',
'oneway', 'operator', 'population', 'power', 'place', 'railway', 'ref', 'religion',
'route', 'service', 'shop', 'sport', 'surface', 'toll', 'tourism', 'tower:type',
'tracktype', 'tunnel', 'water', 'waterway', 'wetland', 'width', 'wood', 'z_order',
'way_area', 'tags']
Do you want to investigate single field (single)? Generate xml
file (xml)? Or both (both)? single
Enter field name to investigate: landuse
So you can see all these different fields. I chose to look at a single field (“landuse”) and the program will then look at the “landuse” record value for each shape, record its data type and save new record values:
searching for non-empty entry for landuse ...
data type found: str
Finding unique record values for landuse
1 of 212550 shapes ( 0.0 % )
new record value:
93 of 212550 shapes ( 0.04 % )
new record value: reservoir
6782 of 212550 shapes ( 3.19 % )
new record value: residential
110432 of 212550 shapes ( 51.95 % )
new record value: grass
111094 of 212550 shapes ( 52.26 % )
new record value: construction
Completed field name inspection
---------------------------------------
Shapefile has the following field names
['osm_id', 'access', 'aerialway', 'aeroway', 'amenity', 'area',
'barrier', 'bicycle', 'brand', 'bridge', 'boundary', 'building',
'covered', 'culvert', 'cutting', 'disused', 'embankment', 'foot',
'harbour', 'highway', 'historic', 'horse', 'junction', 'landuse',
'layer', 'leisure', 'lock', 'man_made', 'military', 'motorcar',
'name', 'natural', 'oneway', 'operator', 'population', 'power',
'place', 'railway', 'ref', 'religion', 'route', 'service', 'shop',
'sport', 'surface', 'toll', 'tourism', 'tower:type', 'tracktype',
'tunnel', 'water', 'waterway', 'wetland', 'width', 'wood', 'z_order',
'way_area', 'tags']
The field name landuse is str
and has 5 unique values
Display Values? (Y/N) Y
possible values:
['', 'reservoir', 'residential', 'grass', 'construction']
As you can see from the output, there were 4 keywords (reservoir, residential, grass and construction) used to describe the ‘landuse’ field. So I could now write some code to go into a shapefile and extract only the shapes that have a ‘residential’ value for ‘landuse.’ But I couldn’t do that until I (1) knew that the landuse field existed and (2) knew the different definitions for landuse type.
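For example, a quick sketch of that extraction might look something like this (the shapefile path is a placeholder, and I'm using the same pyshp calls as elsewhere on this blog):

import shapefile

sf = shapefile.Reader('path/to/osm_line_shapefile')  # placeholder path
# skip the first entry of sf.fields, which is the DeletionFlag field
field_names = [f[0] for f in sf.fields[1:]]
i_landuse = field_names.index('landuse')

# collect the geometry of every shape tagged as residential
residential_shapes = []
for shapeRec in sf.iterShapeRecords():
    if shapeRec.record[i_landuse] == 'residential':
        residential_shapes.append(shapeRec.shape)

print('found %i residential shapes' % len(residential_shapes))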
So there it is! That’s the program. Hopefully all the shapefiles you ever download will be well-documented. But if you find one that’s not and you really need to figure it out, this little tool might help!
Some code notes and tips
The xml file that I create didn’t follow any particular standard or convention, just what I thought might be useful. Perhaps that could be improved?
REMEMBER THAT IN PYTHON, YOU NEED TO EXPLICITLY COPY LISTS! I stupidly forgot that when you make a list and then want to make a copy of it, if you do this:
list_b = list_a
Then any changes to list_b will change list_a. But if you do
list_b = list_a[:]
You’ll get a new copy that won’t reference back to list_a. This is probably one of the things that I forget most frequently with Python lists. Palm-smack-to-forehead.
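A quick illustration of the gotcha (separate from inspect_shapefile.py):

list_a = [1, 2, 3]
list_b = list_a        # both names point at the SAME list
list_b.append(4)
print(list_a)          # [1, 2, 3, 4] -- list_a changed too!
list_c = list_a[:]     # the slice makes a true copy
list_c.append(5)
print(list_a)          # still [1, 2, 3, 4] -- unaffected this time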
The XML API ElementTree was pretty great to work with. You can very easily define a hierarchy that will produce a nice xml tree (see this example). I did, however, have some trouble parsing the direct output from the type() function. When you calculate a type,
type(0.01)
you get this:
<type 'float'>
When I gave it directly to ElementTree (imported as ET here), like this:
I would get some errors because of the enclosed quotation marks. To get around this, I converted the type output to a string, split it on the quotation marks and took the piece that is just the type name (int, str, or float):
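The original snippet isn't shown here, but a minimal sketch of that workaround looks like this (the element and attribute names are made up for illustration, not the ones in inspect_shapefile.py):

import xml.etree.ElementTree as ET

value = 0.01
type_str = str(type(value)).split("'")[1]   # "<type 'float'>" -> "float"

field_el = ET.Element('field')
field_el.set('data_type', type_str)         # attribute is now just "float"
print(ET.tostring(field_el))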
In case you don’t care about the Python and are just curious about the end product, here’s the final map where the color of each state reflects its total land area:
It’s kind of neat to see the gradient of state size from east to west, reflecting the historical expansion of the U.S. westward, but other than that, there’s not much to the map. But it does serve as a simple case for learning to manipulate shapefiles.
The code starts out the same as basic_read_plot.py, but now also imports Polygon and PolygonPatch from shapely and descartes, before reading in the shapefile:
import shapefile
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from descartes.patch import PolygonPatch
"""
IMPORT THE SHAPEFILE
"""
shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)
The next part of the code plots a single geometry from the shapefile. This is super easy because shapefile.Reader reads a shapefile geometry as a list of points, which is exactly what the Polygon function needs. So we can just give that list of points directly to the Polygon function:
plt.figure()
ax = plt.axes()
ax.set_aspect('equal')
shape_ex = sf.shape(5) # could break if selected shape has multiple polygons.
# build the polygon from exterior points
polygon = Polygon(shape_ex.points)
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[0,0,0], alpha=0.7, zorder=2)
ax.add_patch(patch)
# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])
plt.ylim(shape_ex.bbox[1],shape_ex.bbox[3])
And we get Washington, now as a colored polygon rather than an outline:
Woo!
And as before, we can now loop over each shape (and each part of each shape), construct a polygon and plot it:
""" PLOTS ALL SHAPES AND PARTS """
plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')
icolor = 1
for shape in list(sf.iterShapes()):
    # define polygon fill color (facecolor) RGB values:
    R = (float(icolor)-1.0)/52.0
    G = 0
    B = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
        polygon = Polygon(shape.points)
        patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
        ax.add_patch(patch)
    else: # loop over parts of each shape, plot separately
        for ip in range(nparts): # loop over parts, plot separately
            i0=shape.parts[ip]
            if ip < nparts-1:
                i1 = shape.parts[ip+1]-1
            else:
                i1 = len(shape.points)
            polygon = Polygon(shape.points[i0:i1+1])
            patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
            ax.add_patch(patch)
    icolor = icolor + 1
plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()
In order to distinguish each polygon, I set each shape’s color based on how many shapes have already been plotted:
R = (float(icolor)-1.0)/52.0
This grades the red scale in an RGB tuple between 0 and 1 (since there are 52 shapes), and it is then used in the facecolor argument of PolygonPatch. The coloring is simply a function of the order in which the shapes are accessed:
The goal, however, is to color each polygon by some sort of data so that we can actually learn something interesting, and that is exactly what read_shp_and_rcrd.py does.
read_shp_and_rcrd.py
Up to now, we’ve only considered the shape geometry, but that is only one part of a shapefile. Also included in most shapefiles are the records, or the data, associated with each shape. When a shapefile is imported,
shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)
The resulting shapefile object (sf in this case) contains records associated with each shape. I wasn't sure what fields were included for the State Boundary shapefile from census.gov, so I opened up a Python shell in terminal, read in the shapefile and then printed out sf.fields to see the field definitions.
Down towards the end, there’s an interesting entry
['ALAND', 'N', 14, 0]
Though I couldn't find any documentation on the included fields, I suspected ALAND stood for land area (especially since it was followed by AWATER). So in read_shp_and_rcrd.py, the first thing I do is extract the field names and find the index corresponding to the land area:
""" Find max/min of record of interest (for scaling the facecolor)"""
# get list of field names, pull out appropriate index
# fieldnames of interest: ALAND, AWATER are land and water area, respectively
fld = sf.fields[1:]
field_names = [field[0] for field in fld]
fld_name='ALAND'
fld_ndx=field_names.index(fld_name)
I found this post helpful for extracting the fieldnames of each record.
Next, I loop over the records using iterRecords() to find the minimum and maximum land area in order to scale the polygon colors:
# loop over records, track global min/max
maxrec=-9999
minrec=1e21
for rec in sf.iterRecords():
    if rec[4] != 'AK': # exclude alaska so the scale isn't skewed
        maxrec=np.max((maxrec,rec[fld_ndx]))
        minrec=np.min((minrec,rec[fld_ndx]))
maxrec=maxrec/1.0 # upper saturation limit
print fld_name,'min:',minrec,'max:',maxrec
I excluded Alaska (if rec[4] != ‘AK’:) so that the color scale wouldn’t be thrown off, and then I also scale the maximum (maxrec=maxrec/1.0) to adjust the color scale manually (more on this later).
Now that I know the max/min, I loop over each shape and (1) calculate the RGB value for each polygon using a linear scale between the max and min and then (2) plot a polygon for each shape (and all the parts of a shape) using that RGB value:
for shapeRec in sf.iterShapeRecords():
    # pull out shape geometry and records
    shape=shapeRec.shape
    rec = shapeRec.record

    # select polygon facecolor RGB vals based on record value
    if rec[4] != 'AK':
        R = 1
        G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
        G = G * (G<=1) + 1.0 * (G>1.0)
        B = 0
    else:
        R = 0
        B = 0
        G = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
        polygon = Polygon(shape.points)
        patch = PolygonPatch(polygon, facecolor=[R,G,B], edgecolor=[0,0,0], alpha=1.0, zorder=2)
        ax.add_patch(patch)
    else: # loop over parts of each shape, plot separately
        for ip in range(nparts): # loop over parts, plot separately
            i0=shape.parts[ip]
            if ip < nparts-1:
                i1 = shape.parts[ip+1]-1
            else:
                i1 = len(shape.points)
            # build the polygon and add it to plot
            polygon = Polygon(shape.points[i0:i1+1])
            patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
            ax.add_patch(patch)
plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()
One important thing not to miss is that on the first line, I loop over the iterShapeRecords iterable rather than using iterShapes. This is necessary so that I have access to both shape geometry and the associated records, rather than just the shapes (iterShapes) or just the records (iterRecords).
Running the above code will produce the following map:
Because Texas is so much larger than the rest of the states, we don't see much of a difference between the remaining states. But we can adjust this by decreasing the max value used in the scaling. So after finding the max/min value, I set
maxrec=maxrec/2.0 # upper saturation limit
and end up with the following map that brings out more of the variation in the states’ land area (same map as in the very beginning of this post):
Note that because I decreased the max value used for scaling, I had to ensure that the RGB value did not exceed 1, which is why I had the following lines limiting the green value (G):
if rec[4] != 'AK':
    R = 1
    G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
    G = G * (G<=1) + 1.0 * (G>1.0)
So that’s about it! That’s how you can read in a shapefile and plot polygons of each shape colored by some data (record) associated with each shape. There are plenty of more sophisticated ways to do this exercise, and I’ll be looking into some other shapefile Python libraries for upcoming posts.
I recently started a couple of projects that will involve using shapefiles and I got frustrated real fast. Many tutorials that I found assumed some previous knowledge of either shapefiles or the python libraries used to manipulate them. But what I wanted was a tutorial that helped me to plot a simple shapefile while getting to know what a shapefile actually is!
So here’s a SUPER simple example of how to load, inspect and plot a shapefile to make a map of the U.S! There are quite a few Python libraries dealing with shapefiles and it was hard to find the easiest place to start. I found the pyshp Python library the most approachable, so that’s what I use in the following example. There are many ways to visualize shapefiles in a more automated way than I do here, but I think that my approach here gives a clearer picture to a beginner of what a shapefile is and how to use Python with shapefiles.
The shapefile
Go get yourself a shapefile! The one I used (which will definitely work with my code below) is the lowest resolution state-level cartographic boundary shapefile from census.gov (link to census.gov,direct link to lowest resolution 20m .zip file). Once you download the .zip file, unpack it and take a look inside. A shapefile is actually a collection of different files, including a .shp file containing information on shape geometry (state boundaries in this case), a .dbf file containing attributes of each shape (like the name of each state) and others (check out the wiki page on shapefiles for a description of the other file extensions).
The code!
You can download my Python code: https://github.com/chrishavlin/learning_shapefiles
At present, the src folder includes only one python script: basic_read_plot.py. To run this script you will need to:
install the pyshp Python library (and numpy and matplotlib if you don’t have them already)
edit the variables in the source code describing the path to the shapefile (dat_dir and shp_file_base in src/basic_read_plot.py)
After those two steps, just open up a terminal and run the script (assuming you’re in the src directory):
$ python basic_read_plot.py
The three plots described below should pop up.
So what does the code do?
After the initial comment block and library import, the code reads in the shapefile using the string variables that give the location of the shapefile directory (dat_dir) and the name of the shapefile without extension (shp_file_base):
sf = shapefile.Reader(dat_dir+shp_file_base)
This creates a shapefile object, sf, and the next few lines do some basic inspections of that object. To check how many shapes have been imported:
print 'number of shapes imported:',len(sf.shapes())
For the census.gov state boundary shapefile, this returns 52 for the 50 states, Washington D.C. and Puerto Rico.
For each shape (or state), there are a number of attributes defined: bbox, parts, points and shapeType. The pyshp documentation describes each, and I’ll touch on each one in the following (except for shapeType).
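For example, to peek at those attributes for a single shape (a quick inspection snippet, not part of basic_read_plot.py):

shape_ex = sf.shape(5)
print('shapeType: %s' % shape_ex.shapeType)
print('bbox: %s' % shape_ex.bbox)            # [lon_min, lat_min, lon_max, lat_max]
print('parts: %s' % list(shape_ex.parts))    # index of the first point of each closed loop
print('number of points: %i' % len(shape_ex.points))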
The first thing I wanted to do after importing the shapefile was just plot a single state. So I first pull out the information for a single shape (in this case, the 5th shape):
shape_ex = sf.shape(5)
The points attribute contains a list of longitude-latitude values that define the shape (state) boundary. So I loop over those points to create an array of longitude and latitude values that I can plot. A single point can be accessed with shape_ex.points[0] and will return a lon/lat pair, e.g. (-70.13123,40.6210). So I pull out the first and second index and put them in pre-defined numpy arrays:
x_lon = np.zeros((len(shape_ex.points),1))
y_lat = np.zeros((len(shape_ex.points),1))
for ip in range(len(shape_ex.points)):
    x_lon[ip] = shape_ex.points[ip][0]
    y_lat[ip] = shape_ex.points[ip][1]
And then I plot it:
plt.plot(x_lon,y_lat,'k')
# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])
This returns the state of Oregon! I also used the bbox attribute to set the x limits of the plot. bbox contains four elements that define a bounding box using the lower left lon/lat and upper right lon/lat. Since I’m setting the axes aspect ratio equal here, I only define the x limit.
Great! So all we need now is to loop over each shape (state) and plot it! Right? Well this code snippet does just that:
plt.figure()
ax = plt.axes()
ax.set_aspect('equal')
for shape in list(sf.iterShapes()):
    x_lon = np.zeros((len(shape.points),1))
    y_lat = np.zeros((len(shape.points),1))
    for ip in range(len(shape.points)):
        x_lon[ip] = shape.points[ip][0]
        y_lat[ip] = shape.points[ip][1]
    plt.plot(x_lon,y_lat)
plt.xlim(-130,-60)
plt.ylim(23,50)
And we can see some problems with the result:
The issue is that in some of the shapes (states), the geometry has multiple closed loops (because of the islands in some states), so simply connecting the lat/lon points creates some weird lines.
But it turns out that the parts attribute of each shape includes information to save us! For a single shape the parts attribute (accessed with shape.parts) contains a list of indices corresponding to the start of a new closed loop within a shape. So I modified the above code to first check if there are any closed loops (number of parts > 1) and then loop over each part, pulling out the correct index range for each segment of geometry:
plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')
for shape in list(sf.iterShapes()):
    npoints=len(shape.points) # total points
    nparts = len(shape.parts) # total parts
    if nparts == 1:
        x_lon = np.zeros((len(shape.points),1))
        y_lat = np.zeros((len(shape.points),1))
        for ip in range(len(shape.points)):
            x_lon[ip] = shape.points[ip][0]
            y_lat[ip] = shape.points[ip][1]
        plt.plot(x_lon,y_lat)
    else: # loop over parts of each shape, plot separately
        for ip in range(nparts): # loop over parts, plot separately
            i0=shape.parts[ip]
            if ip < nparts-1:
                i1 = shape.parts[ip+1]-1
            else:
                i1 = npoints
            seg=shape.points[i0:i1+1]
            x_lon = np.zeros((len(seg),1))
            y_lat = np.zeros((len(seg),1))
            for ip in range(len(seg)):
                x_lon[ip] = seg[ip][0]
                y_lat[ip] = seg[ip][1]
            plt.plot(x_lon,y_lat)
plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()
And we can see those spurious lines are now gone:
Final Thoughts
Now that I feel pretty good about the information contained in a shapefile and how it’s stored, I’ll be moving on to more exciting visualizations. It’s important to note, that there are many Python libraries that can plot shapefiles without manually pulling out the points as I’ve done here. But I feel much better about using those fancier approaches now that I’ve gone through this exercise.
Also, in this post I’ve only touched on the geometry information in a shapefile. But it’s really the records included in the .dbf files that will make this an interesting visualization. The records contain measurements, observations or descriptions for each shape and that information can be used to color or fill each shape to create visualizations like this one (not my work).
In their recent article, “Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000”, Reba et al. describe efforts to create a digital, geocoded dataset tracking the distribution of urban locations since 3700 BC. The digital database provides a record of human movement over the geologically recent past and is useful for understanding the forces that drive human societies towards urbanization. The database seemed like a fun test of my fledgling python skills, so in this post I’ll describe a visualization of Reba et al.’s database. Reba et al. released their data here. My python source is available via GitHub here. As this is my first actual post on the blog, let me remind you that I’m something of a python noob. So if you actually check out the full source code, I expect you’ll find lots of mistakes and perhaps some poor choices. You’re welcome to let me know of all the ways I could improve the code in the comments section below.
Before jumping into what I actually did, a bit more on the dataset. Reba et al. (2016) created their new digital database from two earlier compilations, Chandler (1984) and Modelski (2000, 2003). The authors of those previous studies meticulously scoured archaeological and historical records in search of locations of urban centers. Reba et al. took those datasets and created a .csv file listing each city’s name, latitude, longitude, population estimate and the time corresponding to the population estimate, a process that involved transcribing the original databases manually (ouch!! none of the available automated print-to-digital methods worked) and geocoding each entry to obtain a latitude and longitude. In the end, Reba et al. ended up with three digital .csv datasets: Chandler’s database for urban centers from 2250 BC to 1975 AD, Modelski’s ancient city database covering 3700 BC to 1000 AD and Modelski’s modern 2000 AD database. All told, there are just over 2000 unique cities recorded between the three databases, many of which have multiple population estimates through time.
It’s worth noting that the historical database is by no means complete. As Reba et al. discuss in detail, omissions of urban centers from the original Chandler and Modelski databases, unclear entries in the original databases or errors in transcription would all result in missing cities. South Asian, South American, North American, and African cities are particularly under-represented. As a geologist, I’m used to incomplete records. Interpreting a fossil record, a regional sedimentary sequence or structural juxtaposition often requires some interpolation. A given rock unit may be preserved in one location while it is eroded and lost to knowledge in another. Thus, the completeness of any (pre)historical dataset depends on both preservation and sampling – there could be cities missing because the local climate did not preserve abandoned structures or simply because archaeology is a relatively young pursuit and excavation efforts have traditionally focused on a small fraction of the Earth’s surface. But as Reba et al. state “These data alone are not accurate global representations of all population values through time. Rather, it highlights the population values of important global cities during important time periods.”
I was pretty impressed by Reba et al.’s work and decided their dataset would provide an interesting context to improve my python. So I set out to write some python code capable of importing the database and making some pretty plots (source code here). Note that I do not distribute Reba et al.’s database with my source, you’ll have to download that separately from their site here. See the README file in the source code for a list of other dependencies required to run the code, which was only tested with python 2.7.
Before digging too deeply into the code, let’s just start with the end product. Here’s an animation of Chandler’s data set from 2250 BC to present day. Each circle is an urban center and the color of circles changes from blue to red as the time approaches present day.
Pretty neat! We can see clustering and expansion of early urban centers around Mesopotamia, then seemingly separate loci of urban development popping up in East Asia and Central America. Along South America, it’s interesting how urban centers trace out the coastline. And as the animation approaches present day, the number of urban centers explodes (from 1960 to 2014, the percentage of the world’s population living in urban settings increased from 34% to 54%).
In addition to animations, the python source can plot a single frame for a user-specified time range. Here are the entries for Modelski’s Ancient Cities database from 500 BC to 50 AD:
The Code
The three main steps to producing the above animation were (1) import the data, (2) subsample the data and (3) plot the data. The module urbanmap.py includes all functions needed to reproduce the above figures. And the scripts ex_animate.py and ex_single_frame.py are examples that call the appropriate functions to create the above animation and single frame plot.
In the following, I’ll walk through the different steps and their related functions in urbanmap.py. Again, full source code is here.
(1) Importing the data
The first step is to actually read in some data! The function urbanmap.load_cities does just that for a specified dataset. The directory where the dataset is located and the name of the dataset file are given by the data_dir and city_file arguments, respectively:
39 def load_cities(data_dir,city_file):
40     """ loads population, lat/lon and time arrays for historical
41     city data. """
The function works with any of the three original plain text, comma separated valued (CSV) files from Reba et al.: chandlerV2.csv, modelskiAncientV2.csv and modelskiModernV2.csv.
Each .csv file has a row for each city, where the columns are the City Name, Alternate City Name, Country, Latitude,Longitude, Certainty and Population. So first, I open up the .csv file and create a csv object using the CSV reader:
44 # load city files
45 flnm=data_dir+city_file
46 fle = open(flnm, 'r') # open the file for reading
47 csv_ob=csv.reader(fle,skipinitialspace=True)
Some of the Alternate City Names have commas within quotes, which causes those entries to split. Adding the second argument (skipinitialspace=True) to csv.reader prevents those commas within quotes from being read as a new comma-separated value.
The remainder of the function reformats the data into arrays that I find more flexible to work with. First, I generate an array called Time, which contains every time for which a population record exists. In the original .csv files, the header line of each Population column gives the time to which the population estimate corresponds. The header values are strings, such as BC_1400, BC_1390, ..., AD_100, AD_110, AD_1000... So the first thing I do is convert these header strings to a 1D numpy array where each element of the array is the time in years before present (ybp).
57 # get header line
58 header = fle.next()
59 header = header.rstrip()
60 header = header.split(',')
61
62 # build the Time array
63 nt = len(header)
64 Time_string=header[6:nt]
65 nt = len(Time_string)
66 Time = np.zeros((nt,1))
67
68 # convert BC/AD to years before present
69 for it in range(0,nt):
70     ct = Time_string[it]
71     ct = ct.split('_')
72     Time[it]=bc_ad_ce_to_ybp(ct[1],ct[0])
Why go through all this trouble? Well, let’s say I want all cities with a recorded population for the time range 1400 BC to 50 AD. If I knew the header values exactly, I could find the indices in Time_string corresponding to BC_1400 and AD_50. But the headers aren’t uniform within a single .csv file or across .csv files. The way I’ve constructed the Time array, however, allows for straightforward conditional indexing. The usefulness becomes more apparent after reading in the remaining data, and I describe it in section 2 below.
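The bc_ad_ce_to_ybp function isn't shown here; schematically it does something like the following (a simplified sketch, with the present-day reference year chosen just for illustration; see urbanmap.py for the real thing):

def bc_ad_ce_to_ybp(year, era, present=2000):
    # convert a (year, era) pair like ('1400', 'BC') to years before present
    year = float(year)
    if era in ('BC', 'BCE'):
        return present + year   # e.g. 1400 BC -> roughly 3400 ybp
    else:                       # 'AD' or 'CE'
        return present - year   # e.g. AD 100 -> roughly 1900 ybp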
The next lines (lines 74-101 of urbanmap.py) are pretty straightforward. Each row of the database is read in and distributed to one of three arrays: PopuL, city_lat and city_lon. The latter two contain the latitude and longitude of every city. PopuL is a 2D matrix with a row for each city and a column for each population record (i.e., PopuL.shape is n_city by n_Times).
I did run into some trouble with blank space characters. A few lines of the .csv files have some non-ascii blank space characters ‘\xa0’ that resulted in errors when I tried to convert the entries into floating point values. So I had to replace those characters with a true blank space before converting:
81 for row in csv_ob:
82     # read in current line
83     line = row
84     line = [item.replace('\xa0',' ') for item in line]

87     # pull out lat/lon
88     city_lat[indx] = float(line[3])
89     city_lon[indx] = float(line[4])
And that’s about it for reading in the data…
(2) Subsampling the data
Now that I’ve got all the data loaded, I want to be able to subsample that data for a specified time range. The main function to do this in urbanmap.py is get_lon_lat_at_t:
Most of the arguments (city_lon,city_lat,Time,PopuL) are returned by the load_cities function, described above. The year_range argument is a string argument that specifies the time span for which I want to select cities with non-zero population records. I chose to make year_range a comma separated string:
year_range='5000,BCE,1900,CE'
This year range starts at 5000 BCE and ends at 1900 CE. The time unit can be BCE,CE,BC or AD. Within get_lon_lat_at_t, I first convert this year_range to a start and end date in ybp:
186 # convert year_range to years before present
187 year_range=year_range.replace(" ", "")
188 years=year_range.split(',')
189 time_range=[0,0]
190 time_range[0]=bc_ad_ce_to_ybp(years[0],years[1])
191 time_range[1]=bc_ad_ce_to_ybp(years[2],years[3])
Now, I can easily select the cities within the desired time range without knowing beforehand whether or not the chosen endpoints exist exactly in the Time array. First, I loop through each city and select the population records of the current city
193 # find lat and lon of cities with recorded populations in database
194 for icit in range(0,ncit):
195     pop=PopuL[icit,:] # current population
Next, I find the times in current city that have a nonzero population record
196     pop_t=Time[pop>0] # times with nonzero pops
And now I pull out times that are within the specified time range
197     pop_t=pop_t[pop_t<=time_range[0]] # pops within time range
198     pop_t=pop_t[pop_t>=time_range[1]]
The final step is check if there are any cities left. If there are no cities with a nonzero population record in the specified time range, I flag them for removal:
200     if pop_t.size == 0: # flag for removal
201         lons[icit]=999.
202         lats[icit]=999.
So at the end of all this, I select the lons and lats that are not equal to 999 and those are the longitudes and latitudes of the cities with a nonzero population within the specified time range. Neat-o! I can now return these lon/lat values and make some plots!
(3) Plotting the data
Now that we’ve got a function to subsample the data, we can plot that data in a number of ways. The simplest place to start is to put all the lat/lon of cities with a nonzero population record for a specified time range on a map. This is what the __main__ function of urbanmap.py and the script ex_single_frame.py accomplish. In both, I sequentially call load_cities and get_lon_lat_at_t functions then plotting the resulting lat/lon values using the basemap toolkit (mpl_toolkits).
The plotting is accomplished in two functions: urbanmap.base_plot() and urbanmap.city_plots(). The first creates a basemap object with the background image of the globe while the second actually puts the current lat/lon values onto that background image. base_plot() follows this example pretty closely.
The following, from ex_single_frame.py, shows the full sequence to plot cities within a current time span.
33 import urbanmap as um
34 import matplotlib.pyplot as plt
35
36 # select time range
37 time_span='500,BCE,50,CE'
38 # comma separated string noting desired time span. The age descriptor can
39 # be BC, BCE, AD or CE.
40
41 # select data set
42 data_dir='../data_refs/'
43 city_file='modelskiAncientV2.csv'
49
50 # import data set
51 (Time,PopuL,city_lat,city_lon)=um.load_cities(data_dir,city_file)
52
53 # get lon/lat of cities in time span
54 lons,lats,time_range=um.get_lon_lat_at_t(time_span,city_lon,city_lat,Time,PopuL)
55
56 # plot it
57 plt.figure(facecolor=(1,1,1))
58 m = um.base_plot() # create base plot and map object
59 um.add_annotations() # adds references
60 um.city_plot(lons,lats,m,'singleframe',time_range,Time) # plot points
61 plt.show() # and display the plot
Now that we have functions to pick out cities within a time range and then plot those points, creating an animation is conceptually straightforward. I just needed to repeatedly call get_lon_lat_at_t and city_plot, varying the time range each call. However in practice, sorting through the animation routines in the python animation package was the trickiest part of this whole exercise. I quickly gave up on using the animation routines, and simply looped over a time range, subsampling and plotting at each increment, saving the figure at each step along the way. I was then left with a bunch of image files (the frames of the animation), which I then concatenated into an animation using a bash script and ImageStack.
In the end, I managed to figure out the python animation routines, and that’s what I included in the source code.
PyTip: shifting the basemap. I used two functions to rotate the center longitude of the map: mpl_toolkits.basemap.Basemap() and mpl_toolkits.basemap.shiftgrid(). The Basemap function creates the longitude/latitude projection while shiftgrid rotates the topography data to align with the Basemap. BOTH functions take an argument lon0, but in Basemap, lon0 is defined as the center longitude of the projection while in shiftgrid lon0 is the westernmost longitude. I was tripped up by this for a while because I assumed lon0 had the same meaning in each… whoops.
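A stripped-down illustration of the difference (the topography array here is just a dummy placeholder):

import numpy as np
from mpl_toolkits.basemap import Basemap, shiftgrid

# dummy global grid: longitudes 0-360 (cyclic point included), latitudes -90 to 90
lons = np.arange(0., 360.1, 2.5)
lats = np.arange(-90., 90.1, 2.5)
topo = np.zeros((len(lats), len(lons)))   # placeholder topography

# Basemap: lon_0 is the CENTER longitude of the projection
m = Basemap(projection='robin', lon_0=0.)

# shiftgrid: lon0 marks a grid EDGE, not the center. With start=False it is the
# easternmost longitude, so this returns a grid running from -180 to 180, which
# lines up with a map centered on lon_0=0.
topo, lons = shiftgrid(180., topo, lons, start=False)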
PyTip: accessing the data for animation. Most of the tutorials for the animation function (matplotlib.animation.FuncAnimation) are relatively simple and are set up to re-calculate the data to plot at each frame. The issue I ran into was that FuncAnimation animates a specified function by sequentially feeding it a frame index. I couldn’t figure out how to pass additional arguments (the full dataset) and importing the data set at each frame would be way too slow. I had an existing dataset that I wanted to read in only once at the start of the program. I first got around this by declaring the database variables (PopuL, city_lats, city_lons,…) as global variables so that they’d be accessible within the FuncAnimation call. This was pretty easy but I’m always a little uncomfortable using global variables. My approach in the end relied on simply better understanding how python handles variables. By nesting all of the animation functions under one top level function, any variables set in that top level function are available at the lower levels (in a sense, they’re locally global?). I found this post useful.
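Here's a bare-bones sketch of that pattern (not the actual code in urbanmap.py, and the frame logic is simplified): everything the animation needs is loaded once in the top-level function, and the update callback sees it through the enclosing scope instead of through globals.

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def animate_cities(city_lon, city_lat, Time, PopuL, nframes=50):
    # data passed in by the caller is visible to update() below -- no globals needed
    fig = plt.figure()
    ax = plt.axes(xlim=(-180, 180), ylim=(-90, 90))
    points, = ax.plot([], [], '.')

    def update(iframe):
        # include cities with a nonzero population at or before this frame's time
        t_cut = Time.max() - iframe * (Time.max() - Time.min()) / nframes
        mask = (PopuL[:, Time.ravel() >= t_cut] > 0).any(axis=1)
        points.set_data(city_lon[mask].ravel(), city_lat[mask].ravel())
        return points,

    return FuncAnimation(fig, update, frames=nframes, blit=True)

# e.g.: anim = animate_cities(city_lon, city_lat, Time, PopuL); plt.show()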
References:
Reba, Meredith, Femke Reitsma, and Karen C. Seto. “Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000.” Scientific Data 3, 160034 (2016). https://doi.org/10.1038/sdata.2016.34