I recently started a couple of projects that will involve using shapefiles and I got frustrated real fast. Many tutorials that I found assumed some previous knowledge of either shapefiles or the python libraries used to manipulate them. But what I wanted was a tutorial that helped me to plot a simple shapefile while getting to know what a shapefile actually is!
So here’s a SUPER simple example of how to load, inspect and plot a shapefile to make a map of the U.S! There are quite a few Python libraries dealing with shapefiles and it was hard to find the easiest place to start. I found the pyshp Python library the most approachable, so that’s what I use in the following example. There are many ways to visualize shapefiles in a more automated way than I do here, but I think that my approach here gives a clearer picture to a beginner of what a shapefile is and how to use Python with shapefiles.
The shapefile
Go get yourself a shapefile! The one I used (which will definitely work with my code below) is the lowest resolution state-level cartographic boundary shapefile from census.gov (link to census.gov, direct link to lowest resolution 20m .zip file). Once you download the .zip file, unpack it and take a look inside. A shapefile is actually a collection of different files, including a .shp file containing information on shape geometry (state boundaries in this case), a .dbf file containing attributes of each shape (like the name of each state) and others (check out the wiki page on shapefiles for a description of the other file extensions).
The code!
You can download my Python code: https://github.com/chrishavlin/learning_shapefiles
At present, the src folder includes only one python script: basic_read_plot.py. To run this script you will need to:
- install the pyshp Python library (and numpy and matplotlib if you don’t have them already)
- edit the variables in the source code describing the path to the shapefile (dat_dir and shp_file_base in src/basic_read_plot.py)
After those two steps, just open up a terminal and run the script (assuming you’re in the src directory):
$ python basic_read_plot.py
The three plots described below should pop up.
So what does the code do?
After the initial comment block and library import, the code reads in the shapefile using the string variables that give the location of the shapefile directory (data_dir) and the name of the shapefile without extension (shp_file_base):
sf = shapefile.Reader(dat_dir+shp_file_base)
This creates a shapefile object, sf, and the next few lines do some basic inspections of that object. To check how many shapes have been imported:
print 'number of shapes imported:',len(sf.shapes())
For the census.gov state boundary shapefile, this returns 52 for the 50 states, Washington D.C. and Puerto Rico.
For each shape (or state), there are a number of attributes defined: bbox, parts, points and shapeType. The pyshp documentation describes each, and I’ll touch on each one in the following (except for shapeType).
The first thing I wanted to do after importing the shapefile was just plot a single state. So I first pull out the information for a single shape (in this case, the 5th shape):
shape_ex = sf.shape(5)
The points attribute contains a list of latitude-longitude values that define the shape (state) boundary. So I loop over those points to create an array of longitude and latitude values that I can plot. A single point can be accessed with shape_ex.points[0] and will return a lon/lat pair, e.g. (-70.13123,40.6210). So I pull out the first and second index and put them in pre-defined numpy arrays:
x_lon = np.zeros((len(shape_ex.points),1)) y_lat = np.zeros((len(shape_ex.points),1)) for ip in range(len(shape_ex.points)): x_lon[ip] = shape_ex.points[ip][0] y_lat[ip] = shape_ex.points[ip][1]
And then I plot it:
plt.plot(x_lon,y_lat,'k') # use bbox (bounding box) to set plot limits plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])
This returns the state of Oregon! I also used the bbox attribute to set the x limits of the plot. bbox contains four elements that define a bounding box using the lower left lon/lat and upper right lon/lat. Since I’m setting the axes aspect ratio equal here, I only define the x limit.
Great! So all we need now is to loop over each shape (state) and plot it! Right? Well this code snippet does just that:
plt.figure() ax = plt.axes() ax.set_aspect('equal') for shape in list(sf.iterShapes()): x_lon = np.zeros((len(shape.points),1)) y_lat = np.zeros((len(shape.points),1)) for ip in range(len(shape.points)): x_lon[ip] = shape.points[ip][0] y_lat[ip] = shape.points[ip][1] plt.plot(x_lon,y_lat) plt.xlim(-130,-60) plt.ylim(23,50)
And we can see some problems with the result:
The issue is that in some of the shapes (states), the geometry has multiple closed loops (because of the islands in some states), so simply connecting the lat/lon points creates some weird lines.
But it turns out that the parts attribute of each shape includes information to save us! For a single shape the parts attribute (accessed with shape.parts) contains a list of indeces corresponding to the start of a new closed loop within a shape. So I modified the above code to first check if there are any closed loops (number of parts > 1) and then loop over each part, pulling out the correct index range for each segment of geometry:
plt.figure() ax = plt.axes() # add the axes ax.set_aspect('equal') for shape in list(sf.iterShapes()): npoints=len(shape.points) # total points nparts = len(shape.parts) # total parts if nparts == 1: x_lon = np.zeros((len(shape.points),1)) y_lat = np.zeros((len(shape.points),1)) for ip in range(len(shape.points)): x_lon[ip] = shape.points[ip][0] y_lat[ip] = shape.points[ip][1] plt.plot(x_lon,y_lat) else: # loop over parts of each shape, plot separately for ip in range(nparts): # loop over parts, plot separately i0=shape.parts[ip] if ip < nparts-1: i1 = shape.parts[ip+1]-1 else: i1 = npoints seg=shape.points[i0:i1+1] x_lon = np.zeros((len(seg),1)) y_lat = np.zeros((len(seg),1)) for ip in range(len(seg)): x_lon[ip] = seg[ip][0] y_lat[ip] = seg[ip][1] plt.plot(x_lon,y_lat) plt.xlim(-130,-60) plt.ylim(23,50) plt.show()
And we can see those spurious lines are now gone:
Final Thoughts
Now that I feel pretty good about the information contained in a shapefile and how it’s stored, I’ll be moving on to more exciting visualizations. It’s important to note, that there are many Python libraries that can plot shapefiles without manually pulling out the points as I’ve done here. But I feel much better about using those fancier approaches now that I’ve gone through this exercise.
Also, in this post I’ve only touched on the geometry information in a shapefile. But it’s really the records included in the .dbf files that will make this an interesting visualization. The records contain measurements, observations or descriptions for each shape and that information can be used to color or fill each shape to create visualizations like this one (not my work).
Useful links: pyshp documentation, Plot shapefile with matplotlib (Stack Exchange)