In my recent coding exploits, I’ve downloaded lots of different shapefiles. Most shapefiles were accompanied by nice .xml documentation with information about the data and how it’s stored or labeled, but a few had hardly any information at all. I knew the general content based on the description from the website where I downloaded the shapefile, but I didn’t know what they had used for the record labels and I didn’t know what the record values were exactly. So over the past couple of days I sat down and wrote a bit of code to help in unraveling a mysterious shapefile…
Check out (and/or download) the full Python source here: shapefile inspection!
The program is fairly straightforward. It traverses the records of a shapefile, recording the record labels (or “field names” as I refer to them in the source) and information about each record. One of the program’s methods uses ElementTree, the XML API in Python’s standard library, to produce an xml file that you can load in a browser. Here’s a screen shot from using Firefox to view the xml file produced when running the program on the OpenStreetMap shapefile that I extracted via MapZen for my previous post.
In a browser, you can shrink or expand the xml elements to get some basic information about each record: the name or label of the record, the data type and a sample of the data. If the record data is an integer or float, the sample will be the min/max values of the record. If it’s a string, the sample will either be a list of the unique strings in the records or just a few of the strings. The OpenStreetMap shapefile contained some record values that were keywords, like the “highway” attribute in the screen shot above. Other records were strings with unique values for each shape, like the “name” attribute.
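The sampling rule described here could be sketched roughly like this. It is only an illustration, not the code from inspect_shapefile.py: the function name summarize_values and the max_unique cutoff are my own assumptions, and the values come from a plain Python list rather than from a shapefile.

```python
def summarize_values(values, max_unique=10):
    """Summarize one field's values (hypothetical helper, not from the original source)."""
    # Skip empty entries, which are common in sparsely filled fields.
    filled = [v for v in values if v not in ("", None)]
    if not filled:
        return {"type": None, "sample": []}

    first = filled[0]
    type_name = type(first).__name__
    if isinstance(first, (int, float)):
        # Numeric field: the sample is the [min, max] range.
        return {"type": type_name, "sample": [min(filled), max(filled)]}

    # String field: unique values in first-seen order.
    unique = list(dict.fromkeys(filled))
    # Keyword-like fields fit under the cutoff and are listed in full;
    # fields such as street names are truncated to a short sample.
    return {"type": type_name, "sample": unique[:max_unique]}


print(summarize_values([3, 1, 2]))
print(summarize_values(["", "grass", "grass", "reservoir"]))
```

The cutoff is what separates a keyword-like field such as “highway” from a free-text field such as “name”: the first is listed in full, the second only sampled.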
In addition to generating an xml file, the program allows you to interactively explore a field.
When you run the program from the command line (type in python inspect_shapefile.py in the src directory), it’ll ask for your input. It first asks if you want to give it a shapefile. Here I said no and used the shapefile hardwired into __main__ of inspect_shapefile.py:
Do you want to enter the path to a shapefile? (Y/N) N
Using shapefile specified in __main__ :
directory: ../../learning_shapefiles/shapefiles/denver_maps/grouped_by_geometry_type/
filename: ex_QMDJXT8DzmqNh6eFiNkAuESyDNCX_osm_line
Loading shapefile ...
... shapefile loaded!
It then pulls out all the fields in the shapefile records, displays them and asks what you want to do. This is what it looks like using the OpenStreetMap shapefile:
Shapefile has the following field names
['osm_id', 'access', 'aerialway', 'aeroway', 'amenity', 'area', 'barrier', 'bicycle', 'brand', 'bridge', 'boundary', 'building', 'covered', 'culvert', 'cutting', 'disused', 'embankment', 'foot', 'harbour', 'highway', 'historic', 'horse', 'junction', 'landuse', 'layer', 'leisure', 'lock', 'man_made', 'military', 'motorcar', 'name', 'natural', 'oneway', 'operator', 'population', 'power', 'place', 'railway', 'ref', 'religion', 'route', 'service', 'shop', 'sport', 'surface', 'toll', 'tourism', 'tower:type', 'tracktype', 'tunnel', 'water', 'waterway', 'wetland', 'width', 'wood', 'z_order', 'way_area', 'tags']
Do you want to investigate single field (single)? Generate xml file (xml)? Or both (both)? single
Enter field name to investigate: landuse
So you can see all these different fields. I chose to look at a single field (“landuse”) and the program will then look at the “landuse” record value for each shape, record its data type and save new record values:
searching for non-empty entry for landuse ...
data type found: str
Finding unique record values for landuse
1 of 212550 shapes ( 0.0 % ) new record value:
93 of 212550 shapes ( 0.04 % ) new record value: reservoir
6782 of 212550 shapes ( 3.19 % ) new record value: residential
110432 of 212550 shapes ( 51.95 % ) new record value: grass
111094 of 212550 shapes ( 52.26 % ) new record value: construction
Completed field name inspection
---------------------------------------
Shapefile has the following field names
['osm_id', 'access', 'aerialway', 'aeroway', 'amenity', 'area', 'barrier', 'bicycle', 'brand', 'bridge', 'boundary', 'building', 'covered', 'culvert', 'cutting', 'disused', 'embankment', 'foot', 'harbour', 'highway', 'historic', 'horse', 'junction', 'landuse', 'layer', 'leisure', 'lock', 'man_made', 'military', 'motorcar', 'name', 'natural', 'oneway', 'operator', 'population', 'power', 'place', 'railway', 'ref', 'religion', 'route', 'service', 'shop', 'sport', 'surface', 'toll', 'tourism', 'tower:type', 'tracktype', 'tunnel', 'water', 'waterway', 'wetland', 'width', 'wood', 'z_order', 'way_area', 'tags']
The field name landuse is str and has 5 unique values
Display Values? (Y/N) Y
possible values:
['', 'reservoir', 'residential', 'grass', 'construction']
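A scan that reports each new record value the first time it shows up could look something like the sketch below. This is not the actual source, just a guess at the general idea: the function name find_unique_values and the short landuse list are made up for illustration.

```python
def find_unique_values(values):
    """Return unique values in first-seen order, printing where each first appears."""
    unique = []
    total = len(values)
    for i, value in enumerate(values, start=1):
        # A list keeps first-seen order; fine for a handful of keywords,
        # though a set would be faster for fields with many distinct values.
        if value not in unique:
            unique.append(value)
            pct = round(100.0 * i / total, 2)
            print(f"{i} of {total} shapes ( {pct} % ) new record value: {value}")
    return unique


# Made-up stand-in for the 'landuse' column of a shapefile.
landuse = ["", "", "reservoir", "", "residential", "reservoir", "grass"]
unique_landuse = find_unique_values(landuse)
print("possible values:", unique_landuse)
```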
As you can see from the output, there were 4 keywords (reservoir, residential, grass and construction) used to describe the ‘landuse’ field. The fifth unique value is the empty string, for shapes with no landuse entry. So I could now write some code to go into a shapefile and extract only the shapes that have a ‘residential’ value for ‘landuse.’ But I couldn’t do that until I (1) knew that the landuse field existed and (2) knew the different values used for landuse type.
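Once the field name and its keywords are known, pulling out the matching shapes is a simple filter. Here is a minimal sketch, assuming the attribute records have already been read into a list of dictionaries; the records list and the select_by_field helper are hypothetical and not part of inspect_shapefile.py.

```python
def select_by_field(records, field, wanted):
    """Return the indices of records whose `field` equals `wanted`."""
    return [i for i, rec in enumerate(records) if rec.get(field) == wanted]


# Made-up records standing in for shapefile attribute rows.
records = [
    {"osm_id": "1", "landuse": "residential"},
    {"osm_id": "2", "landuse": ""},
    {"osm_id": "3", "landuse": "grass"},
    {"osm_id": "4", "landuse": "residential"},
]

residential = select_by_field(records, "landuse", "residential")
print(residential)
```

Returning indices rather than the records themselves means the same positions can be used to look up the matching geometries, since shapes and records in a shapefile are stored in the same order.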
So there it is! That’s the program. Hopefully all the shapefiles you ever download will be well-documented. But if you find one that’s not and you really need to figure it out, this little tool might help!
Some code notes and tips
The xml file that I create doesn’t follow any particular standard or convention, just what I thought might be useful. Perhaps that could be improved?
REMEMBER THAT IN PYTHON, YOU NEED TO EXPLICITLY COPY LISTS! I stupidly forgot that when you make a list
list_a = list()
list_a.append('blah')
list_a.append('d')
And then want to make a copy of the list, if you do this:
list_b = list_a
Then list_b is just a second name for the same list, so any in-place change made through list_b (an append, for example) will also change list_a. But if you do
list_b = list_a[:]
You’ll get a new (shallow) copy that won’t reference back to list_a. This is probably one of the things that I forget most frequently with Python lists. Palm-smack-to-forehead.
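Here is a small runnable comparison of the two cases. The variable names alias, shallow and nested are just for illustration, and the last part adds one caveat that goes slightly beyond the problem I hit: a slice copy only copies the outer list.

```python
import copy

list_a = ['blah', 'd']

alias = list_a        # same list object under a second name
shallow = list_a[:]   # new list; list(list_a) or list_a.copy() also work

alias.append('x')     # also visible through list_a
shallow.append('y')   # list_a is unaffected

print(alias is list_a)    # the alias is the very same object
print(shallow is list_a)  # the slice copy is a different object

# A slice copy is shallow: nested lists are still shared with the original.
nested = [[1, 2], [3]]
nested_shallow = nested[:]
nested_deep = copy.deepcopy(nested)
nested_shallow[0].append(99)  # modifies nested[0] too, but not nested_deep[0]
```

For a flat list of strings or numbers the slice copy is all you need; copy.deepcopy only matters when the list contains other mutable objects.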
The XML API ElementTree was pretty great to work with. You can very easily define a hierarchy that will produce a nice xml tree (see this example). I did, however, have some trouble using the direct output from the type() function. When you calculate a type, for example with type(0.01), you get a type object whose printed form looks like <type 'float'> in Python 2 or <class 'float'> in Python 3.
When I gave it directly to ElementTree (imported as ET here), like this:
ET.SubElement(attr, "attrtype",name="data type").text = type(0.01)
I would get some errors, because ElementTree expects the element text to be a string and type() returns a type object. To get around this, I converted the type output to a string, split it up by the quotes and took the index that would just be the type (int, str, or float):
ET.SubElement(attr, "attrtype",name="data type").text = str(type(0.01)).split("'")[1]
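An alternative to splitting the string, which I did not use in the program, is the __name__ attribute that every Python type has. The sketch below is self-contained; the root element name "shapefile" and the attribute name "width" are made up for the example and do not necessarily match the xml the program writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, chosen only for this example.
root = ET.Element("shapefile")
attr = ET.SubElement(root, "attr", name="width")

value = 0.01
# type(value) is a type object; its __name__ is the bare name as a string.
ET.SubElement(attr, "attrtype", name="data type").text = type(value).__name__

# Serialize to a string (Python 3) to check that the tree can be written out.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

This avoids depending on how Python prints a type, which differs between Python 2 and Python 3.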