Pandas Dataframes to Fortran Formatting

Ahhhh, Fortran.

The past few weeks I’ve had the pleasure (I think that’s the right word?) of dusting off my olde Fortran skills for several unrelated projects, one of which requires exporting a pandas dataframe from python to a plain-text CSV readable by some legacy fortran code. We wanted to speed up and simplify that code, and in the end I figured out a neat way to do it. The solutions out on the web (e.g., this SO post) generally follow the approach of iterating over every row in the dataframe and writing out each row with the appropriate formatting. While the pandas.DataFrame.to_csv() function has an option to format columns, it applies the same formatting to every column, and we needed individual control.

So here’s a notebook (available over at github as well) to help out anyone using python and fortran in tandem! It makes use of the lovely fortranformat package so that you can use your beloved fortran format strings exactly as you would in fortran!

(The full notebook is viewable at https://nbviewer.jupyter.org/github/chrishavlin/miscellaneous_python/blob/master/notebooks/PandasFortranFormat.ipynb.)
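If you just want the gist of the fortranformat piece, here’s a minimal sketch (the dataframe and format string below are made up, and the notebook takes a slightly different, faster route than this simple row loop):

import pandas as pd
import fortranformat as ff

# a made-up dataframe standing in for whatever the legacy code expects
df = pd.DataFrame({'depth': [10.5, 20.25, 30.0], 'vs': [3.2, 3.4, 3.6]})

# the fortran format string, written exactly as the legacy code would read it
writer = ff.FortranRecordWriter('(2F10.4)')

with open('for_fortran.txt', 'w') as f:
    for row in df.itertuples(index=False):
        f.write(writer.write([float(v) for v in row]) + '\n')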

3D rendering of velocity anomalies

Lately I’ve been working on using the yt library (https://yt-project.org) for 3D visualization of seismic data sets. Seismologists generally avoid 3D plots in favor of 2D slices of 3D data, as it’s often hard to interpret 3D structure. Much of that difficulty comes from relying on iso-surfaces to highlight structure, but yt is different: it uses ray-tracing to project through a data set, integrating information as it goes. The result is a 3D rendering that retains the structure accumulated along each ray.
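For context, a bare-bones yt volume rendering looks something like the sketch below. This is not the workflow used for the seismic model shown here, just a minimal example on a made-up uniform grid:

import numpy as np
import yt

# a made-up 3D field standing in for the velocity-anomaly model
dvs = np.random.random((64, 64, 64))
bbox = np.array([[0., 1200.], [0., 1200.], [0., 1200.]])
ds = yt.load_uniform_grid({"dvs": dvs}, dvs.shape, length_unit="km", bbox=bbox)

# build a volume-rendering scene for the field and save an image
sc = yt.create_scene(ds, field=("stream", "dvs"))
sc.save("volume_render.png")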

Here’s an example of a 3D rendering of negative velocity anomalies in the Western U.S. from 50-1200 km deep, using the shear wave data from James et al. 2011, one of the 3D datasets available via IRIS (https://ds.iris.edu/ds/products/emc-nwus11-s/):

The highlighted structures represent regions in the Earth’s mantle that exhibit slower seismic shear wave speeds when compared to a reference model and the faint grid in the middle of the model domain is at 410 km, the upper limit of the mantle transition zone. The white dot on the surface is Yellowstone.

This is a preliminary image, so I won’t dive into interpretation of the structure embedded here. For now, just enjoy this mesmerizing GIF:

A primer on map projections in Python

Ok, so after my post a little while ago about converting contours to shapefiles, I spent a day adapting the approach for use in a dash-plotly app… and in general it works, and it took a ~800 MB file down to a ~14 MB shapefile. But rendering the shapefile was still very slooooooow, and due to limitations of how data callbacks update a figure’s data in Dash, I couldn’t just render the background map once. So, after some thinking, I realized that if I “simply” calculated my own map projections, I could use the standard x-y plotting routines (which include contours, as well as the faster scattergl method for plotting a large number of points).

But then I realized that I only vaguely know what map projections are: I knew different projections vary in how they portray the 3D earth on a 2D surface, but when you start using all the mapping libraries out there, you run into a lot of jargon really fast. So here’s a super basic primer on calculating map projections in Python that explains some of that jargon!

Also, this is the first post I decided to write as a Python notebook. The notebook is in my github learning_shapefiles repo, but github has been having some issues displaying notebooks lately, so you can view the full notebook here.

PyProj

To start off, pyproj is a python library built on the PROJ library (https://proj.org/) that provides a large number of algorithms related to map projections. So go install that (simple to do with pip or conda).

In order to use pyproj to project a lat/lon point onto a map projection, it’s helpful to know a few acronyms:

  • CRS: coordinate reference system
  • EPSG: a database of coordinate reference systems and transformations; different projections covering different regional, local, or global domains are specified by the number following EPSG. For example, EPSG:4326 is the global reference system (based on the WGS84 ellipse) used by modern-day GPS.
  • WGS84: World Geodetic System 1984, the latest revision of the World Geodetic System; same as EPSG:4326.
  • reference ellipse: an ellipse describing the surface of the Earth on which positions (like latitude and longitude) are defined. GPS points use a reference ellipse that approximates the Earth’s geoid (i.e., the gravitational equipotential surface that sea level would follow under the influence of gravity alone, see wiki).

So now that’s out of the way…. to project a single lat/lon point with Pyproj, first import pyproj then initialize a projection (Proj) object:

from pyproj import Proj
p=Proj(proj='hammer',ellps='WGS84')

When initializing the Proj (projection) object, you give it the reference ellipse that your lat/lon is defined on, ellps='WGS84', and the projection you want to transform to, in this case a hammer projection. Once you’ve initialized Proj, you can use it to move from lat/lon to cartesian x,y:

lon=-120.5 
lat=42.4
x,y=p(lon,lat)

This lon/lat point becomes (x,y) = (-9894366.0792, 5203184.81636). These values are in meters by default and are the cartesian x-y distances from the map center for your point.
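As a quick sanity check (not something covered in the notebook), the same Proj object can invert the projection and recover the original lon/lat:

# invert the projection (using p, x, y from above) to get back to lon/lat
lon_back, lat_back = p(x, y, inverse=True)
print(lon_back, lat_back)  # approximately (-120.5, 42.4)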

The Proj object is flexible enough to take lists (and numpy arrays!), so you can project many points at once. It can also take any of the parameters defined for the PROJ projections as keyword arguments. For example, the Hammer projection has an argument '+lon_0' to set the 0-longitude for the view, so in pyproj you can use 'lon_0' as a keyword argument when using the hammer projection:

p = Proj(proj='hammer',ellps='WGS84',lon_0=-50)

The rest of the notebook goes through how you could construct a grid of lat/lon lines and plot them for different projections. And at the end, I take a digital elevation model of the western US (which gives an elevation for lat/lon points), project the lat/lon points onto a hammer projection to get points of (elevation, x, y), and contour it using the standard matplotlib.pyplot.contourf to get the following:
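As a rough sketch of that last step (with a made-up elevation field standing in for the real DEM), projecting a whole grid and contouring it in x-y looks something like this:

import numpy as np
import matplotlib.pyplot as plt
from pyproj import Proj

p = Proj(proj='hammer', ellps='WGS84')

# lat/lon grid over the western US with a fake "elevation" field
lon = np.linspace(-125., -100., 200)
lat = np.linspace(30., 50., 150)
long, latg = np.meshgrid(lon, lat)
elev = np.sin(long / 5.) * np.cos(latg / 5.)  # stand-in for DEM values

# project every grid point, reshape back to the grid shape, then contour
x, y = p(long.ravel(), latg.ravel())
x = np.reshape(x, long.shape)
y = np.reshape(y, latg.shape)
plt.contourf(x, y, elev, 20)
plt.show()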

In my particular application, I’ll now be able to project my data to x,y and then use the cartesian contour functions of dash-plotly! The main drawback of this approach is having to write all the routines for handling rotations and plotting lines on maps… but ultimately I think it will work well with Dash, and it will be nice to be able to use scattergl with my map point data.

 

Shapefiles in Python: converting contours to shapes

I’m in the process of finishing up a web-app that I’ve written using Dash/Plotly and, as someone who is not an expert in front end web development, I’ve really enjoyed how easy it is to build an interactive app without getting bogged down by all the javascript/html/css (though knowing some definitely helps!). But recently I’ve run into an issue while figuring out how to plot various datasets as a background contour on a number of interactive maps: while Dash/Plotly is great at plotting interactive points on a map, there’s not a simple way to contour a dataset and use it as the base layer (there is the new scattermapbox class, but there is not yet a simple contouring method for maps like, for example, contourf in the basemap library).

But Dash/Plotly is very good at choropleth (colored shapes) plots, so I figured that if I could convert a matplotlib.pyplot.contourf plot into a shapefile of polygons, I could then load that shapefile in Dash/Plotly as needed and plot it as a choropleth plot, which seems easier to me than rendering a bitmap image of the contour plot in the background (the only other option I can think of)… and so I spent the morning experimenting with how to do this. There are a few scattered stackoverflow questions related to this, but I had to pull together a bunch of pieces to get this working, so I figured a short writeup here might be useful to others out there.

So with that said, here are the python libraries that you’ll need to run my test script:

  • shapely (for polygons, used previously here, here and here)
  • fiona (for writing a shapefile)
  • descartes (for making some plots)
  • matplotlib (for the initial contouring of the data)
  • numpy (for making up some initial data and some array manipulation)

The full code is available in my learning_shapefiles repo and the new script is contourf_to_shp.py. And some useful stackoverflow answers that helped significantly: Converting Matplotlib contour objects to Shapely objects, matplotlib – extracting values from contour lines.

Creating data to contour 

To start off, I wanted to work with a small test data set that would include multiple domains at the same contour level, so I created a lat/lon grid and then superimposed two 2d gaussian curves with different amplitudes and decay rates and plotted those up:

from shapely import geometry
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import fiona
import os,json
from descartes.patch import PolygonPatch

# create some test data with multiple peaks
lon = np.linspace(0,45,100)
lat = np.linspace(-20,32,90)
long,latg=np.meshgrid(lon,lat)
C1=np.sqrt((long-5.)**2+(latg-25)**2)/30.
C2=np.sqrt((long-30.)**2+(latg-1)**2)/10.
m = 30*np.exp(-C1**2)+20.*np.exp(-C2**2)

# make the contourf plot, storing the resulting ContourSet in cs
plt.figure(figsize=[10,5])
plt.subplot(1,2,1)
Nlevels=10
cs = plt.contourf(lon,lat,m,Nlevels,cmap='gist_heat')
plt.title('contourf figure with Nlevels='+str(Nlevels))

In my actual use-case I’ll be loading a 2D dataset, so most of this will be replaced. But the important bit is that I store the object that results from the contourf call; that cs object contains all the data on the contours.

Storing the contour results

The first step in saving the contoured data is to create a lookup table for the contour levels. The cs object stores each contour in cs.collections and the levels in cs.levels:

# create lookup table for levels
lvl_lookup = dict(zip(cs.collections, cs.levels))

Next, I loop over each contour of cs.collections, convert the contour’s shape into a shapely polygon and store it in a list of dictionaries that also saves that contour’s level value:

# loop over collections (and polygons in each collection), store in list for fiona
PolyList=[]
for col in cs.collections:
    z=lvl_lookup[col] # the value of this level
    for contour_path in col.get_paths():
        # create the polygon for this level
        for ncp,cp in enumerate(contour_path.to_polygons()):
            lons = cp[:,0]
            lats = cp[:,1]
            new_shape = geometry.Polygon([(i[0], i[1]) for i in zip(lons,lats)])
            if ncp == 0:                
                poly = new_shape # first shape
            else:
                poly = poly.difference(new_shape) # Remove the holes
            
            PolyList.append({'poly':poly,'props':{'z': z}})

The list PolyList now holds a shapely polygon as well as the z-value (or level) of that polygon.

Writing the data

In order to use these polygons elsewhere, the easiest thing to do is simply save them within a shapefile. I used the Fiona library for writing, which entails writing the geometry (the polygons) and the properties for each polygon that I stored earlier:

# define ESRI schema, write each polygon to the file
outfi=os.path.join(outname,'shaped_contour.shp')
schema = {'geometry': 'Polygon','properties': {'z': 'float'}}
with fiona.collection(outfi, "w", "ESRI Shapefile", schema) as output:
    for p in PolyList:
        output.write({'properties': p['props'],
            'geometry': geometry.mapping(p['poly'])})

In order to reconstruct the contour plot, I need to save the information about the different levels (to get the colormap correct). And while I could load the shapefile and loop over the shapes to calculate min/max values, it saves time to simply write that data to a different file. And rather than write it as metadata of some kind for the shapefile, I just drop it into a json file:

# save the levels and global min/max as a separate json for convenience
Lvls={'levels':cs.levels.tolist(),'min':m.min(),'max':m.max()}
with open(os.path.join(outname,'levels.json'), 'w') as fp:
    json.dump(Lvls, fp)

Plotting the polygons

From there, the shapefile and levels data can be re-loaded and plotted to get the following:

On the left is the original contour plot, on the right is the plot of Polygon patches colored using the same colormap. And it’s an almost perfect replica… I’m not sure if it’s my eyes or screen, but the lightest colors seem just a hair different between the two. But that difference is even less noticeable as I use more contours, e.g. 1000:

[figure: the same comparison with 1000 contour levels]

The final bit of code is mostly self-explanatory: looping over the shapes in the shapefile, then plotting colored polygon patches like in previous posts. The only tricky business is making sure the polygon facecolor matches the filled contour color. To do this, I load in the contour level json file that I saved off, and load the matplotlib colormap that I used in the initial call to contourf:

# read in levels, define colormap
with open(os.path.join(outname,'levels.json')) as jfile:
    Lvls=json.load(jfile)
levels=np.array(Lvls['levels'])
cmap=plt.cm.gist_heat
lv_range=[Lvls['min'],Lvls['max']]

And then for each polygon, I pull out the z-value that I saved and find the appropriate RGB value for that level with:

        lv=shp['properties']['z'] # this shape's level
        clr=cmap((lv - lv_range[0])/(lv_range[1]-lv_range[0]))

The bit of math with lv_range is to ensure that the value is between 0 and 1. The clr variable is a tuple that can be given to the descartes PolygonPatch:

patch = PolygonPatch(poly, facecolor=clr, edgecolor=clr)
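Putting those pieces together, re-loading and plotting the shapefile boils down to something like the sketch below (it re-uses outname, cmap and lv_range from the snippets above; the full version is in the contourf_to_shp.py script mentioned earlier):

import os
import fiona
import matplotlib.pyplot as plt
from shapely.geometry import shape
from descartes.patch import PolygonPatch

fig, ax = plt.subplots()
with fiona.open(os.path.join(outname, 'shaped_contour.shp')) as shapes:
    for shp in shapes:
        poly = shape(shp['geometry'])   # rebuild the shapely polygon
        lv = shp['properties']['z']     # this shape's contour level
        clr = cmap((lv - lv_range[0]) / (lv_range[1] - lv_range[0]))
        ax.add_patch(PolygonPatch(poly, facecolor=clr, edgecolor=clr))
ax.autoscale()
plt.show()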

And that’s pretty much it. I still have some experimenting to do with how many contour levels I need for the datasets I’m working with, and I expect to run into some scaling issues (the full dataset fields I’m working with are 100s of MB), but at least I have a feasible approach!

Full code here.

Shapely Polygons: Expansion!

Today I found myself needing to scale a large number of geometric shapes corresponding to volcanic fields in the Western United States. Shapely (used in previous posts) makes this easy with the shapely.affinity.scale() method (an affine transformation preserves lines and points, and the scale() method is a simplified transformation without adding rotation or skew).
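Here’s a minimal sketch of what scale() does, using a plain buffered circle rather than the squiggly test shape from the script:

import matplotlib.pyplot as plt
from shapely import affinity
from shapely.geometry import Point

base = Point(0., 0.).buffer(1.0)  # a simple circular test polygon

# scale the shape by a few factors about its centroid and plot each outline
for factor in [0.25, 0.5, 1.0, 1.5, 2.0]:
    scaled = affinity.scale(base, xfact=factor, yfact=factor, origin='centroid')
    x, y = scaled.exterior.xy
    plt.plot(x, y)

plt.gca().set_aspect('equal')
plt.show()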

So I put together a test script using the scale() method (on my github here) that takes the scaling factor as input. Since I’ll be using this with geographic coordinates and will be testing the sensitivity of measurements to different sized shapes, I added the option to scale the shape by a sequence of distances (angular or linear). The code is fairly straightforward and I don’t have time for a detailed post, so for now, here’s a trippy plot showing a squiggly-circle shape scaled by different amounts:

[figure: the squiggly-circle shape scaled by each factor]

To generate the above plot (assuming you’ve cloned my learning_shapefiles repo and are in the directory containing affine_expansion.py):

python affine_expansion.py -f ../data/expansion_test_data.csv -exp 0.25,.5,.75,1.25,1.5,1.75,2

 

merging shapes and plotting the physiographic boundary of the Colorado Plateau

Today I found myself needing to plot the physiographic boundary of the Colorado Plateau in Python. It’s been a while since I’ve touched on shapefiles (or anything on the blog) so I figured I’d write a quick blurb on reading and plotting this particular shapefile.

Data: shapefile of data from  Fenneman and Johnson 1946 [1] available at https://water.usgs.gov/GIS/dsdl/physio_shp.zip

Code to load & plot & write processed data: colorado_plateau.py

Python requirements: pyshp, shapely, matplotlib

What you’ll learn: reading shapefiles, merging polygon shapes in Python with shapely

The Data

The first challenge was finding the actual lat/lon coordinates defining the edge of the Colorado Plateau… it’s amazing how many papers in geology/geophysics plot the boundary but don’t actually reference where the heck they got their coordinates from. After much digging I FINALLY found a paper that actually cited their source: Hopper and Fischer 2018 [2] reference a 1946 publication by Fenneman and Johnson [1] titled “Physiographic divisions of the conterminous U. S.” and after a quick search I found the digitized data from that publication online at water.usgs.gov.

Here’s the summary page containing metadata: https://water.usgs.gov/GIS/metadata/usgswrd/XML/physio.xml

and a direct link to the zipped shapefile:  https://water.usgs.gov/GIS/dsdl/physio_shp.zip.

The dataset contains a large number of physiographic regions and the Colorado Plateau is subdivided into multiple regions, so the code below pulls out the regions within the Colorado Plateau and joins them into a single shape defining the full boundary. To run the code below, unpack physio_shp.zip wherever you downloaded it to and rename the folder to physio (to match expectations for the pyshp shapefile reader).

The Code

The full code is here.

The XML data for the shapefile defines a province code for different provinces, for which the Colorado Plateau sub-regions have a value of 21. So the code (1) reads the shapefiles, (2) finds the shapes with a province code of 21 and (3) combines them.

Step 1:  imports, reading arguments, reading the shapefile.

shapefile is the library for pyshp, otherwise pretty self explanatory:

import shapefile, os,sys
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from shapely.ops import cascaded_union

# read the arguments
fname=sys.argv[1] # path to physio.shp
outfolder=None
if len(sys.argv)>2:
    outfolder=sys.argv[2] # folder to store output

# read the shapefile
sf = shapefile.Reader(fname)

Step 2: Find the Colorado Plateau shapes.

The shapes are described in the records list of the shapefile object:

sf.records()

records() is a list of attributes for each shape and a single record looks like

[3.886, 9.904, 220, 15, 212, '21b', 'INTERMONTANE PLATEAUS', 'COLORADO PLATEAUS', 'UINTA BASIN', 21]

The final value is the province code, so we just need to save off the indices for which that value is 21. It turns out the 3rd value in the record list is actually a cross-reference to a shape ID, but for some reason the indexing is offset by 2 when reading this shapefile with python. So the shape data for this shape would be accessed with:

sf.shapes()[218]

rather than 220. Not sure why it’s off by 2 (would expect it to be off by 1 due to python indexing), but in any case, my code simply records the list index as python sees it:

# find the record indices for colorado plateau (province ID = 21)
i_rec=0
recs_to_plot=[]
for rec in sf.records():
    if rec[-1]==21:
        print(rec)
        print(i_rec)
        recs_to_plot.append(i_rec)
    i_rec=i_rec+1

# plot the individual records
plt.subplot(1,2,1)
for rec in recs_to_plot:
    pts=sf.shapes()[rec].points
    lons=[pt[0] for pt in pts]
    lats=[pt[1] for pt in pts]
    plt.plot(lons,lats,'.k')

As seen above — the coordinates for the shape boundaries for a record are in

sf.shapes()[rec].points

which is a list of longitude and latitude points (which the code unpacks for plotting). This section of code will generate the following outline of the Colorado Plateau regions:

Step 3: merging shapes

This is the fun bit! What we want is just the outer boundary of the union of all the shapes. The python library shapely lets us do this very easily by creating a list of shapely Polygon objects then combining them with the cascaded_union method:

# create a single shape for Colorado Plateau from union of sub-shapes
polies=[]
for rec in recs_to_plot:
    polies.append(Polygon(sf.shapes()[rec].points))
CP_bound=cascaded_union(polies)

# plot the exterior shape
lon,lat = CP_bound.exterior.xy
plt.subplot(1,2,2)
plt.plot(lon,lat,'.k')

and the resulting plot of just the exterior boundary:

Step 4: output the processed data 

The code also exports the lat/lon points defining that exterior boundary with:

# export the final shape as a CSV of boundary points
if outfolder is not None:
    f=open(os.path.join(outfolder,'ColoradoPlateauBoundary.csv'),'w')
    f.write("lon,lat\n")
    for i,j in zip(lon,lat):
        f.write(str(i)+","+str(j)+"\n")
    f.close()

I could have written some code to save the data in a shapefile format, but for such a small amount of data I find it easier to save a CSV and just create a Polygon from the list of points as I need it. I’m actually planning to create a Polygon that will be combined with GeoPandas to find sets of points falling within the plateau (GeoPandas lets you do database joins on geospatial data, it’s awesome!).
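For example, re-building the polygon from that CSV and testing whether points fall inside it only takes a few lines (the test coordinates below are arbitrary):

import pandas as pd
from shapely.geometry import Point, Polygon

# rebuild the Colorado Plateau polygon from the exported boundary CSV
bounds = pd.read_csv('ColoradoPlateauBoundary.csv')
cp_poly = Polygon(list(zip(bounds['lon'], bounds['lat'])))

# test a couple of lon/lat points against the boundary
print(cp_poly.contains(Point(-110.0, 37.0)))  # near the Four Corners, should be inside
print(cp_poly.contains(Point(-80.0, 40.0)))   # nowhere near the plateau, should be outside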

Running the Code

To run the code:

python colorado_plateau.py /path/to/physio/physio.shp /folder/for/output/

where the first argument is the path to the downloaded and unpacked shapefile and the second argument is the location to save the CSV file (this argument is optional — no data will be saved if not included).

References

[1] Fenneman, N. M., & Johnson, D. W. (1946). Physiographic
divisions of the conterminous U.S. Reston, VA: US Geological Survey,
Physiographic Committee Special Map. https://water.usgs.gov/GIS/metadata/usgswrd/XML/physio.xml

[2] Hopper, E., & Fischer, K. M. (2018). The changing face of the lithosphere-asthenosphere boundary: Imaging continental scale patterns in upper mantle structure across the contiguous U.S. with Sp converted waves. Geochemistry, Geophysics, Geosystems, 19, 2593–2614. https://doi.org/10.1029/2018GC007476

Bikes! Part 0

Some months ago I discovered that there are a number of bike shares out there that make their data publicly available, and I’ve been meaning to download some of it and poke around. Well today I finally had some time. And though I’m not sure what I’ll be doing with this data set yet, I wanted to share a figure I made in my initial exploration.

The following figure shows the number of rides per day and the median ride distance for the Portland OR bike share (data from BikeTown: https://www.biketownpdx.com/system-data). I threw the code in a new github repo here so you can take a look if interested, but I’m not going to go into detail yet (the code just downloads their system data files, concatenates the monthly files and does some minimal processing and plotting). In any case, the figure:

The neat (and maybe unsurprising) feature is the strong seasonality to bike share usage, both in terms of just the number of rides per day (high in summer, low in winter) and the median distance of each ride (longer rides in summer). There is an interesting spike in total rides around May 2018 — maybe excitement for springtime? or additional bikes added to the program? A plot of bike usage percent (total rides / available bikes) might be more illustrative.
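I won’t walk through the repo code here, but the aggregation behind a figure like this is only a few lines of pandas. The file and column names below are placeholders, not BikeTown’s actual schema:

import pandas as pd
import matplotlib.pyplot as plt

# placeholder file/column names: one row per ride, with a start date and a distance
rides = pd.read_csv('biketown_trips.csv', parse_dates=['start_date'])

daily = rides.groupby(rides['start_date'].dt.date).agg(
    rides_per_day=('distance_miles', 'size'),
    median_distance=('distance_miles', 'median'),
)

daily.plot(subplots=True)
plt.show()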

So that’s that for now. Hopefully won’t be too long before I have time for some more in depth analysis.


 

Great Circle Paths

Been quite a while since any updates, but here’s a short one!

As a part of a contract I’m working on, I found myself having to plot the major arc of great circle paths on a map. But if you google “how to plot great circle path in [insert python library here]” all the solutions are for plotting minor arcs. Turns out in the end, there’s a really simple trick to plotting the major arc (and I felt pretty dumb when I realized it after wasting a ton of time), but I figured I’d write it up here in case it saves anyone else a bit of time. The short answer: given two points that you want the major arc for, just add the antipodes for each point to the list of points in your path.

First off, in case you need a review of great circles, here’s a globe:

Given two points on the surface of a sphere, there is a single great circle that contains both points (unless the two points are antipodal, in which case there are infinitely many great circles). The short way round is the minor arc (red curve), the long way round is the major arc (green curve). And I needed to plot both of them.

The reason I got into plotting this in the first place is that in seismology, surface waves are described by major and minor arcs. When an earthquake generates seismic waves and is measured at a seismometer somewhere else, the raypath between the epicenter and seismic station falls on a great circle path. And surface waves are referred to in terms of the minor and major arcs: the R1 wave travels the minor arc and the R2 travels the major arc. These waves will actually keep going around the earth’s surface before dissipating: R3 is the R1 after it goes around again, R4 is the R2 after it goes around again, and on and on. So I needed to be able to plot all these.

Ok, so back to plotting…

I was using the plotly library in Python for this plot, so I’ll stick with that for examples here, but there should be similar functions in whatever mapping library you’re using. The full script is on my github page here.

So the important bit is just defining the list of latitude and longitude points. Here, the minor arc points are put into a dictionary:

paths={}
paths['minor_arc']={'lon':[ start_lon, end_lon ],
                    'lat':[start_lat,end_lat], 'clr':'red','dash':None}

When we give this to plotly, we’ll tell it to connect the two points, which will give us the shortest path between the two, the minor arc.

To plot the major arc, we just need to add some points between the start and end so that it takes the long way around. But how to choose the points? Well, it turns out that there are tons of confusing pages out there on the trig used for calculating great circle paths, and I almost started to code some of it up… until I realized I only needed a couple of points. And the antipodes (the point exactly opposite a given point on the surface) are both really easy to calculate and guaranteed to lie on the great circle path. Just add 180 to the longitude and flip the sign of the latitude:

ant1lon=start_lon+180
if ant1lon>360:
    ant1lon= ant1lon - 360
ant1lat=-start_lat

Same for the antipode of the second (end) point. The >360 check is just to make sure longitude remains between 0 and 360 degrees.
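Wrapped up as a tiny helper function (not in the original script, but the same logic), the antipode calculation for either endpoint looks like:

def antipode(lon, lat):
    """Return the antipode of a (lon, lat) point, keeping longitude within 0-360."""
    anti_lon = lon + 180
    if anti_lon > 360:
        anti_lon = anti_lon - 360
    return anti_lon, -lat

ant1lon, ant1lat = antipode(start_lon, start_lat)
ant2lon, ant2lat = antipode(end_lon, end_lat)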

And now, we can put the antipodes in a list for the major arc:

paths['major_arc']={'lon':[ start_lon,ant2lon,ant1lon, end_lon ],
                    'lat':[start_lat,ant2lat,ant1lat,end_lat], 
                    'clr':'green','dash':None}

In the script, these paths are then added as a data dictionary used in creating the plotly figure:

DataDict=list()
for path in ['minor_arc','major_arc']:
    DataDict.append(
        dict(
            type = 'scattergeo',
            lon = paths[path]['lon'],
            lat = paths[path]['lat'],
            name= path,
            mode = 'lines',
            line = dict(
                width = 2.5,
                color = paths[path]['clr'],
                dash=paths[path]['dash'],
            ),
            opacity = 1.0,
        )
    )

figdata={}
figdata['data']=DataDict

The full script has a bit more where it actually plots the data (to give the image above), but plotly has some really nice tutorials for that already so I won’t bother explaining all that.

So that’s that! Hope it saves someone else some time.

 

Quick tutorial on Geocoding with Python

I recently found myself needing to get the latitude/longitude of a list of cities (for this map here) and it turns out, it’s pretty easy now that I know how to do it. Here’s a quick tutorial!

Ok, so the process of taking a city and assigning a latitude/longitude point is called geocoding. There are many services that offer this (e.g., Google or Bing Maps APIs) but most I looked at seemed overkill for a one-time task of assigning lat/lon to about 500 cities. But then I discovered OpenStreetMap’s Nominatim!  You can modify the http address to return results in an xml file. For example,  the following searches for Providence, RI:

https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml

 

And returns:

<searchresults timestamp="Thu, 02 Feb 17 16:17:00 +0000" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" querystring="Providence RI USA" polygon="false" exclude_place_ids="158799064,159481664" more_url="https://nominatim.openstreetmap.org/search.php?format=xml&exclude_place_ids=158799064,159481664&accept-language=en-US,en;q=0.8&q=Providence+RI+USA">
<place place_id="158799064" osm_type="relation" osm_id="191210" place_rank="16" boundingbox="41.772414,41.861571,-71.4726669,-71.3736134" lat="41.8239891" lon="-71.4128342" display_name="Providence, Providence County, Rhode Island, United States of America" class="place" type="city" importance="0.80724054252736" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png"/>
<place place_id="159481664" osm_type="relation" osm_id="1840541" place_rank="12" boundingbox="41.7232498,42.0188529,-71.7992521,-71.3177699" lat="41.8677428" lon="-71.5814833" display_name="Providence County, Rhode Island, United States of America" class="boundary" type="administrative" importance="0.58173948152676" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png"/>
</searchresults>

If you scroll to the right you’ll see:

lat="41.8239891" lon="-71.4128342"

It’s pretty easy to write a python script to request then parse the xml result for lat and lon.  Here’s what that might look like (BUT DON’T DO THIS):

import urllib2

city='Providence, RI'
city_search=city.replace(' ','').split(',') # removes whitespace, splits city/state

# build the http address:
# (results in a string: 'https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml')
osm='https://nominatim.openstreetmap.org/search.php?q=' 
fmt='+USA&polygon=1&format=xml'
srch = osm + city_search[0] + '+' + city_search[1] + fmt

# now use urllib2 to open the url and store the result:
response = urllib2.urlopen(srch)
the_page = response.read().split()

# and now we can parse the resulting string array where the xml info is stored.
# note: it only stores the first Lon/Lat that it encounters

Lon = 0.0
Lat = 0.0
for iel in range(0,len(the_page)): # loop over the strings in the_page, look for Lat/Lon
    if 'lon=' in the_page[iel] and Lon == 0.0:
        Lon=float(the_page[iel].split("'")[1])
    if 'lat=' in the_page[iel] and Lat == 0.0:
        Lat=float(the_page[iel].split("'")[1])
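As an aside, splitting the response on whitespace is a bit fragile; since the result is XML, you could also parse it properly with xml.etree (a sketch, not part of the original script):

import urllib2
import xml.etree.ElementTree as ET

srch = 'https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml'
root = ET.fromstring(urllib2.urlopen(srch).read())

# grab the lat/lon attributes from the first <place> element in the results
place = root.find('place')
if place is not None:
    Lat = float(place.get('lat'))
    Lon = float(place.get('lon'))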

So. Why not just loop over your list of cities and repeat this exercise? Well if you check out Nominatim’s documentation page, and take a look at the usage policy, it requires: “(1) No heavy uses (an absolute maximum of 1 request per second). (2) Provide a valid HTTP Referer or User-Agent identifying the application (stock User-Agents as set by http libraries will not do). (3) Clearly display attribution as suitable for your medium. (4) Data is provided under the ODbL license which requires to share alike (although small extractions are likely to be covered by fair usage / fair dealing).” While I don’t think that my case of simply geocoding 500 or so cities falls under heavy usage and I could just delay my successive calls, I decided to look into their suggestions for other options.

In the end I settled on MapQuest’s implementation of Nominatim. It provides access to all the OpenStreetMaps data (still open source and subject to the OSM license agreements) and a MapQuest free developer account gets you 15,000 request/month for free. Waaay more than I’d need for this project.

So to geocode a list of cities, first sign up for a MapQuest Developer Account. You’ll get an API key assigned to you. Unlike some other API’s, MapQuest doesn’t use any fancy authentication. You basically just make a request for the URL with the API in the http address. Reaaaaally easy (but not exactly secure).

Then you can run a code very similar to that above. My implementation is here: look_up_latlons.py. But it’s kind of tied to the data that I was mapping.

Some notes on the code.

(1) The API key is passed in through a command line argument, so when you run this code you have to type

$ python look_up_latlons.py AL1243KSFD242332552134KLJ

where that long string of letters/numbers is whatever your API key is.

(2) And then the formatting of the http address is slightly different from the standard Nominatim api. The same search for Providence RI  looks like:

 http://open.mapquestapi.com/nominatim/v1/search.php?key=API_KEY&format=xml&q=Providence+RI

where API_KEY is, again, your API key.

(3) In my implementation, I have imported a CSV file as a pandas dataframe called Counts. Each row contains a city name along with the number of people who marched in the Women’s Marches on Jan. 21. The meat of the code is copied below, in which I iterate over the rows in Counts, find the lat/lon for each row (i.e., each city), and then store that lat/lon in a new dataframe (NewCounts), because it’s bad to modify an existing dataframe while iterating over it. Here’s what that looks like:

osm='http://open.mapquestapi.com/nominatim/v1/search.php?key='+API_KEY+'&format=xml&q='

# loop over cities in crowd counts, find Lat/Lon
NewCounts=Counts.copy()
NewCounts['lon']=np.zeros(len(Counts)) # add new column for lon
NewCounts['lat']=np.zeros(len(Counts)) # add new column for lat
for index, row in Counts.iterrows():

    srch=osm+str(row['City']).replace(' ','+')

    print '\n\nLooking up lat/lon for',row['City'],index
    time.sleep(dt) # dt = delay between requests, defined elsewhere in the script
    response = urllib2.urlopen(srch)
    the_page = response.read().split()

    for iel in range(0,len(the_page)):
        if 'lon=' in the_page[iel] and NewCounts['lon'][index]==0.0:
            NewCounts['lon'][index]=float(the_page[iel].split("'")[1])
        if 'lat=' in the_page[iel] and NewCounts['lat'][index]==0.0:
            NewCounts['lat'][index]=float(the_page[iel].split("'")[1])

    print row['City'],NewCounts['lon'][index],NewCounts['lat'][index]

The MapQuest API didn’t have any specific usage constraints for how frequently you make a request, just overall number in a month, but I added a small delay between calls using the time.sleep() function anyway.

That’s all for now, hopefully some more posts with colorful plots coming soon!