Analytics/Systems/Cluster/Geolocation

From Wikitech

Fundraising and Analytics both use IP geolocation to better understand how users behave and how best to interact with them.

In Fundraising's case, geolocation (formerly through geoiplookup.wikimedia.org) helps to identify which users to display banners to, and in which countries, and to measure banner penetration in various locales. In the case of Analytics, geolocation is used for things like per-country editor breakdowns through the Geowiki scripts, per-country reader breakdowns (example), and ad-hoc data requests from various departments.

In all of these situations, we depend on the MaxMind databases. This is a guide to what they are, how they work (and when they don't), and how to access them internally.

Data

MaxMind is an organisation that provides a variety of both free and paid IP geolocation databases, which resolve down to country, region and city level. The Wikimedia Foundation currently has access to:

  • GeoIP Country Edition: resolves down to the country level (e.g., "United States");
  • GeoIP Region Edition: resolves down to the region level (e.g., "California");
  • GeoIP City Edition: resolves down to the city level (e.g., "San Francisco");
  • GeoIP ASNum Edition: resolves down to Autonomous System Numbers;
  • GeoIP Country V6 Edition: resolves down to the country level for IPv6 IPs.

MaxMind updates these databases once a week, on Tuesdays, and the updates filter down to our machines in the form of entire databases (rather than deltas).

There are a few caveats with using MaxMind's data for geolocation.

While it's the most accurate data we have access to, that doesn't mean it's flawless. MaxMind themselves boast 99.8% accuracy at the country level, but accuracy drops off at the Region level (90% in the US, less elsewhere) and the City level (83% in the US, but only if "accurate" means "within 40km", and less elsewhere). The MaxMind city accuracy report shows high inaccuracy at City resolution for a variety of countries, including many European ones (e.g. Finland). Generally speaking, it's probably not worth relying on for anything below country level unless you really have to.

IPv6 support is currently very patchy for the paid versions, resulting in a generic error message rather than actual data. MaxMind claimed that this would be resolved by "Q4 of 2013"; given that it is early 2014 at the time of writing and it is still not resolved, we can conclude this was optimistic. Finally, when analysing historical data, bear in mind that IPs do (very occasionally, but still occasionally) switch nations between database updates.

Access and formats

Raw data

The full MaxMind databases are downloaded to the /usr/share/GeoIP directory on each of the stat machines (to see the code for this process, search the operations/puppet repository for "geoip"). If you don't have access to those servers and need it, read the production access section of Analytics/Data Access.

The directory includes the data in a legacy format (.dat), but you probably want the MaxMind database format (.mmdb).
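
To see which formats are actually present on a given machine, something like the following sketch may help. It only lists files in the directory and reports when each was last modified locally; the function name list_databases is made up for illustration, and the modification time reflects when the file landed on the machine, which is not necessarily when MaxMind built it.

```python
from datetime import datetime, timezone
from pathlib import Path

# Directory named in this guide; pass a different one if your machine differs.
GEOIP_DIR = Path("/usr/share/GeoIP")


def list_databases(directory=GEOIP_DIR):
    """Return (file name, format, last-modified time in UTC) for each database file."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        # Only the two formats mentioned above are of interest.
        if path.suffix in (".dat", ".mmdb"):
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            kind = "legacy" if path.suffix == ".dat" else "mmdb"
            rows.append((path.name, kind, modified))
    return rows
```

Calling list_databases() on a stat machine and printing the rows should give a quick view of which editions are available and roughly how fresh the local copies are.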

Libraries

MaxMind provides a number of officially supported libraries for accessing the data, including a Python one. Note that these libraries allow you to either access a local copy of the databases or query the MaxMind web service. Use our convenient local copy; sending IP addresses to MaxMind's web service would violate the privacy policy. You don't need to worry about accidentally using the web service, since that would require accidentally figuring out our license key and pasting it into your script, but, still, don't use the web service!
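
As a rough illustration of the local-copy approach, here is a minimal sketch using MaxMind's geoip2 Python package, assuming it is installed. The file name in HYPOTHETICAL_DB_PATH is a guess (this guide does not list the .mmdb file names), and the helper names open_local_reader and describe are invented for the example. The reader only opens a file on disk, so no IP address leaves the machine. Note that reader.city() only works against a City-type database.

```python
# Assumption: the real .mmdb file name is not given in this guide; replace this
# with whatever is actually present in /usr/share/GeoIP.
HYPOTHETICAL_DB_PATH = "/usr/share/GeoIP/GeoIP2-City.mmdb"


def open_local_reader(path=HYPOTHETICAL_DB_PATH):
    """Open a local .mmdb file. This reads from disk and never calls the web service."""
    # Imported here so that the helper below can be used without geoip2 installed.
    import geoip2.database
    return geoip2.database.Reader(path)


def describe(reader, ip):
    """Return country, region and city details for an IP from a City-type database."""
    response = reader.city(ip)
    return {
        "country_code": response.country.iso_code,
        "country": response.country.name,
        "region": response.subdivisions.most_specific.iso_code,
        "city": response.city.name,
    }
```

Typical use would be reader = open_local_reader() followed by describe(reader, "216.38.130.164"), closing the reader with reader.close() when done. Given the accuracy caveats above, the region and city fields are best treated as hints.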

Command-line utility

The databases can also be queried with the geoiplookup command-line utility from any of our analytics machines, including stat1 and stat1002 (stat2), at the path:

/usr/bin/geoiplookup [IP address]

Either way, once queried, the MaxMind databases produce something that looks like...

ironholds@stat1002:~$ /usr/bin/geoiplookup 216.38.130.164 
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, San Francisco, N/A, 37.774899, -122.419403, 807, 415
GeoIP Region Edition, Rev 1: US, CA
GeoIP City Edition, Rev 0: US, CA, San Francisco, N/A, 37.774899, -122.419403
GeoIP Region Edition, Rev 0: US, CA
GeoIP ASNum Edition: AS6994 Fastmetrics

(Using the office IP address for obvious privacy reasons. I suspect people know where we work.)

To break this down, we have the country, city, region and ASNum editions, as mentioned above, with two different revisions of city and region. The IPv6 database isn't displayed, because this isn't an IPv6 address.

For our purposes, the most likely useful datapoints are country, region and city. Country is best retrieved from the GeoIP Country Edition, simply because the outputted data is the most useful; while the region- and city-level databases also generate country IDs, they're two- or three-letter abbreviations that can be difficult for end users to parse if they're passed into datasets or visualisations. The Country Edition, on the other hand, produces the full name ("United States" versus "US").

The databases cannot be queried one-by-one, and will always return some variant on the above data format. This is good because it guarantees you always retrieve all the available data, and bad because it demands some data scrubbing (see the example functions for how this can be handled in R).
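
The scrubbing step can be sketched in Python as a small parser that turns the multi-line output into a dictionary keyed by edition label. This is an illustration based only on the sample output shown above; the function names parse_geoiplookup and country_name are made up here, and other output shapes (error messages, other editions) may need extra handling.

```python
def parse_geoiplookup(output):
    """Map each edition label to the list of comma-separated fields on its line."""
    editions = {}
    for line in output.splitlines():
        # The label ends at the first ": "; everything after it is data.
        label, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank or unexpected lines
        editions[label.strip()] = [field.strip() for field in value.split(",")]
    return editions


def country_name(editions):
    """Return the full country name from the Country Edition, or None if absent."""
    fields = editions.get("GeoIP Country Edition")
    if not fields or len(fields) < 2:
        return None
    # Rejoin in case the country name itself contains a comma.
    return ", ".join(fields[1:])


# Sample taken from the output shown earlier in this guide.
SAMPLE = """\
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, San Francisco, N/A, 37.774899, -122.419403, 807, 415
GeoIP Region Edition, Rev 1: US, CA
GeoIP ASNum Edition: AS6994 Fastmetrics
"""
```

With this, country_name(parse_geoiplookup(SAMPLE)) picks out the full country name, and the other editions stay available under their own labels if you need them.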

Example functions

Example functions for interfacing with the geolocation database; if you've got one in your language (Python, say), post it where everyone can use it, darnit.

R

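  #Function for retrieving country-level data
  geoip <- function(IP){
    
    #Query the MaxMind GeoIP databases for the submitted IP. The IP is quoted so that
    #it can't be interpreted by the shell.
    IPData <- system(command = paste("/usr/bin/geoiplookup", shQuote(IP)),
                     intern = TRUE)
    
    #Keep the Country Edition line, which is the bit we actually care about
    IPData <- grep("^GeoIP Country Edition: ", IPData, value = TRUE)[1]
    
    #If there is no such line, or it holds no "code, name" pair, there is nothing to return
    if(is.na(IPData) || !grepl(", ", IPData, fixed = TRUE)){
      return(NA_character_)
    }
    
    #Strip the edition label and the country code, leaving the full country name
    processed_IPData <- sub("^GeoIP Country Edition: [^,]*, ", "", IPData)
    
    #Return!
    return(processed_IPData)
    
  }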
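
Python

A rough Python counterpart to the R function above, offered as a sketch rather than a vetted implementation. It shells out to the same /usr/bin/geoiplookup path and returns the full country name. The names run_geoiplookup and geoip_country, and the run parameter (a hook so the parsing can be exercised without the binary), are invented for this example.

```python
import ipaddress
import subprocess

GEOIPLOOKUP = "/usr/bin/geoiplookup"
COUNTRY_PREFIX = "GeoIP Country Edition: "


def run_geoiplookup(ip):
    """Run the command-line utility and return its raw text output."""
    result = subprocess.run([GEOIPLOOKUP, ip], capture_output=True,
                            text=True, check=True)
    return result.stdout


def geoip_country(ip, run=run_geoiplookup):
    """Return the full country name for an IP, or None if none is reported."""
    # Validate first; ip_address raises ValueError on malformed input.
    ip = str(ipaddress.ip_address(ip))
    for line in run(ip).splitlines():
        if line.startswith(COUNTRY_PREFIX):
            # The line looks like "<code>, <full name>"; keep the name.
            code, sep, name = line[len(COUNTRY_PREFIX):].partition(", ")
            return name if sep else None
    return None
```

Passing the arguments as a list avoids involving a shell at all, and validating the address up front means obviously malformed input never reaches the utility.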