Data Platform/Systems/Cluster/Geotagging

Geotagging functions in Hadoop are provided by JARs available on HDFS at hdfs:///wmf/refinery/current/artifacts.

Libraries

refinery-core.jar

org.wikimedia.analytics.refinery.core.Geocode exposes two functions

Function Name                 Data Returned
getCountryCode(String ip)     country code
getGeocodedData(String ip)    a map containing the following geocoding information:
  • continent
  • country_code
  • country
  • subdivision
  • city
  • postal_code
  • latitude
  • longitude
  • timezone

refinery-hive.jar

This library provides wrapper functions usable as Hive UDFs (see the usage sketch after the table below).

Hive UDF                                                      Wrapped Function
org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF   org.wikimedia.analytics.refinery.core.Geocode.getCountryCode
org.wikimedia.analytics.refinery.hive.GetGeoDataUDF          org.wikimedia.analytics.refinery.core.Geocode.getGeocodedData
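
As a usage sketch, the UDFs can be registered for a Hive session from the jar on HDFS and then called like built-in functions. The function aliases and the source database, table, and ip column below are placeholders rather than names defined by refinery, the exact jar file name under the artifacts directory may differ, and the field access assumes GetGeoDataUDF returns the map described in the refinery-core.jar section above:

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;

CREATE TEMPORARY FUNCTION get_country_iso_code AS 'org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF';
CREATE TEMPORARY FUNCTION get_geo_data AS 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF';

-- Placeholder table: substitute one that has a client IP string column named ip.
SELECT
  get_country_iso_code(ip)    AS country_code,
  get_geo_data(ip)['country'] AS country,
  get_geo_data(ip)['city']    AS city
FROM some_database.some_table
LIMIT 10;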

Updates

These functions use a regularly updated version of the MaxMind database, which is downloaded to every node of the cluster into the folder /usr/share/GeoIP.

Currently we download the MaxMind data daily at 03:00 UTC on two servers: puppetmaster1001 (legacy Puppet 5) and puppetserver1001 (new Puppet 7). All other servers pull the updates from their respective Puppet server (depending on their Puppet version) every 30 minutes.

The pipeline is orchestrated with systemd, and alerts should fire on failures. If you have access to the Puppet servers, you can check the systemd logs with:

journalctl -u geoip_update_main.service
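
To confirm that an individual cluster node has picked up a recent copy of the data, you can check the timestamps of the database files in the directory mentioned above:

ls -l /usr/share/GeoIP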