Data Engineering/Systems/Cluster/Geotagging

From Wikitech

Geotagging functions in Hadoop are provided by jars available at hdfs:///wmf/refinery/current/artifacts

Libraries

refinery-core.jar

org.wikimedia.analytics.refinery.core.Geocode exposes two functions:

Function Name | Data Returned
getCountryCode(String ip) | country code
getGeocodedData(String IP) | <map> containing geocoding information:
  • continent
  • country_code
  • country
  • subdivision
  • city
  • postal_code
  • latitude
  • longitude
  • timezone

refinery-hive.jar

This library provides wrapper functions usable as Hive UDFs:

Hive UDF | Wrapped Function
org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF | org.wikimedia.analytics.refinery.core.Geocode.getCountryCode
org.wikimedia.analytics.refinery.hive.GetGeoDataUDF | org.wikimedia.analytics.refinery.core.Geocode.getGeocodedData
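
As an illustration, the UDFs listed above could be registered and called from a Hive session roughly as follows. This is a minimal sketch, not a verified recipe: it assumes the jar sits directly under the artifacts folder as refinery-hive.jar, the function aliases (get_country_code, get_geo_data) are arbitrary names chosen here, and some_db.some_table with its ip column is a hypothetical table standing in for any table that stores IP addresses as strings.

```sql
-- Assumption: the jar is stored under the artifacts folder with this exact file name.
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;

-- The aliases are arbitrary; only the class names come from the table above.
CREATE TEMPORARY FUNCTION get_country_code
    AS 'org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF';
CREATE TEMPORARY FUNCTION get_geo_data
    AS 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF';

-- Hypothetical table and column; replace with a real table holding IP strings.
SELECT
    get_country_code(ip)        AS country_code,
    get_geo_data(ip)['city']    AS city,      -- map keys match the list under getGeocodedData
    get_geo_data(ip)['country'] AS country
FROM some_db.some_table
LIMIT 10;
```

Because GetGeoDataUDF returns a map, individual fields are read with the usual Hive map syntax, using the keys listed for getGeocodedData (continent, country_code, city, and so on).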

Updates

These functions use a copy of the MaxMind database that is updated weekly and downloaded to the folder /usr/share/GeoIP on every node of the cluster.