Data Platform/Systems/Cluster/Geotagging
Geotagging functions in Hadoop are provided by jars available at hdfs:///wmf/refinery/current/artifacts
Libraries
refinery-core.jar
org.wikimedia.analytics.refinery.core.Geocode exposes two functions:
Function Name | Data Returned |
---|---|
getCountryCode(String ip) | country code |
getGeocodedData(String IP) | <map> containing geocoding information |
refinery-hive.jar
This library provides wrapper functions usable as Hive UDFs.
Hive UDF | Wrapped Function |
---|---|
org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF | org.wikimedia.analytics.refinery.core.Geocode.getCountryCode |
org.wikimedia.analytics.refinery.hive.GetGeoDataUDF | org.wikimedia.analytics.refinery.core.Geocode.getGeocodedData |
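As a rough illustration of how one of these UDFs might be registered and called, the sketch below only assembles the HiveQL statements as text; it does not connect to Hive. The jar file name under the artifacts directory, the function alias get_country_code, and the table and column names are assumptions made for the example and are not confirmed by this page.

```python
# Sketch only: builds the HiveQL needed to register and call the country-code UDF.
# Assumptions (not confirmed by this page): the exact jar file name under the
# artifacts directory, the function alias, and the table/column names.

ARTIFACTS_DIR = "hdfs:///wmf/refinery/current/artifacts"  # path given on this page
HIVE_JAR = ARTIFACTS_DIR + "/refinery-hive.jar"           # assumed file name
UDF_CLASS = "org.wikimedia.analytics.refinery.hive.GetCountryISOCodeUDF"


def build_country_code_query(table, ip_column, alias="get_country_code"):
    """Return HiveQL that registers the UDF and applies it to one column."""
    statements = [
        f"ADD JAR {HIVE_JAR};",
        f"CREATE TEMPORARY FUNCTION {alias} AS '{UDF_CLASS}';",
        f"SELECT {ip_column}, {alias}({ip_column}) AS country_code "
        f"FROM {table} LIMIT 10;",
    ]
    return "\n".join(statements)


if __name__ == "__main__":
    # "my_db.my_requests" and "ip" are placeholder names for illustration.
    print(build_country_code_query("my_db.my_requests", "ip"))
```

The printed statements could then be pasted into a Hive session. GetGeoDataUDF would presumably be registered the same way under a different alias.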
Updates
These functions use a regularly updated version of the MaxMind database, which is downloaded to the folder /usr/share/GeoIP on every node of the cluster.
Currently we download MaxMind daily at 03:00 UTC on two servers: puppetmaster1001 (legacy puppet 5) and puppetserver1001 (new puppet 7). All other servers pull the updates from their respective puppet server (depending on their puppet version) every 30 minutes.
The pipeline is orchestrated with systemd. We should be alerted on failures. If you have access to puppet servers, you can check systemd logs with:
journalctl -u geoip_update_main.service
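On a cluster node, a quick local check of how recently the files in /usr/share/GeoIP were modified can complement the systemd logs. The following is a minimal sketch, not part of the actual pipeline. The 48-hour threshold is an arbitrary assumption, and this page does not state whether file modification times change on every update run, so an old timestamp is a hint to investigate rather than proof of a failure.

```python
import time
from pathlib import Path


def stale_files(directory, max_age_hours=48.0, now=None):
    """Return sorted names of regular files in `directory` whose
    modification time is older than `max_age_hours`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    geoip_dir = Path("/usr/share/GeoIP")  # folder named on this page
    if geoip_dir.is_dir():
        old = stale_files(geoip_dir)  # 48h threshold is an assumption
        print("stale files:", old if old else "none")
    else:
        print(f"{geoip_dir} not found on this host")
```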