Geolocation

Geolocation is based on the MaxMind GeoIP2 database paid for by the WMF, and is used in two ways:

  • Varnish adds a cookie called GeoIP (only if the request does not already have one), with lifetime set to the current session, in the format <ISO 3166-1 country code>:<ISO 3166-2 region code>:<city name>:<lat>:<long>:<???>
  • The analytics pipeline adds geolocation data to the geocoded_data field of the webrequest table, based on the IP address.
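For code that consumes the cookie, the colon-separated format described above could be split roughly as follows. This is a minimal sketch, not the parser used in production: the names GeoIPCookie and parse_geoip_cookie are hypothetical, the sample value in the usage lines is invented for illustration, and the sixth field is kept as an opaque string because its meaning is not documented here. It also assumes that city names contain no colons and are not URL-encoded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoIPCookie:
    country: str          # ISO 3166-1 country code
    region: str           # ISO 3166-2 region code
    city: str
    lat: Optional[float]
    lon: Optional[float]
    extra: str            # sixth field; meaning not documented on this page


def _to_float(text: str) -> Optional[float]:
    """Return a float, or None when the field is empty or malformed."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_geoip_cookie(value: str) -> GeoIPCookie:
    """Split a GeoIP cookie value into its colon-separated fields."""
    parts = value.split(":")
    # Pad so that short cookies still yield all six fields.
    parts += [""] * (6 - len(parts))
    country, region, city, lat, lon = parts[:5]
    # Anything after the fifth field is kept verbatim.
    extra = ":".join(parts[5:])
    return GeoIPCookie(country, region, city, _to_float(lat), _to_float(lon), extra)


# Illustrative value only; real cookies may differ.
cookie = parse_geoip_cookie("US:CA:San Francisco:37.77:-122.42:v4")
print(cookie.country, cookie.city, cookie.lat, cookie.lon)
```

Keeping the coordinates as Optional[float] means an empty or malformed field becomes None instead of raising an error.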

To look up data by hand, log in to mwlog1001 or mwmaint1002 and run mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> (see MaxMind's site for documentation of the returned data structure). If you only want a single field, append the path to that field, for example mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> country names en.
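If lookups need to be scripted, the command above can be wrapped in a few lines of Python. This is a sketch under assumptions: build_mmdblookup_cmd and lookup are hypothetical helper names, the script must run on a host that has both the mmdblookup binary and the database file, and the output is returned as raw text without parsing.

```python
import ipaddress
import subprocess

# Database path taken from the command shown above.
DB_PATH = "/usr/share/GeoIP/GeoIP2-City.mmdb"


def build_mmdblookup_cmd(ip, *path, db=DB_PATH):
    """Build the argument list for mmdblookup; `path` selects a single field."""
    # Raises ValueError early if `ip` is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(ip)
    return ["mmdblookup", "--file", db, "--ip", ip, *path]


def lookup(ip, *path, db=DB_PATH):
    """Run mmdblookup and return its raw output as text."""
    result = subprocess.run(
        build_mmdblookup_cmd(ip, *path, db=db),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Only the command is built here; call lookup() on a host with the database.
print(build_mmdblookup_cmd("198.51.100.7", "country", "names", "en"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the IP comes from untrusted input.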

History

Geolocation started as a Fundraising-Tech initiative in 2009. Some links around how its various incarnations are or were used:

Unknown country

You may encounter geo data where the country is "Unknown" and the country code is "--".

The primary, and perhaps only, source of this is requests and edits made internally, such as those from bots running on Toolforge and other Wikimedia Cloud Services infrastructure.

These requests have IP addresses starting with "10."; for example, in the cu_changes table:

SELECT
  cuc_ip, cuc_agent,
  COUNT(1) AS n_changes
FROM cu_changes
WHERE cuc_ip RLIKE '^10\\.'
GROUP BY cuc_ip, cuc_agent
ORDER BY n_changes DESC;
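Outside SQL, the same "starts with 10." test can be written as a network membership check. The sketch below uses Python's standard ipaddress module; the name is_internal is hypothetical, and treating all of 10.0.0.0/8 as internal is an assumption based on the prefix used in the query above.

```python
import ipaddress

# Assumption: "internal" means the 10.0.0.0/8 private range, matching the
# '^10\\.' prefix used in the query above.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def is_internal(ip: str) -> bool:
    """Return True if `ip` falls inside 10.0.0.0/8."""
    # IPv6 addresses are never members of an IPv4 network, so they return False.
    return ipaddress.ip_address(ip) in INTERNAL_NET


for candidate in ("10.192.32.203", "100.1.2.3", "198.51.100.7"):
    print(candidate, is_internal(candidate))
```

Unlike a plain string prefix check, this rejects malformed input with a ValueError instead of silently matching it.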

One example is IP address "10.192.32.203" (with User-Agent ChangePropagation-JobQueue/WMF) and indeed, it is one of our servers (cf. codfw.wmnet). If we geolocate that:

ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive-shaded.jar;

CREATE TEMPORARY FUNCTION get_geo_data as 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF';

SELECT get_geo_data('10.192.32.203') AS geo_data;

we get:

{
	"city" : "Unknown",
	"subdivision" : "Unknown",
	"timezone" : "Unknown",
	"country_code" : "--",
	"country" : "Unknown",
	"continent" : "Unknown"
}
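When processing such records downstream, the placeholder values can be detected with a simple check. This is a sketch only: is_unknown_location is a hypothetical helper, and it assumes that a country code of "--" is a sufficient marker, as in the output above.

```python
import json


def is_unknown_location(geo: dict) -> bool:
    """Return True when the record carries the "--" placeholder country code."""
    return geo.get("country_code") == "--"


# Abbreviated copy of the record shown above.
sample = json.loads(
    '{"city": "Unknown", "country_code": "--", "country": "Unknown"}'
)
print(is_unknown_location(sample))
```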