This describes the Geowiki system as it runs today (April 2018). For reference on the old Python geowiki system, please see: Analytics/Archive/Geowiki.
Geoeditors aggregates location information from edits and enables reporting editing activity levels by country. The resulting data is private for now, but there are efforts to make it public while maintaining privacy. This document describes the system itself, while Analytics/Data Lake/Edits/Geoeditors decribes the dataset.
- 1 TL:DR
- 2 Comparing Data with Previous System
- 3 Migration From Previous Geowiki System
- 4 Accessing the old Geowiki data
We have migrated the old geowiki datasets (editors per country) to the analytics stack from the old jobs that used to run in python. Data differs significantly, some per country stats as much as 50%. We think much of this difference is due to the old system dealing with grossly outdated geo-IP info and thus not being able to localize a bunch of IPs.
Comparing Data with Previous System
We performed a series of checks to make sure new data was sensible and in line with old data. We can't share very specific details due to the private nature of this data, but we will show information about the percent differences and top 20 rankings of countries by number of editors and projects by number of editors, to show how similar the data is. I (Milimetric) am not an analyst, but my conclusion is that the systems are tracking the same signal and the new data is an improvement in quality and completeness.
Here are the results of joining the new data with the old by country and computing statistics on the set of percent differences. We can't establish trends over time with only the few months that we have both sets, so we threw out the outliers by the inter-quartile method and found:
Statistical analysis of (old - new) / old percentages of data by country Quartile 1 0.3762332224 Quartile 3 0.8296221323 Interquartile Range 0.4533889098 Upper Bound 1.509705497 Lower Bound -0.3038501423 Average 0.5671689955 Median 0.4869141814
The percent difference was computed as (New - Old) / Old, so this tells us new numbers are about 50% bigger than old numbers, and usually within 37% to 83% bigger. I believe this large difference is due mostly to the outdated geolocation database that was used with the old data. There were over 50 thousand unrecognized IPs in the old dataset and only around 9 thousand in the new. And those 40 thousand or so IPs were probably distributed evenly among the countries with lower editing activity, causing a wide range with distant outliers and higher average difference.
If we do the same thing by project, here are the results:
Statistical analysis of (old - new) / old percentages of data by project Quartile 1 0.0465909091 Quartile 3 0.2307692308 Interquartile Range 0.1841783217 Upper Bound 0.5070367133 Lower Bound -0.2296765734 Average 0.1346626531 Median 0.125
Here we see a smaller difference. We checked the raw data that both of these datasets are derived from, and it makes sense. There are about 12% more non-administrative actions than there are namespace 0 actions, and that's how the new and old filters work, respectively. So these findings are in line with what we expect, with other differences being due to different ways of filtering bots and the rest of what is mentioned above.
NOTE: All of the new data we used here excludes anonymous edits, which are substantial, so keep that in mind if you include those. We excluded them because the old data excludes them as well.
Editors in Country
The following two queries were run on just english wikipedia and overall, and the top 20 countries in both cases were fairly similar, but the number of invalid IPs in the Python system are apparent. It seems reasonable to assume that the bigger difference in countries like India can be attributed to this invalid IP problem.
select country, select country_code, sum(count) as editors_in_country sum(distinct_editors) as editors_in_country from erosen_geocode_active_editors_country from geowiki_monthly where start = '2018-03-01' where month='2018-03' and cohort = 'all' and users_are_anonymous = 0 group by country group by country_code order by country order by country_code ; limit 10000000 ; Overall PYTHON HADOOP country editors country editors Invalid IP 50k United States 72k United States 41k Germany 27k Germany 15k France 21k France 13k United Kingdom 19k United Kingdom 13k India 19k Japan 10k Japan 16k Russia 9k Russia 13k Italy 9k Italy 13k Spain 9k Spain 12k India 8k Canada 11k Canada 7k Invalid IP 10k Australia 5k Brazil 8k Brazil 5k Netherlands 7k Poland 5k Poland 7k Ukraine 4k Australia 7k Netherlands 4k Ukraine 6k Iran 4k Sweden 5k Sweden 4k Iran 5k Taiwan 3k Israel 5k Israel 3k Just English Wikipedia PYTHON HADOOP country editors country editors United States 38k United States 60k Invalid IP 27k United Kingdom 15k United Kingdom 11k India 14k India 7k Canada 8k Canada 6k Australia 6k Australia 5k Germany 4k Germany 2k France 2k France 1k Netherlands 2k Netherlands 1k Pakistan 2k
Editors on Project
The following two queries compare overall per-project editor rankings as computed through the old and new systems. One of the main differences is that the new system collects data from all types of wikis, so we excluded those as they don't have a direct comparison in the old system:
select project, select wiki_db, sum(count) as editors sum(distinct_editors) as editors from erosen_geocode_active_editors_country from geowiki_monthly where start = '2018-03-01' where month='2018-03' and cohort = 'all' and users_are_anonymous = 0 group by project group by wiki_db order by project order by wiki_db ; limit 10000000 ; PYTHON HADOOP project editors project editors enwiki 126k enwiki 149k dewiki 20k frwiki 21k frwiki 19k dewiki 21k eswiki 16k eswiki 19k jawiki 14k jawiki 14k ruwiki 11k ruwiki 13k itwiki 8k itwiki 9k zhwiki 8k zhwiki 9k ptwiki 6k ptwiki 7k plwiki 4k arwiki 5k nlwiki 4k fawiki 5k fawiki 4k plwiki 5k arwiki 4k nlwiki 4k svwiki 3k svwiki 3k ukwiki 3k ukwiki 3k hewiki 3k hewiki 3k kowiki 2k idwiki 3k cswiki 2k kowiki 2k idwiki 2k cswiki 2k fiwiki 2k
Full Notes with Scripts
/* NOTE(s): Data problem 1: Bots in geowiki are identified by a list of IDs unioned with the users in the user_groups table under the 'bot' group. The problem is, the list of IDs is static and not separated per project, so all sorts of cross-project ID colissions can occur and not being mainted for so many years basically makes this list more damaging than good: https://github.com/wikimedia/analytics-geowiki/blob/17c1360e4c34bdf2242c6e75c46506bc4c2d0917/geowiki/process_data.py#L68 Data difference 1: Old geowiki restricts actions on namespace 0, but as long as we exclude actions to non-administrative work, we are counting them in the new system. Example checking query: -- mysql use enwiki; select count(*), sum(if(namespace_zero_edits > 0, 1, 0)) as namespace_zero, sum(if(namespace_nonzero_edits > 0, 1, 0)) as not_namespace_zero, sum(if(types_zero_and_one > 0, 1, 0)) as non_administrative, sum(if(types_not_zero_and_one > 0, 1, 0)) as administrative from (select cuc_user, sum(if(cuc_namespace = 0, 1, 0)) as namespace_zero_edits, sum(if(cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits, sum(if(cuc_type in (0,1), 1, 0)) as types_zero_and_one, sum(if(cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one from wmf_raw.mediawiki_private_cu_changes where cuc_user <> 0 and month = '2018-03' group by cuc_user ) logged_in_editors ; -- results --+----------+----------------+--------------------+--------------------+----------------+ --| count(*) | namespace_zero | not_namespace_zero | non_administrative | administrative | --+----------+----------------+--------------------+--------------------+----------------+ --| 315231 | 118508 | 256905 | 144139 | 239094 | --+----------+----------------+--------------------+--------------------+----------------+ -- hive ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.58.jar; CREATE TEMPORARY FUNCTION geocode as 'org.wikimedia.analytics.refinery.hive.GeocodedDataUDF'; with logged_in_editors as (select cuc_user, sum(if(cuc_namespace = 0, 1, 0)) as namespace_zero_edits, sum(if(cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits, sum(if(cuc_type in (0,1), 1, 0)) as types_zero_and_one, sum(if(cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one, sum(if(geocode(cuc_ip)['country_code']='US' and cuc_namespace = 0, 1, 0)) as namespace_zero_edits_in_US, sum(if(geocode(cuc_ip)['country_code']='US' and cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits_in_US, sum(if(geocode(cuc_ip)['country_code']='US' and cuc_type in (0,1), 1, 0)) as types_zero_and_one_in_US, sum(if(geocode(cuc_ip)['country_code']='US' and cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one_in_US from wmf_raw.mediawiki_private_cu_changes where cuc_user <> 0 and month = '2018-03' and wiki_db = 'enwiki' group by cuc_user ) select count(*), sum(if(namespace_zero_edits > 0, 1, 0)) as namespace_zero, sum(if(namespace_nonzero_edits > 0, 1, 0)) as not_namespace_zero, sum(if(types_zero_and_one > 0, 1, 0)) as non_administrative, sum(if(types_not_zero_and_one > 0, 1, 0)) as administrative, sum(if(namespace_zero_edits_in_US > 0, 1, 0)) as namespace_zero_in_US, sum(if(namespace_nonzero_edits_in_US > 0, 1, 0)) as not_namespace_zero_in_US, sum(if(types_zero_and_one_in_US > 0, 1, 0)) as non_administrative_in_US, sum(if(types_not_zero_and_one_in_US > 0, 1, 0)) as administrative_in_US from logged_in_editors ; -- results: -- count namespace_zero not_namespace_zero non_administrative administrative us_editors_with_bots_in_hadoop_geowiki us_editors_with_bots_in_old_geowiki -- 322309 120819 263059 146910 245074 261926 271881 */ hive -e "use wmf; select wiki_db, sum(distinct_editors) as editors_on_project from geowiki_monthly where month='2018-03' and users_are_anonymous = 0 group by wiki_db order by wiki_db limit 10000000 ; " > editors-on-project-2018-03-hadoop.tsv hive -e "use wmf; select country_code, sum(distinct_editors) as editors_in_country from geowiki_monthly where month='2018-03' and users_are_anonymous = 0 group by country_code order by country_code limit 10000000 ; " > editors-in-country-2018-03-hadoop.tsv hive -e "use wmf; select country_code, sum(distinct_editors) as editors_in_country from geowiki_monthly where month='2018-03' and users_are_anonymous = 0 and wiki_db = 'enwiki' group by country_code order by country_code limit 10000000 ; " > editors-in-country-enwiki-2018-03-hadoop.tsv
Migration From Previous Geowiki System
In summary, the old system ran a query on the cu_changes table, geolocated and aggregated the results, and inserted them into analytics-slave.eqiad.wmnet, staging database, in tables called erosen_... It then output files for easier use. The relevant query is here: https://github.com/wikimedia/analytics-geowiki/blob/17c1360e4c34bdf2242c6e75c46506bc4c2d0917/geowiki/mysql_config.py#L51 and the most important details are:
- only includes edits on namespace 0
- only includes edits from logged-in users
- includes any type of action, including administrative ones
- uses an outdated geolocation method that misses 40,000 IPs more than our current system
- Filters bots in a slightly broken way
- Uses a static list of user ids which is used to filter out any editors from any project, but the list is not keyed by project. This means it will occasionally filter out valid users on user id collisions.
- Also uses the user_groups table, which is good
The new system, for which the relevant query is here: https://github.com/wikimedia/analytics-refinery/blob/ead38243d1bc31e8003c735a16fa320cdda5f016/oozie/mediawiki/geowiki/monthly/insert_geowiki_daily_data.hql#L54, tries to correct some of these problems and expand the available data:
- includes all edits, regardless of namespace
- includes edits from anonymous users but keeps them separate
- includes only non-administrative actions (edits, but not page moves)
- uses the same geolocation and data pipeline as all the rest of our datasets, so it's consistent
- uses user_groups historical data and user name regular expressions to filter out bots, does not use a static list. This means it may miss some known bots that don't register, but data quality will improve as bot tracking is improved in more structured ways.
Migration consisted of pulling data into HDFS and aggregating it with automatic jobs, then vetting the data manually against the old system. The results of vetting are below, and the new Oozie jobs that coordinate everything can be found here: https://github.com/wikimedia/analytics-refinery/blob/ead38243d1bc31e8003c735a16fa320cdda5f016/oozie/mediawiki/geowiki/
Accessing the old Geowiki data
Geowiki data used to be stored in the staging database of analytics-slave. As of June 2018, legacy Geowiki data from 2012 to April 2018 can be accessed through superset via the
geowiki_archive_monthly datasource, a modified version of the
erosen_geocode_active_editors_country table in MySQL.
Differences with the original data
erosen_geocode_active_editors_country table had daily rows for 30 day data. This means there was a lot of data overlapping across rows, but not daily data:
project country cohort start end count ts en Spain all 2016-02-15 2016-03-16 1084 2016-03-18 12:01:14.0 en Spain all 2016-02-16 2016-03-17 1093 2016-03-19 12:01:14.0 en Spain all 2016-02-17 2016-03-18 1094 2016-03-20 12:01:22.0 en Spain all 2016-02-18 2016-03-19 1094 2016-03-21 12:01:24.0 en Spain all 2016-02-19 2016-03-20 1083 2016-03-22 12:01:34.0 en Spain all 2016-02-20 2016-03-21 1075 2016-03-23 12:01:48.0 en Spain all 2016-02-21 2016-03-22 1067 2016-03-24 12:01:23.0 en Spain all 2016-02-22 2016-03-23 1067 2016-03-25 12:01:05.0 en Spain all 2016-02-23 2016-03-24 1057 2016-03-26 12:01:31.0 en Spain all 2016-02-24 2016-03-25 1062 2016-03-27 12:01:31.0
geowiki_archive_monthly dataset in druid is monthly, so it doesn't specify start/end, just the month for the row. The method of obtaining the monthly values was querying all rows in the original dataset that had started on the first of the month. This way we have data for this selection between the first day of every month, to the next 30 days. The resulting refined dataset has therefore 1/30 of the rows of the original data. We then adjusted every month to its actual number of days (28, 29, 30 or 31) by multiplying the average daily value of that period by the number of days on that month. This normalization may generate non-integer counts, that are rounded to the closest integer value.
Additionally, all country names have been transformed into the ISO_3166-2 standard (
United Kingdom => UK).
Querying the original data
The original data is available in Hive, at the following tables of the
Missing archive data
Due to outages, the following months have no data in
- October 2017
- June 2017
- April 2017
- March 2017
- December 2015
You can find out more about those months in the original dataset in hive, since some days do have 30-day computations.