Data Platform/Geoeditors

FIXME: phab:T356538

This describes the Geowiki system as it runs today (April 2018). For reference on the old Python geowiki system, please see: Analytics/Archive/Geowiki.

Geoeditors aggregates location information from edits and enables reporting editing activity levels by country. The resulting data is private for now, but there are efforts to make it public while maintaining privacy. This document describes the system itself, while Analytics/Data Lake/Edits/Geoeditors decribes the dataset.

TL:DR

We have migrated the old geowiki datasets (editors per country) to the analytics stack from the old jobs that used to run in python. Data differs significantly, some per country stats as much as 50%. We think much of this difference is due to the old system dealing with grossly outdated geo-IP info and thus not being able to localize a bunch of IPs.

Comparing Data with Previous System

We performed a series of checks to make sure new data was sensible and in line with old data. We can't share very specific details due to the private nature of this data, but we will show information about the percent differences and top 20 rankings of countries by number of editors and projects by number of editors, to show how similar the data is. I (Milimetric) am not an analyst, but my conclusion is that the systems are tracking the same signal and the new data is an improvement in quality and completeness.

Percent Difference

Here are the results of joining the new data with the old by country and computing statistics on the set of percent differences. We can't establish trends over time with only the few months that we have both sets, so we threw out the outliers by the inter-quartile method and found:

Statistical analysis of (old - new) / old percentages of data by country

Quartile 1           0.3762332224
Quartile 3           0.8296221323
Interquartile Range  0.4533889098
Upper Bound          1.509705497
Lower Bound         -0.3038501423
Average              0.5671689955
Median               0.4869141814

The percent difference was computed as (New - Old) / Old, so this tells us new numbers are about 50% bigger than old numbers, and usually within 37% to 83% bigger. I believe this large difference is due mostly to the outdated geolocation database that was used with the old data. There were over 50 thousand unrecognized IPs in the old dataset and only around 9 thousand in the new. And those 40 thousand or so IPs were probably distributed evenly among the countries with lower editing activity, causing a wide range with distant outliers and higher average difference.

If we do the same thing by project, here are the results:

Statistical analysis of (old - new) / old percentages of data by project

Quartile 1           0.0465909091
Quartile 3           0.2307692308
Interquartile Range  0.1841783217
Upper Bound          0.5070367133
Lower Bound         -0.2296765734
Average              0.1346626531
Median               0.125

Here we see a smaller difference. We checked the raw data that both of these datasets are derived from, and it makes sense. There are about 12% more non-administrative actions than there are namespace 0 actions, and that's how the new and old filters work, respectively. So these findings are in line with what we expect, with other differences being due to different ways of filtering bots and the rest of what is mentioned above.

NOTE: All of the new data we used here excludes anonymous edits, which are substantial, so keep that in mind if you include those. We excluded them because the old data excludes them as well.

Editors in Country

The following two queries were run on just english wikipedia and overall, and the top 20 countries in both cases were fairly similar, but the number of invalid IPs in the Python system are apparent. It seems reasonable to assume that the bigger difference in countries like India can be attributed to this invalid IP problem.

 select country,                                 select country_code,
        sum(count) as editors_in_country                sum(distinct_editors) as editors_in_country
   from erosen_geocode_active_editors_country      from geowiki_monthly
  where start = '2018-03-01'                      where month='2018-03'
    and cohort = 'all'                              and users_are_anonymous = 0
  group by country                                group by country_code
  order by country                                order by country_code
;                                                 limit 10000000
                                                ;

Overall
PYTHON                                  HADOOP
country          editors                country         editors
Invalid IP       50k                    United States   72k
United States    41k                    Germany         27k
Germany          15k                    France          21k
France           13k                    United Kingdom  19k
United Kingdom   13k                    India           19k
Japan            10k                    Japan           16k
Russia           9k                     Russia          13k
Italy            9k                     Italy           13k
Spain            9k                     Spain           12k
India            8k                     Canada          11k
Canada           7k                     Invalid IP      10k
Australia        5k                     Brazil          8k
Brazil           5k                     Netherlands     7k
Poland           5k                     Poland          7k
Ukraine          4k                     Australia       7k
Netherlands      4k                     Ukraine         6k
Iran             4k                     Sweden          5k
Sweden           4k                     Iran            5k
Taiwan           3k                     Israel          5k
Israel           3k

Just English Wikipedia
PYTHON                      HADOOP
country         editors     country         editors
United States   38k         United States   60k
Invalid IP      27k         United Kingdom  15k
United Kingdom  11k         India           14k
India           7k          Canada          8k
Canada          6k          Australia       6k
Australia       5k          Germany         4k
Germany         2k          France          2k
France          1k          Netherlands     2k
Netherlands     1k          Pakistan        2k

Editors on Project

The following two queries compare overall per-project editor rankings as computed through the old and new systems. One of the main differences is that the new system collects data from all types of wikis, so we excluded those as they don't have a direct comparison in the old system:

 select project,                                  select wiki_db,
        sum(count) as editors                            sum(distinct_editors) as editors
   from erosen_geocode_active_editors_country       from geowiki_monthly
  where start = '2018-03-01'                       where month='2018-03'
    and cohort = 'all'                               and users_are_anonymous = 0
  group by project                                 group by wiki_db
  order by project                                 order by wiki_db
;                                                  limit 10000000
                                                 ;


PYTHON              HADOOP
project editors     project editors
enwiki  126k        enwiki  149k
dewiki  20k         frwiki  21k
frwiki  19k         dewiki  21k
eswiki  16k         eswiki  19k
jawiki  14k         jawiki  14k
ruwiki  11k         ruwiki  13k
itwiki  8k          itwiki  9k
zhwiki  8k          zhwiki  9k
ptwiki  6k          ptwiki  7k
plwiki  4k          arwiki  5k
nlwiki  4k          fawiki  5k
fawiki  4k          plwiki  5k
arwiki  4k          nlwiki  4k
svwiki  3k          svwiki  3k
ukwiki  3k          ukwiki  3k
hewiki  3k          hewiki  3k
kowiki  2k          idwiki  3k
cswiki  2k          kowiki  2k
idwiki  2k          cswiki  2k
                    fiwiki  2k

Full Notes with Scripts

/* NOTE(s):

Data problem 1:
    Bots in geowiki are identified by a list of IDs unioned with the users in the user_groups table under the 'bot' group.  The problem is, the list of IDs is static and not separated per project, so all sorts of cross-project ID colissions can occur and not being mainted for so many years basically makes this list more damaging than good: https://github.com/wikimedia/analytics-geowiki/blob/17c1360e4c34bdf2242c6e75c46506bc4c2d0917/geowiki/process_data.py#L68

Data difference 1:
    Old geowiki restricts actions on namespace 0, but as long as we exclude actions to non-administrative work, we are counting them in the new system.  Example checking query:

-- mysql
 use enwiki;
 select count(*),
        sum(if(namespace_zero_edits > 0, 1, 0)) as namespace_zero,
        sum(if(namespace_nonzero_edits > 0, 1, 0)) as not_namespace_zero,
        sum(if(types_zero_and_one > 0, 1, 0)) as non_administrative,
        sum(if(types_not_zero_and_one > 0, 1, 0)) as administrative

   from (select cuc_user,
                sum(if(cuc_namespace = 0, 1, 0)) as namespace_zero_edits,
                sum(if(cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits,
                sum(if(cuc_type in (0,1), 1, 0)) as types_zero_and_one,
                sum(if(cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one

           from wmf_raw.mediawiki_private_cu_changes
          where cuc_user <> 0
            and month = '2018-03'
          group by cuc_user
        ) logged_in_editors
;
-- results
--+----------+----------------+--------------------+--------------------+----------------+
--| count(*) | namespace_zero | not_namespace_zero | non_administrative | administrative |
--+----------+----------------+--------------------+--------------------+----------------+
--|   315231 |         118508 |             256905 |             144139 |         239094 |
--+----------+----------------+--------------------+--------------------+----------------+


-- hive
ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.58.jar;
CREATE TEMPORARY FUNCTION geocode as 'org.wikimedia.analytics.refinery.hive.GeocodedDataUDF';

   with logged_in_editors as (select cuc_user,
                sum(if(cuc_namespace = 0, 1, 0)) as namespace_zero_edits,
                sum(if(cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits,
                sum(if(cuc_type in (0,1), 1, 0)) as types_zero_and_one,
                sum(if(cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one,

                sum(if(geocode(cuc_ip)['country_code']='US' and cuc_namespace = 0, 1, 0)) as namespace_zero_edits_in_US,
                sum(if(geocode(cuc_ip)['country_code']='US' and cuc_namespace <> 0, 1, 0)) as namespace_nonzero_edits_in_US,
                sum(if(geocode(cuc_ip)['country_code']='US' and cuc_type in (0,1), 1, 0)) as types_zero_and_one_in_US,
                sum(if(geocode(cuc_ip)['country_code']='US' and cuc_type not in (0,1), 1, 0)) as types_not_zero_and_one_in_US

           from wmf_raw.mediawiki_private_cu_changes
          where cuc_user <> 0
            and month = '2018-03'
            and wiki_db = 'enwiki'
          group by cuc_user
        )

 select count(*),
        sum(if(namespace_zero_edits > 0, 1, 0)) as namespace_zero,
        sum(if(namespace_nonzero_edits > 0, 1, 0)) as not_namespace_zero,
        sum(if(types_zero_and_one > 0, 1, 0)) as non_administrative,
        sum(if(types_not_zero_and_one > 0, 1, 0)) as administrative,

        sum(if(namespace_zero_edits_in_US > 0, 1, 0)) as namespace_zero_in_US,
        sum(if(namespace_nonzero_edits_in_US > 0, 1, 0)) as not_namespace_zero_in_US,
        sum(if(types_zero_and_one_in_US > 0, 1, 0)) as non_administrative_in_US,
        sum(if(types_not_zero_and_one_in_US > 0, 1, 0)) as administrative_in_US

   from logged_in_editors
;
-- results:
-- count    namespace_zero  not_namespace_zero  non_administrative  administrative  us_editors_with_bots_in_hadoop_geowiki  us_editors_with_bots_in_old_geowiki
-- 322309   120819          263059              146910              245074          261926                                  271881

 */

hive -e "use wmf;
 select wiki_db,
        sum(distinct_editors) as editors_on_project
   from geowiki_monthly
  where month='2018-03'
    and users_are_anonymous = 0
  group by wiki_db
  order by wiki_db
  limit 10000000
;
" > editors-on-project-2018-03-hadoop.tsv

hive -e "use wmf;
 select country_code,
        sum(distinct_editors) as editors_in_country
   from geowiki_monthly
  where month='2018-03'
    and users_are_anonymous = 0
  group by country_code
  order by country_code
  limit 10000000
;
" > editors-in-country-2018-03-hadoop.tsv

hive -e "use wmf;
 select country_code,
        sum(distinct_editors) as editors_in_country
   from geowiki_monthly
  where month='2018-03'
    and users_are_anonymous = 0
    and wiki_db = 'enwiki'
  group by country_code
  order by country_code
  limit 10000000
;
" > editors-in-country-enwiki-2018-03-hadoop.tsv

Migration From Previous Geowiki System

In summary, the old system ran a query on the cu_changes table, geolocated and aggregated the results, and inserted them into analytics-slave.eqiad.wmnet, staging database, in tables called erosen_... It then output files for easier use. The relevant query is here: https://github.com/wikimedia/analytics-geowiki/blob/17c1360e4c34bdf2242c6e75c46506bc4c2d0917/geowiki/mysql_config.py#L51 and the most important details are:

only includes edits on wikipedia projects
only includes edits on namespace 0
only includes edits from logged-in users
includes any type of action, including administrative ones
uses an outdated geolocation method that misses 40,000 IPs more than our current system
Filters bots in a slightly broken way
- Uses a static list of user ids which is used to filter out any editors from any project, but the list is not keyed by project. This means it will occasionally filter out valid users on user id collisions.
- Also uses the user_groups table, which is good

The new system, for which the relevant query is here: https://github.com/wikimedia/analytics-refinery/blob/ead38243d1bc31e8003c735a16fa320cdda5f016/oozie/mediawiki/geowiki/monthly/insert_geowiki_daily_data.hql#L54, tries to correct some of these problems and expand the available data:

includes edits from all projects
includes all edits, regardless of namespace
includes edits from anonymous users but keeps them separate
includes only non-administrative actions (edits, but not page moves)
uses the same geolocation and data pipeline as all the rest of our datasets, so it's consistent
uses user_groups historical data and user name regular expressions to filter out bots, does not use a static list. This means it may miss some known bots that don't register, but data quality will improve as bot tracking is improved in more structured ways.

Migration consisted of pulling data into HDFS and aggregating it with automatic jobs, then vetting the data manually against the old system. The results of vetting are below, and the new Oozie jobs that coordinate everything can be found here: https://github.com/wikimedia/analytics-refinery/blob/ead38243d1bc31e8003c735a16fa320cdda5f016/oozie/mediawiki/geowiki/

Accessing the old Geowiki data

Geowiki data used to be stored in the staging database of analytics-slave. As of June 2018, legacy Geowiki data from 2012 to April 2018 can be accessed through superset via the geowiki_archive_monthly datasource, a modified version of the erosen_geocode_active_editors_country table in MySQL.

Differences with the original data

The data erosen_geocode_active_editors_country table had daily rows for 30 day data. This means there was a lot of data overlapping across rows, but not daily data:

project	country	cohort	start	end	count	ts
en	Spain	all	2016-02-15	2016-03-16	1084	2016-03-18 12:01:14.0
en	Spain	all	2016-02-16	2016-03-17	1093	2016-03-19 12:01:14.0
en	Spain	all	2016-02-17	2016-03-18	1094	2016-03-20 12:01:22.0
en	Spain	all	2016-02-18	2016-03-19	1094	2016-03-21 12:01:24.0
en	Spain	all	2016-02-19	2016-03-20	1083	2016-03-22 12:01:34.0
en	Spain	all	2016-02-20	2016-03-21	1075	2016-03-23 12:01:48.0
en	Spain	all	2016-02-21	2016-03-22	1067	2016-03-24 12:01:23.0
en	Spain	all	2016-02-22	2016-03-23	1067	2016-03-25 12:01:05.0
en	Spain	all	2016-02-23	2016-03-24	1057	2016-03-26 12:01:31.0
en	Spain	all	2016-02-24	2016-03-25	1062	2016-03-27 12:01:31.0

The geowiki_archive_monthly dataset in druid is monthly, so it doesn't specify start/end, just the month for the row. The method of obtaining the monthly values was querying all rows in the original dataset that had started on the first of the month. This way we have data for this selection between the first day of every month, to the next 30 days. The resulting refined dataset has therefore 1/30 of the rows of the original data. We then adjusted every month to its actual number of days (28, 29, 30 or 31) by multiplying the average daily value of that period by the number of days on that month. This normalization may generate non-integer counts, that are rounded to the closest integer value.

Additionally, all country names have been transformed into the ISO_3166-2 standard (United Kingdom => UK).

Querying the original data

The original data is available in Hive, at the following tables of the archive database:

geowiki_archive_active_editors_world
geowiki_archive_country
geowiki_archive_edit_fraction_city
geowiki_archive_edits_country

Missing archive data

Due to outages, the following months have no data in geowiki_archive_monthly:

October 2017
June 2017
April 2017
March 2017
December 2015

You can find out more about those months in the original dataset in hive, since some days do have 30-day computations.