Analytics/Archive/Geowiki


This page describes the legacy Python-based Geowiki system. For the new system introduced in 2018, see Analytics/Systems/Geoeditors.


Geowiki is a set of scripts that automatically analyzes the active editors per project and country. The generated data is split into a public part (available through http://gp.wmflabs.org/ , cf. domain description) and a "foundation-only" part (available through https://stats.wikimedia.org/geowiki-private/ ).

Source code

The source code for the geowiki scripts themselves is at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki .

The repository holding the generated public data can be found at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki-data . The repository holding the generated "foundation-only" data gets synced over to machines by requiring puppet's misc::statistics::geowiki::data::private_bare::sync.

Generated data

The geowiki scripts generate several hundred files. To keep the overview manageable, we use ${WIKI_NAME} to refer to the names of wikis.

Public data

Dashboards

None.

(The related dashboards/reportcard dashboard is part of the dashboard-data repository. See Global-Dev_Dashboard)

Datafiles

Datasources

Geo

None.

Graphs

Foundation-only data

(Currently, no visualization is offered for this data.)

WIKI_NAME

${WIKI_NAME} is (as of 2013-09-15) any of ab, ace, af, ak, als, am, ang, an, arc, ar, arz, as, ast, av, ay, az, bar, ba, bat_smg, bcl, be, be_x_old, bg, bh, bi, bjn, bm, bn, bo, bpy, br, bs, bug, bxr, ca, cbk_zam, cdo, ceb, ce, chr, ch, chy, ckb, co, crh, cr, csb, cs, cu, cv, cy, da, de, diq, dsb, dv, dz, ee, el, eml, en, eo, es, et, eu, ext, fa, ff, fi, fiu_vro, fj, fo, frp, frr, fr, fur, fy, gan, ga, gd, glk, gl, gn, got, gu, gv, hak, ha, haw, he, hif, hi, hr, hsb, ht, hu, hy, ia, id, ie, ig, ik, ilo, io, is, it, iu, ja, jbo, jv, kaa, kab, ka, kbd, kg, ki, kk, kl, km, kn, koi, ko, krc, ksh, ks, ku, kv, kw, ky, lad, la, lbe, lb, lez, lg, lij, li, lmo, ln, lo, ltg, lt, lv, map_bms, mdf, mg, mhr, mi, mk, ml, mn, mrj, mr, ms, mt, mwl, my, myv, mzn, nah, nap, na, nds_nl, nds, ne, new, nl, nn, no, nov, nrm, nso, nv, ny, oc, om, or, os, pag, pam, pap, pa, pcd, pdc, pih, pi, pl, pms, pnb, pnt, ps, pt, qu, rm, rmy, rn, roa_rup, roa_tara, ro, rue, ru, rw, sah, sa, scn, sco, sc, sd, se, sg, sh, simple, si, sk, sl, sm, sn, so, sq, srn, sr, ss, stq, st, su, sv, sw, szl, ta, te, tet, tg, th, ti, tk, tl, tn, to, tpi, tr, ts, tt, tum, tw, ty, udm, ug, uk, ur, uz, vec, vep, ve, vi, vls, vo, war, wa, wo, wuu, xal, xh, yi, yo, za, zea, zh_classical, zh_min_nan, zh, zh_yue, zu .

To add further wikis, add them to the file geowiki/data/all_ids.tsv.

Dataflow

(Diagram: Dataflow for geowiki)

(For an up-to-date version, see https://commons.wikimedia.org/wiki/File:Geowiki_workflow.png)

The big picture of the dataflow in Geowiki is illustrated in the diagram above. The whole dataflow is cut into five separate tasks:

  1. Aggregation
  2. Extraction and formatting for Limn
  3. Fetching
  4. Bringing data in place
  5. Monitoring

Aggregation

This step is responsible for aggregating the editor information (which is only available for the last 90 days) from the slave databases into a condensed format that is stored in a permanent container.

This aggregation is grouped by project, editor's country, and date.
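
The core of this step can be pictured with a short Python sketch. This is not the actual geowiki/process_data.py; the row layout and the threshold of five edits are assumptions made only for illustration.

 # Minimal illustration of the aggregation step. This is NOT the actual
 # geowiki/process_data.py; the row layout and the threshold of 5 edits
 # are assumptions made for the example.
 from collections import defaultdict
 
 def aggregate(rows, min_edits=5):
     """rows: iterable of (project, country, date, editor_id, edit_count)."""
     edits_per_editor = defaultdict(int)
     for project, country, date, editor_id, edit_count in rows:
         edits_per_editor[(project, country, date, editor_id)] += edit_count
     # Condense to active-editor counts per project/country/date.
     active_editors = defaultdict(int)
     for (project, country, date, _editor), count in edits_per_editor.items():
         if count >= min_edits:
             active_editors[(project, country, date)] += 1
     return dict(active_editors)
 
 print(aggregate([
     ("enwiki", "US", "2013-09-15", 1, 7),
     ("enwiki", "US", "2013-09-15", 2, 3),
     ("dewiki", "DE", "2013-09-15", 3, 12),
 ]))
 # {('enwiki', 'US', '2013-09-15'): 1, ('dewiki', 'DE', '2013-09-15'): 1}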

The implementation of this step can be found in geowiki's geowiki/process_data.py script. Running this script has been puppetized as misc::statistics::geowiki::jobs::data and is run daily on stat1006 at 12:00 (as of 2013-09-15).

Extraction and formatting for Limn

The aggregated data gets formatted for limn and pushed to a public and a private data repository by running the geowiki/make_and_push_limn_files.py script.

Running this script has been puppetized as misc::statistics::geowiki::jobs::limn and is run daily on stat1006 at 15:00 (as of 2013-09-15).

As it has been decided that not all packages required to format the limn files will get puppetized, we have to rely on a pre-initialized setup containing those packages. This setup is currently provided by the user qchris on stat1006.
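
As a rough, hedged sketch of the formatting described above (the real make_and_push_limn_files.py may use a different layout), a limn-style datafile can be thought of as a CSV with one date column and one column per country:

 # Hedged sketch of writing a Limn-style CSV datafile from the aggregated
 # counts; the exact column layout used by make_and_push_limn_files.py is
 # assumed here, not documented.
 import csv
 
 def write_limn_datafile(active_editors, path):
     """active_editors: dict mapping (project, country, date) -> count."""
     countries = sorted({c for (_p, c, _d) in active_editors})
     dates = sorted({d for (_p, _c, d) in active_editors})
     with open(path, "w", newline="") as f:
         writer = csv.writer(f)
         writer.writerow(["date"] + countries)
         for date in dates:
             writer.writerow([date] + [
                 sum(v for (_p, c, d), v in active_editors.items()
                     if c == country and d == date)
                 for country in countries])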

Fetching

Since the computation of geowiki data takes place on a different host, the computed data has to be fetched onto the serving hosts periodically to be able to serve up-to-date data.

For the public data on limn1, up-to-date data is fetched through a cron job that pulls from the geowiki data repository daily at 19:00 (as of 2013-09-15).

For the private data on thorium, up-to-date data is fetched through a cron job that rsyncs the private data bare repository over from stat1006 daily at 17:00 (as of 2013-12-11).
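
What the two fetch jobs do boils down to a git pull and an rsync. The following Python sketch is only an illustration; the paths and remote names are made up, and the real jobs are plain cron entries rather than Python:

 # Illustration only: paths and remote names are hypothetical, and the
 # real jobs are cron entries, not Python.
 import subprocess
 
 def fetch_public_data(checkout="/var/lib/geowiki/geowiki-data"):
     # limn1: update the working copy of the public geowiki-data repository.
     subprocess.check_call(["git", "-C", checkout, "pull", "--ff-only"])
 
 def fetch_private_data(src="stat1006::geowiki-private-bare/",
                        dst="/srv/geowiki-private-bare/"):
     # thorium: rsync the private data bare repository over from stat1006.
     subprocess.check_call(["rsync", "-a", "--delete", src, dst])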

Bringing data in place

The data fetched to limn1 in the previous step relies on absolute paths that are occupied by a different repository on the limn instance, so we have to link the geowiki data into the correct place to make the graphs etc. work. This linking happens daily at 21:00 (as of 2013-09-15) through a cron job that runs dashboard-data's blend_in_repository.sh.
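
Conceptually, the linking amounts to something like the following Python sketch; the actual blend_in_repository.sh is a shell script, and the paths here are hypothetical:

 # Conceptual equivalent of the linking done by blend_in_repository.sh;
 # the real script is shell and the paths passed in would be the limn
 # instance's data directories.
 import os
 
 def link_geowiki_data(src_dir, dst_dir):
     os.makedirs(dst_dir, exist_ok=True)
     for name in os.listdir(src_dir):
         target = os.path.join(src_dir, name)
         link = os.path.join(dst_dir, name)
         if os.path.islink(link):
             os.remove(link)           # replace a stale link
         if not os.path.exists(link):
             os.symlink(target, link)  # leave real files untouched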

We can get rid of this step by finishing the extraction of the geowiki parts out of the dashboard repositories.

Monitoring

To ensure that the data served through limn and thorium is up to date, we check geowiki's data daily and verify that it contains recent enough data and that the contained data is within expected bounds. This monitoring has been puppetized as misc::statistics::geowiki::jobs::monitoring and currently runs on stat1006 daily at 21:30.
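
In essence, the monitoring performs two checks: a freshness check and a plausibility check. The sketch below illustrates both, assuming the datafiles are CSVs with a date column; the thresholds and the file layout are assumptions, not the values used by the puppetized job:

 # Sketch of the two checks (freshness and plausibility). Thresholds and
 # the CSV layout are assumptions, not what the puppetized job uses.
 import csv
 import datetime
 
 def check_datafile(path, max_age_days=3, lower=0, upper=10000000):
     with open(path, newline="") as f:
         rows = list(csv.DictReader(f))
     # Freshness: the newest date in the file must be recent enough.
     latest = max(datetime.date.fromisoformat(row["date"]) for row in rows)
     if (datetime.date.today() - latest).days > max_age_days:
         raise RuntimeError("stale data: latest date is %s" % latest)
     # Plausibility: all counts must lie within the expected bounds.
     for row in rows:
         for column, value in row.items():
             if column != "date" and not lower <= float(value) <= upper:
                 raise RuntimeError("value out of bounds: %s=%s" % (column, value))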

See also