WMDE/Analytics/Technical Wishes


The Technical Wishes team wants to collect product analytics to understand how people use the features we write.  For example, we count the number of times users toggle the syntax highlighting feature, and in which editing interfaces.

Our features emit raw statistics or events into a backend (StatsD or EventGate). The collected data is stored in the analytics cluster.

Statistics (StatsD) are simple enough to be tallied directly in Graphite; they are usually a counter or a gauge.  Any desired segmentation, for example splitting by wiki, must be done at the point of collection.

Events (EventGate) are multidimensional and are collected as discrete payloads, one for each event.  These must be aggregated, that is, reduced to a small number of simpler metrics.
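To make the difference concrete, here is a hedged sketch; the path suffixes and the event fields are hypothetical, loosely modeled on the CodeMirror examples below. A statistic arrives pre-segmented, with each dimension baked into the Graphite path:

MediaWiki.CodeMirror.toggles.enable.wikitext_2010

An event instead carries its dimensions as fields of a single JSON payload, to be aggregated later:

{
  "$schema": "/analytics/legacy/codemirrorusage/1.0.0",
  "wiki": "dewiki",
  "event": {
    "action": "toggle",
    "enabled": true,
    "editor_interface": "wikitext_2010"
  }
}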

Finally, we visualize the data, usually as a graphical dashboard in Grafana.

For more help, please visit IRC:

  • #wikimedia-analytics on Libera.chat: ask about events or the analytics infrastructure.

  • #wikimedia-observability on Libera.chat: talk about Graphite.

Specification

The following template aims to help clarify the goal and the tools in a structured way. It should simplify communication between developers and the product team.

Dimensions

dimension  | possible values              | include as variable?
byEditor   | wikitext_2010, wikitext_2017 | yes
byEnabled  | true/false                   | no
...        | ...                          | ...

Metrics

<name of metric> |                                                        | Toggle                                                                                 | ...
description      | describe the metric                                    | number of button clicks in the editor (to activate or deactivate syntax highlighting) | ...
path             | Graphite path to the metric                            | MediaWiki.CodeMirror.toggles                                                           | …
dimensions       | name available dimensions, defined above               | byEnabled, byWiki, byEditor                                                            | ...
normalization    | describe if and how the data should be normalized      | per session (not per user)                                                             | ...
graphs           | name wanted graphs and link to the ones that are done  | Average activations per session; Activation per session, by wiki; compare toggle activations and deactivations in wikitext_2010 and in wikitext_2017 | ...

We're also experimenting with this format, written like a right-to-left data flow:

Research questions:
Dashboard ▭:            Reference Previews Usage
Panel ▭:
← Aggregate metrics ═:  reference_previews.baseline.footnote_clicks_per_pageview.$wiki
← Aggregation ◯:        reference-previews/baseline
← Collected metrics ═:  ReferencePreviewsBaseline
← Collected by ◯:       Cite extension
← Initiated by ▭:       Click on reference footnote
Status summary:         Active
History:                Starts 2019-11
Deprecation plan:       Drop once the probation period ends. (Note: this period doesn't really exist as a thing, yet.)

Data collection

Note: do not create new EventLogging schemas; that is a legacy system.  Use Modern Event Platform schemas (EventGate) instead.

Scheduling

Before deploying any major feature, we believe it's important to collect baseline data for at least one month, to help us understand the impact of our changes.

Deprecation

Data can be collected continuously until it is no longer needed, but we encourage ourselves to sunset all analytics and remove them at the appropriate time.

Sampling

If instrumentation would record a very large number of events (for example, one per pageview), it's important to sample randomly to keep the volume down. Sampling should be done on a per-session basis when possible, or per pageview, rather than per event. In other words, sample across users so that e.g. frequent usage is still measured correctly. Use built-in functions like mw.eventLog.inSample for this. Try to match your sample rate to that of related features; for example, VisualEditor samples at 6.25%.
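As a minimal sketch of the idea, in Python rather than the client-side JavaScript (production code would call mw.eventLog.inSample instead; the helper and session token below are purely illustrative):

import hashlib

def session_in_sample(session_token, population_size):
    # Hash the session token and keep roughly 1 in `population_size` sessions.
    # The decision depends only on the token, so a session is either fully
    # sampled or fully excluded, and frequent usage is still measured correctly.
    digest = int(hashlib.sha256(session_token.encode()).hexdigest(), 16)
    return digest % population_size == 0

# 6.25% sampling = 1 in 16 sessions, matching VisualEditor's rate.
if session_in_sample("example-session-token", 16):
    print("record this session's events")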

Local development

EventGate

  1. Clone the mediawiki-extensions-EventLogging repo.
  2. See EventLogging/devserver/README.md file for up-to-date instructions.

Aggregation

Analytics cluster access

You’ll need at least some level of analytics access in order to work on the aggregation code. Shell access gives the most flexibility, allowing you to use Jupyter and directly run queries against Hive or the MariaDB replicas. For some minimal cases, Superset access might be enough; it gives you access to the data, but only through a web UI. The drawback of using Superset is that we don’t expect queries run there to behave identically when run on the cluster.

Gaining access rights

  1. Read through responsibilities and the access request process.
  2. Generate a new password-protected SSH key pair to use exclusively for Wikimedia production access.
  3. File an access request on Phabricator:
    • Generic Phabricator template for access requests.
    • The required access group for our use case is analytics-privatedata-users (details and other groups).
  4. Monitor the ticket for any additional information SRE may need.

Setting up environment

1. Configure SSH access

See SRE/Production access#Setting up your SSH config.
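As a rough sketch, an entry might look like the following; the bastion host, username, and key path are illustrative, so follow the linked page for the canonical setup:

# ~/.ssh/config
Host *.eqiad.wmnet
    User your-shell-username
    ProxyJump bast3005.wikimedia.org
    IdentityFile ~/.ssh/id_ed25519_wmf_production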

2. Connecting to analytics client

In order to facilitate easy file sharing within our team, we should primarily use the same analytics client, in this case stat1005. If your SSH config is set up correctly, connecting should be as simple as:

ssh stat1005.eqiad.wmnet


If prompted, verify the corresponding fingerprints:

bast3005: Help:SSH Fingerprints/bast3005.wikimedia.org

stat1005: Help:SSH Fingerprints/stat1005.eqiad.wmnet

List of analytics clients: Analytics/Systems/Clients

3. Configuring Kerberos access

When connected to an analytics client you must authenticate with Kerberos prior to running any queries:

kinit

When asked for a password, check your email for a temporary Kerberos password (this email should have been sent at roughly the same time production access was granted). You should be prompted to change this password after successfully authenticating. This command needs to be re-run at least every 5 days to reauthenticate.
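You can check whether you currently hold a valid ticket with:

klist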

4. [Optional] Installing wmfdata-python

wmfdata-python is a helpful library from the WMF analytics team that provides an easy way to run both HiveQL and SQL queries. Verify that you have a default venv in your home directory on the analytics client:

ls ~/

If venv is not present:

Analytics/Systems/Jupyter#Resetting user virtualenvs (steps 3. and 4.)

Activate the default venv using:

source ~/venv/bin/activate

In order for the pip install to work you will need to run the following before installing (otherwise you may receive a “Connection to pypi.org timed out” error):

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080


Use pip to install wmfdata-python:

pip install git+https://github.com/wikimedia/wmfdata-python.git@release
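To verify the installation, try importing the package:

python -c "import wmfdata"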


Further reading:

Prototype queries

Environments

JupyterLab

JupyterLab instances on the analytics client can be accessed through SSH port tunneling. Disconnect any active SSH session with the analytics client and reconnect using:

ssh -N stat1005.eqiad.wmnet -L 8000:127.0.0.1:8000

Open a browser and navigate to: http://localhost:8000/user/<username>/lab (replace <username> with your shell username)


Log in using your Wikitech credentials.


If you get a “Server not running” error, click “Launch Server” and you should be automatically redirected to the lab interface.

Before you can execute any queries, you will need to authenticate with Kerberos.

In JupyterLab, go to File > New > Terminal and in the new terminal run:

kinit

and enter your Kerberos credentials when asked.


Now you are ready to create a new notebook and get to querying.

In JupyterLab, go to File > New > Notebook and, when asked, select the preferred kernel (it should be “Python 3”). If you followed the wmfdata-python install instructions above, this is the kernel that has the library installed.

Happy querying! (See also examples further below)

For more info on using JupyterLab:

Analytics/Systems/Jupyter#JupyterLab

PAWS

This is another Jupyter notebook platform with access to replicas and dumps, but the frontend is set up for public sharing. See PAWS for more information.

Note: PAWS publishes all of your notebooks and any other files in your home directory publicly to the Internet, for example: https://public-paws.wmcloud.org/User:Adamw/ . It seems that dot-files are skipped, so one workaround for working with sensitive files is to keep them under a ".private/..." directory. Take care not to expose any sensitive data in the output cells.

CLI

Hive queries can be run using beeline, either from the interactive prompt or as a batch script. The database can be selected just like in MySQL, for example “use event”.
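For example, a quick non-interactive query looks like this (the query itself is illustrative):

beeline -e "USE event; SELECT COUNT(*) FROM EditAttemptStep WHERE year = 2021 AND month = 1 AND day = 1;"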

MariaDB queries can be run from the mysql command-line, but you’ll need to set up the relevant replica host in ~/.my.cnf, for example:

# enwiki
host = s1-analytics-replica.eqiad.wmnet
port = 3311

The mapping from wiki database to shard is documented at MariaDB#Sections and shards.

To smoke-test a reportupdater-queries script, use a command-line like this:

bash template_dialog_opens 2021-01-13

Superset

Superset is another tool that can be used to smoke-test SQL queries. It’s also possible to test Hadoop queries, but since they can only be written in SQL (using Presto) rather than HiveQL, it’s important to verify that such queries also work with Hive before using them in the cron job.

Although Superset seems like a good one-stop shop for querying and visualization, we may not be able to replace Grafana with it. Along with other unforeseen pitfalls, it still seems to require users to be in the analytics-privatedata-users access group for both querying and visualizing:

“In both cases, Superset works on behalf of the user, so eventually the username will need to hold read permissions for Hadoop data to correctly visualize what requested. This is guaranteed by being into analytics-privatedata-users, that gets deployed on the Hadoop master nodes (without ssh access) to outline user permissions on HDFS.” (from Analytics/Data access)

Spoof SQL queries locally

When writing queries for the production SQL replicas, it’s also feasible to test them first against a locally installed database (i.e. a MediaWiki dev environment). This has the benefit of requiring no extra permissions and results in faster query times. The downside is the lack of test data, and that you must either configure your MediaWiki install appropriately for the queries you are trying to run or create mock database tables/entries.
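For example, assuming a standard MediaWiki checkout (the query file name is illustrative), a query file can be run against the local database with the sql.php maintenance script:

php maintenance/sql.php path/to/prototype_query.sql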

Using wmfdata-python

Library provided by the WMF analytics team which provides an easy way to run both HiveQL and SQL queries.

Importing

from wmfdata import hive, mariadb

Example MariaDB (SQL)

# Count enwiki users who have the CodeEditor preference enabled.
mariadb.run(
    """
    SELECT
        COUNT(*) AS total
    FROM
        user_properties
    WHERE
        up_property = 'usecodeeditor'
        AND up_value = 1
    """,
    "enwiki"
)

Example Hadoop (HiveQL)

# Count editor "ready" events per editing interface and wiki for one day.
hive.run(
    """
    SELECT
        COUNT(*) AS total,
        event.editor_interface,
        wiki
    FROM
        event.EditAttemptStep
    WHERE
        event.action = 'ready'
        AND event.editor_interface IN (
            'visualeditor',
            'wikitext-2017'
        )
        AND year = 2021
        AND month = 1
        AND day = 1
    GROUP BY
        event.editor_interface,
        wiki
    """
)

Analytics-ReportUpdater (CronJob)

ReportUpdater is a WMF tool which we use to periodically run SQL and Hive queries. The results of these queries are then stored in Graphite for use with Grafana.

For more information on analytics-reportupdater: Analytics/Systems/Reportupdater

Installing a development branch

Reportupdater normally runs on an-launcher1002, a private analytics instance that we cannot access. To help debug queries, you can also run your own copy of the script. Make sure not to send data to Graphite.

  1. Connect to the stat1005 client
    ssh stat1005.eqiad.wmnet
    
  2. Create a working directory (e.g. reportupdater)
    mkdir ~/reportupdater
    
    cd ~/reportupdater
    
  3. Clone analytics-reportupdater
    git clone https://github.com/wikimedia/analytics-reportupdater.git
    
  4. Clone analytics-reportupdater-queries
    git clone https://github.com/wikimedia/analytics-reportupdater-queries
    
  5. Create a new venv for reportupdater
    python3 -m venv --system-site-packages analytics-reportupdater/venv
    
  6. Activate the new venv
    source analytics-reportupdater/venv/bin/activate
    
  7. Install requirements
    export http_proxy=http://webproxy.eqiad.wmnet:8080
    
    export https_proxy=http://webproxy.eqiad.wmnet:8080
    
    pip install -r analytics-reportupdater/requirements.txt
    
  8. Create an output directory
    mkdir ~/reportupdater/output
    


Running

  1. Connect to the stat1005 client
    ssh stat1005.eqiad.wmnet
    
  2. Authenticate with Kerberos
    kinit
    
  3. Activate the venv
    source ~/reportupdater/analytics-reportupdater/venv/bin/activate
    
  4. (optional) Pull your changes for (or directly modify) analytics-reportupdater-queries
  5. (important) Adjust the config.yaml
    If you are working on CodeMirror, you would adjust ~/reportupdater/analytics-reportupdater-queries/codemirror/config.yaml (see the sketch after this list):
    1. Speed up the queries by moving all starts dates closer to the current date
    2. ! important ! Remove the graphite section and all of its values (this prevents breaking production aggregates in Graphite)
    3. ! important ! If you are running SQL queries you will need to replace /etc/mysql/conf.d/stats-research-client.cnf with /etc/mysql/conf.d/analytics-research-client.cnf
  6. Run the queries (example for codemirror)
    python ~/reportupdater/analytics-reportupdater/update_reports.py -l info ~/reportupdater/analytics-reportupdater-queries/codemirror/ ~/reportupdater/output
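A trimmed config.yaml might end up looking roughly like this. This is a hedged sketch: apart from the starts dates and the removed graphite section described above, the key names and report name are illustrative, so consult Analytics/Systems/Reportupdater for the authoritative format:

reports:
    template_dialog_opens:
        granularity: days
        starts: 2021-01-01    # moved close to the current date for speed
        explode_by:
            wiki: enwiki, dewiki
# the graphite: section is removed entirely so nothing is sent to production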
    

Working on queries

TODO

  • Before any query is added to reportupdater-queries it should be smoke-tested (verifying that the queries work) using either the CLI or JupyterLab
  • Explain the path notation in config.yaml

Gotchas

  • Hive query results must have a column which represents the wiki name
  • SQL query results do not need to have such a column (reportupdater runs the query for each wiki database separately according to what is configured in the config.yaml)

Visualization in Grafana

Grafana is the tool we use to visualize statistics (about grafana and wikimedia | list of grafana functions). Tech Wish dashboards are collected in a folder (also findable with tags techwish and wmde).

Tips for designing a dashboard

Add Information

  • Link to the ticket from the board.
  • Add an introductory text element at the top of the board.
  • Name a responsible person.

Display

  • Create one graph for each metric (and if it makes sense for each dimension).
  • Use variables to include more than one dimension if it makes sense.
  • Do not stack charts; if you do, use colors to make it clear it is a stream and not a line graph.

Normalization

  • Use mean numbers and percentages.
  • In addition, use total numbers to show differences in scale.

Migration

Any change to the analytics data pipeline must be considered in light of side effects it may have on already-deployed processes.  The pipeline is similar to a database schema, but each step has different fields.  For example, when collection is changed to include a new field and an old field is deprecated (it must be omitted or sent as null; both are equivalent), the aggregation must take the field change into account and should be liberal about accepting both old and new events.

There are strict rules for how a schema can be updated.  A simplified guideline is that old events must still be valid when tested against a newer schema.  Practically, this means that the only legal changes are to add new optional fields and to add enum options to an existing list.  EventGate includes linting rules that will prevent you from breaking these guidelines.
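As a hypothetical sketch of a legal change (the schema name, field, and version numbers are made up for illustration):

# current.yaml (fragment)
title: analytics/legacy/codemirrorusage
$id: /analytics/legacy/codemirrorusage/1.1.0    # version bumped from 1.0.0
properties:
    toggles:
        type: integer
    session_token:    # newly added and optional, so old events still validate
        type: string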

Install the git commit hook from the top level of the schema repo (e.g. schemas/event/secondary) with npm install.

Edit the current.yaml for your schema, increment the version number, and let the git pre-commit hook take care of propagating the changes into generated files. Once a schema is updated and published, we can update the producer’s extension.json -> attributes.EventLogging.Schemas to target the newer schema revision.
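For example, a hypothetical extension.json fragment matching the sketch above:

"attributes": {
    "EventLogging": {
        "Schemas": {
            "CodeMirrorUsage": "/analytics/legacy/codemirrorusage/1.1.0"
        }
    }
}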

Deployment

Schema changes are deployed by merging patches in the schemas-event-secondary repository. Merging automatically pushes your topic to production and triggers a schema update in Hadoop.

Production schemas are visible on schema.wikimedia.org. There is a time lag of up to an hour while we wait for puppet to copy new schemas to the machines.

Schema “examples” events specified in the YAML are pushed through a separate “canary” domain roughly hourly; these test that the topic and its validator are working as expected.

Monitor the eventgate-validator dashboard and make sure that the collection code isn’t sending invalid events.

Deprecation

Most of our metrics will be used for short-term questions surrounding product launch. These should be associated with a clean-up ticket, and clearly marked as temporary in source files. When we no longer have need for the metrics, they should be removed starting with the visualization and working backwards to aggregation and collection.

Long-term metrics should be carefully selected; they should be only the most useful indicators, and where possible any invasive measurements (e.g. raw user ID or edit count) should be removed from collection.