Experimentation Lab/Automated analysis of experiments/Converting queries for product health monitoring

If you conducted an experiment with Experimentation Lab, analyzed it with the automated analytics system, and decided to ship the tested treatment to all users, you may be interested in continuing to track the metrics you measured during the experiment.

Currently we do not have an automated analytics system for product health monitoring like we have for experiments. We are exploring options in this problem space as part of the SDS 2 objective in FY25/26.

In the meantime, this guide will help you get started with a custom solution that you will own. This guide assumes that the necessary steps have been followed to convert the instrumentation used in the experiment to baseline instrumentation that is collecting data on an ongoing basis.

Suppose we wanted to have a Superset dashboard for monitoring the metric we care about.

Measurement options

Between the data collected by an instrument and a metric visualized in Superset there is an important step: data processing/wrangling. You need to decide how to turn interaction data into measurements of a metric. There are two options, each with trade-offs to consider.

Option 1: Presto-based dashboard on top of raw interaction data (recommended)

Positives:

  • Quick to set up, assuming you know how to query the data lake using Presto SQL.
  • Easy to make small adjustments to how the metric is calculated or change/add dimensions.
  • No maintenance costs.

Negatives:

  • Metrics are calculated on demand whenever the dashboard is accessed. Performance can deteriorate with a large enough table; if you are collecting a lot of data, you may run into timeouts when you load the dashboard.
  • Due to the Foundation's data retention policy, you are limited to 90 days of data. Beyond that you would need to use the sanitization system.
    • If queries are updated to get data from both the event and event_sanitized databases, the same performance considerations apply as with a large volume of data (see the sketch after this list).
    • There are other considerations, so please refer to the section on long-term monitoring below.
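
If your dashboard does end up querying both databases, here is a minimal sketch of a combined query. It assumes your stream has an entry in the sanitization allowlist and that the sanitized copy keeps the same table name and the columns you need; both are assumptions to verify, and the table name below is only illustrative.

-- Sketch only: combine recent raw events with older sanitized events.
-- In practice you would also filter on the partition columns (e.g. year,
-- month, day) to avoid scanning the full tables on every dashboard load.
WITH combined AS (
  -- Raw events, retained for roughly 90 days
  SELECT meta.dt AS dt
  FROM event.mediawiki_product_metrics_rc_watchlist_baseline
  WHERE CAST(FROM_ISO8601_TIMESTAMP(meta.dt) AS DATE) >= CURRENT_DATE - INTERVAL '90' DAY

  UNION ALL

  -- Sanitized copy, retained longer; restricted to the older date range
  -- so the two branches do not count the same events twice
  SELECT meta.dt AS dt
  FROM event_sanitized.mediawiki_product_metrics_rc_watchlist_baseline
  WHERE CAST(FROM_ISO8601_TIMESTAMP(meta.dt) AS DATE) < CURRENT_DATE - INTERVAL '90' DAY
)
SELECT
  CAST(FROM_ISO8601_TIMESTAMP(dt) AS DATE) AS day,
  COUNT(*) AS events
FROM combined
GROUP BY 1
ORDER BY 1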

Option 2: Airflow-based data pipeline, Superset on top of aggregated data

Positives:

  • Dashboard loads quickly because calculations are offloaded to a pipeline rather than being done on-demand, so the dashboard is just reporting precalculated metrics.
  • Easy to backfill.
  • Can publish a public version of the dataset to share with volunteers, since the community does not have access to Superset.

Negatives:

  • Steep learning curve for getting started with Airflow if you have not used it before.
  • Somebody needs to own/maintain the data pipeline.
  • Need to determine dimensions ahead of time.

Converting a metric from the metrics catalog

If your experiment's metric relies on an edge unique-derived identifier (e.g. to measure retention of logged-out readers over the span of 2 weeks), you will not be able to convert that metric to a product health metric. Outside of experiments, the only identifiers available to you are: session ID and MediaWiki username/user ID.

Let us pick one of the metrics from the metrics catalog – say, Clickthrough to a Page History page per Watchlist visit – and convert the query for use in our Superset dashboard. This guide assumes you have chosen our recommended option 1 and that the instrumentation is staying the same (not deviating from the instrumentation specification that corresponds to this metric definition).

The logic of the query is:

  1. For each subject (user), count how many Watchlist visits had edits shown (something to click on) and count how many of those visits resulted in at least one click on a Page History link.
  2. Calculate the CTR for each subject (user).
  3. Average across all subjects' (users') CTRs in the group to get the average CTR for the group.

Thus, we have a CTR for the control group and a CTR for the treatment group.
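
For example (with made-up numbers): if one user in a group clicks through on 2 of their 4 eligible Watchlist visits (CTR = 0.5) and another clicks through on 1 of 1 (CTR = 1.0), the group's metric is (0.5 + 1.0) / 2 = 0.75 – the average of the per-user CTRs – rather than the pooled 3 / 5 = 0.6.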

That logic is formalized in the query template for that metric:

WITH per_page_interactions AS (
  SELECT
    IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
    experiment.subject_id,
    performer.pageview_id,
    SUM(IF(action = 'page-visited', 1, 0)) > 0 AS could_click,
    CAST(SUM(IF(action = 'click', 1, 0)) > 0 AS INT) AS clicked_through
  FROM {table}
  WHERE action_source = 'Watchlist'
    AND (
      (
        action = 'page-visited'
        AND GET_JSON_OBJECT(action_context, '$.hc') = 'y'
      )
      OR (
        action = 'click'
        AND GET_JSON_OBJECT(action_context, '$.link') = 'page-history'
      )
    )
    AND instrument_name = 'RCClickTracker'
    AND {where_boilerplate}
  GROUP BY 1, 2, 3
)
SELECT
  variation,
  subject_id,
  AVG(clicked_through) AS outcome
FROM per_page_interactions
WHERE could_click
GROUP BY 1, 2

This metric was used in an experiment conducted on logged-in users. We would need to make sure that the baseline instrument collects the performer_name contextual attribute, because we will use it in place of a subject ID (which was derived from the MediaWiki user's central ID during the experiment). Suppose we wanted to calculate the average CTR for each day. We can modify the query template above into the following query:

WITH per_page_interactions AS (
  SELECT
    CAST(FROM_ISO8601_TIMESTAMP(meta.dt) AS DATE) AS dt,
    -- ^ or something to this effect to convert timestamp to a day
    performer.name AS username,
    performer.pageview_id,
    SUM(IF(action = 'page-visited', 1, 0)) > 0 AS could_click,
    CAST(SUM(IF(action = 'click', 1, 0)) > 0 AS INT) AS clicked_through
  FROM event.mediawiki_product_metrics_rc_watchlist_baseline
  WHERE action_source = 'Watchlist'
    AND (
      (
        action = 'page-visited'
        AND JSON_EXTRACT_SCALAR(action_context, '$.hc') = 'y' -- Presto equivalent of Spark SQL's GET_JSON_OBJECT
      )
      OR (
        action = 'click'
        AND JSON_EXTRACT_SCALAR(action_context, '$.link') = 'page-history'
      )
    )
    AND instrument_name = 'RCClickTracker'
  GROUP BY 1, 2, 3
), per_user_ctr AS (
  SELECT
    dt,
    username,
    AVG(clicked_through) AS ctr
  FROM per_page_interactions
  WHERE could_click
  GROUP BY 1, 2
)
SELECT
  dt,
  AVG(ctr) AS ctr
FROM per_user_ctr
GROUP BY 1

Notable differences:

  • Replaced variation with a date-type dt column so that the average CTR is calculated within each day.
  • Replaced experiment.subject_id with performer.name.
  • Replaced Spark SQL's GET_JSON_OBJECT with JSON_EXTRACT_SCALAR, Presto's equivalent function (see the note on SQL dialects below).
  • Added an aggregation step – the original query is intended to produce a set of per-user outcomes, but here we average across all users to produce a single measurement of the CTR for each day.
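
If you later want an additional dimension – for example, breaking the daily CTR down by wiki – a minimal sketch is below. It assumes the baseline stream populates the standard meta.domain field; verify that against your stream's schema.

WITH per_page_interactions AS (
  SELECT
    CAST(FROM_ISO8601_TIMESTAMP(meta.dt) AS DATE) AS dt,
    meta.domain AS wiki, -- assumes meta.domain is populated for this stream
    performer.name AS username,
    performer.pageview_id,
    SUM(IF(action = 'page-visited', 1, 0)) > 0 AS could_click,
    CAST(SUM(IF(action = 'click', 1, 0)) > 0 AS INT) AS clicked_through
  FROM event.mediawiki_product_metrics_rc_watchlist_baseline
  WHERE action_source = 'Watchlist'
    AND (
      (action = 'page-visited' AND JSON_EXTRACT_SCALAR(action_context, '$.hc') = 'y')
      OR (action = 'click' AND JSON_EXTRACT_SCALAR(action_context, '$.link') = 'page-history')
    )
    AND instrument_name = 'RCClickTracker'
  GROUP BY 1, 2, 3, 4
), per_user_ctr AS (
  SELECT dt, wiki, username, AVG(clicked_through) AS ctr
  FROM per_page_interactions
  WHERE could_click
  GROUP BY 1, 2, 3
)
SELECT dt, wiki, AVG(ctr) AS ctr
FROM per_user_ctr
GROUP BY 1, 2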

Long term monitoring

As noted above, if you want to see how the metric behaves beyond the most recent 90 days, you will need to either create a data pipeline that produces a dataset of measurements or sanitize the interaction data and adjust your Superset dashboard's Presto queries to also query the sanitized event database (as in the combined query sketch in the Option 1 section above).

If you think you might utilize the event sanitization system (to retain interaction data in perpetuity), you should configure a separate stream for the data you are collecting. Currently we have a default stream product_metrics.web_base (corresponding table: event.product_metrics_web_base) that is shared by different instruments and experiments. We have no plans to start sanitizing that table and retaining its data beyond 90 days.

By configuring a separate stream, you will be able to create an entry for it in the allowlist and have fine control over which columns to keep/drop/hash.

Creating a Presto-based dashboard in Superset

Superset's SQL Lab uses a dialect of SQL called Presto that has its own set of functions different from Spark SQL's (aka HiveQL) and MariaDB SQL's. Query templates in the metrics catalog are written in Spark SQL, intended to be executed with PySpark. Translating a query from Spark SQL to Presto may involve replacing functions and modifying syntax.
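
For example, here is an illustrative (not exhaustive) pairing of a Spark SQL expression from a query template with a Presto equivalent you could run in SQL Lab. The table name is only an example; consult the Presto function documentation for anything not covered here.

-- Spark SQL, as written in a query template:
--   GET_JSON_OBJECT(action_context, '$.link')
-- Presto (Superset SQL Lab) equivalent:
SELECT
  JSON_EXTRACT_SCALAR(action_context, '$.link') AS link,   -- Presto's JSON accessor
  CAST(FROM_ISO8601_TIMESTAMP(meta.dt) AS DATE) AS day     -- parses the ISO 8601 meta.dt string
FROM event.mediawiki_product_metrics_rc_watchlist_baseline -- illustrative table name
WHERE instrument_name = 'RCClickTracker'
LIMIT 10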

Steps for creating a simple chart for use in a dashboard:

  1. Write your query in SQL Lab. Be sure to use the presto_analytics_hive database (not presto_analytics_iceberg) and pick the event schema.
  2. Click RUN. Your query needs to successfully produce a set of results.
  3. Click CREATE CHART. You will be taken to a chart creation page where you can select the type of chart (e.g. line chart).
  4. Once you are satisfied with your chart, click SAVE. You will need to name your dataset and your chart. At this point you can also add the chart to an existing dashboard or create a new dashboard.

The dataset will be a virtual dataset, created from the query. Whenever the dataset is accessed – for example, when opening the dashboard – Superset executes the query. Multiple charts can use the same virtual dataset, but Superset will not execute the query once and reuse the results for each chart; it executes the query once per chart. This is where the earlier points about performance, data volume, and query complexity become very important.