Test Kitchen/Conduct an experiment

From Wikitech

An experiment is a test of a hypothesis designed to provide trustworthy and generalizable data. It imposes an intervention on subjects with the intention of observing what outcome that intervention leads to.[1] This page describes how to conduct an experiment using Test Kitchen, our experimentation platform.

This guide describes the process for implementing a web experiment using the Test Kitchen SDKs (for JavaScript and PHP), configuring the experiment in Test Kitchen UI, analyzing it using our automated analysis, and the next steps after obtaining results. For the manual process of creating an instrument, see Create an instrument. If you are unsure which process to use, reach out in Slack in #talk-to-experiment-platform.

A visual representation of the phases of development and evaluation in which Test Kitchen plays a role.

Plan

We have a draft of an experimentation scorecard which provides a template for designing an experiment. If you try it, please give us feedback so we can continue to improve it.

Measurement plan

As you proceed with making decisions about the design of your experiment, we recommend reviewing the answer to the frequently asked question How many wikis and/or users would we need to achieve statistical significance?

Experiment design: identifier type

Currently, Test Kitchen supports experiments on two populations of users: all user traffic (where enrollment is based on the Edge Unique cookie) and logged-in users only (where enrollment is based on the CentralAuth global ID). When configuring the experiment you will be asked which identifier type you wish to use; this choice is a major component of your experiment design. Your options are:

mw-user
The experiment enrollment authority is MediaWiki and it enrolls users using their CentralAuth global ID. Recommended when it's important to get accurate data about logged-in traffic.
When using mw-user, the user will be consistently enrolled and assigned across all wikis and across all their sessions. The user may log out and log back in, and their enrollment and assignment will not change.
edge-unique
The experiment enrollment authority is our caching system Varnish and it enrolls clients using anonymous cookies (wmf-uniq cookie). Best for experiments on logged-out users.
When using edge-unique, the user will be consistently enrolled and assigned within a top-level domain (e.g. wikipedia.org) for the lifetime of their wmf-uniq cookie. If the cookie is cleared manually or automatically – by, for example, a privacy-enhancing browser – the client is issued a new wmf-uniq cookie, and they may or may not be enrolled again, or assigned to the same group.
NOTICE: Experiments using this identifier type cannot use server-side instrumentation. Due to the privacy design of Edge Uniques, MediaWiki does not have access to that cookie. These experiments can only collect analytics data using client-side instrumentation.

When to use one or the other: when using edge-unique, a small proportion of subjects in your experiment will be logged-in users. Because their enrollment is based on the wmf-uniq cookie, a volunteer may come in and out of the experiment and even switch groups, depending on when and how frequently they clear their cookies (consider, for example, users who only log in inside an incognito window). For that reason we split the analysis of experiments by user authentication status: All, Logged-in only, and Logged-out only. If you need trustworthy insights about the effect of your change on logged-in user behavior, we recommend conducting a separate experiment (either in parallel or as a follow-up) specifically on logged-in users, using mw-user as the identifier type.
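The consistency guarantees above come from deterministic bucketing on the identifier. The following is an illustration only – it is not Test Kitchen's actual algorithm – but it shows why the same cookie value always produces the same enrollment and assignment, while a freshly issued cookie may not:

```javascript
// Hypothetical sketch of deterministic bucketing. hashToUnit() and assign()
// are illustrative helpers, not part of any Test Kitchen SDK.
function hashToUnit( id ) {
    // FNV-1a hash of the identifier, mapped onto [0, 1).
    let h = 2166136261;
    for ( let i = 0; i < id.length; i++ ) {
        h ^= id.charCodeAt( i );
        h = Math.imul( h, 16777619 ) >>> 0;
    }
    return h / 4294967296;
}

function assign( subjectId, samplingRate, groups ) {
    // Enrollment: the identifier's hash must fall below the sampling rate.
    if ( hashToUnit( subjectId ) >= samplingRate ) {
        return null; // not enrolled
    }
    // Assignment: re-hash with a salt so the group choice is independent
    // of the enrollment decision.
    const g = hashToUnit( subjectId + ':group' );
    return groups[ Math.floor( g * groups.length ) ];
}
```

Because the output depends only on the identifier, re-running `assign()` with the same cookie value always yields the same result; a new cookie value is effectively a new subject.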

Experiment design: user traffic per wiki

Traffic allocation % (enrollment sampling rate) can only be increased, not decreased. This is to prevent the scenario where a subject was previously enrolled in an experiment but is no longer enrolled.
Each experiment can only be conducted on a maximum of 100 wikis. Historically, product teams have only needed to experiment on 10-20 wikis.

Maximum traffic allocation rules apply when using the edge-unique identifier type:

  User identifier type    mw_user_id    edge_unique
  English Wikipedia       no max        0.1% max traffic
  Any other wiki          no max        10% max traffic

The maximum traffic allocation on English Wikipedia is 0.1% – which, given the volume of reader traffic, would still be approximately 33K unique devices/day.

The math: English Wikipedia gets about 1B unique devices per month.[2] 1B/30 days is 33.33M/day, and 0.1% of that is 33K/day. For comparison, 0.05% of traffic is ~17K/day and 0.01% of traffic is ~3.3K/day.
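That arithmetic can be checked directly. The figures below are the rough estimates from the text (~1B unique devices per month), not measured values:

```javascript
// Rough traffic-allocation arithmetic for English Wikipedia, using the
// approximate 1B unique devices/month figure cited above.
const devicesPerMonth = 1e9;
const devicesPerDay = devicesPerMonth / 30; // ~33.33M/day

function dailySample( rate ) {
    // rate is the traffic allocation as a fraction, e.g. 0.001 for 0.1%.
    return Math.round( devicesPerDay * rate );
}

dailySample( 0.001 );  // 0.1%  -> ~33K unique devices/day
dailySample( 0.0005 ); // 0.05% -> ~17K/day
dailySample( 0.0001 ); // 0.01% -> ~3.3K/day
```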

Domains are not a reliable differentiator of mobile vs desktop (e.g. a user can use the Minerva Neue (mobile) skin while on the desktop domain), so mobile and desktop variants of domains are bundled: experiments target wikis, not domains. If you need to target the mobile or desktop experience specifically, you will need to do so in your feature using checks such as MobileFrontend's shouldDisplayMobileView() in PHP. Refer to MobileFrontend's FAQ for more information about detecting mobile view in frontend code.

Experiment design: experiment duration

Our rule-of-thumb and general recommendation is to run the experiment for 1-2 weeks using maximum allowed traffic allocation rates. In many cases this will give you sufficient insights to inform your decision making about the change you are testing.

If you want to know the exact sample size to target with your experiment and ensure a certain trustworthiness of results, we recommend working with a specialist from the Product Analytics team to conduct a power analysis. To do this the analyst will need:

  • A baseline measurement of the primary metric that will be used for evaluating the experiment, including that metric's variance
  • A minimally detectable effect (MDE), usually 5% or 10%. A smaller MDE requires a larger experiment. If you powered your experiment to detect a 5% lift – because practically speaking that's the smallest lift you care about – but the true lift is 3% or 4%, the experiment will be underpowered for it. That does not mean you will not be able to detect the lift at all, but the probability of detecting it is lower.
  • A desired statistical power, usually 80%. It is the probability of correctly rejecting the null hypothesis (that there is no effect/impact/lift) when there actually is an effect and the null hypothesis should be rejected. More power requires a larger experiment.
  • A significance level, usually 5%. It is the probability of incorrectly rejecting the null hypothesis (that there is no effect/impact/lift) when there actually is no effect and the null hypothesis should not be rejected. This is the false positive rate, and a smaller significance level requires a larger experiment.

Scenario: imagine you run 100 experiments on the same feature, all powered to 80% to detect an MDE of 7% at significance level 0.05 (the common threshold for p-values), and there actually is a 7% lift from exposure to the change. We can expect about 80 of those experiments to produce results that suggest you ship the change, and about 20 to produce results that suggest you do not. When you conduct an experiment, you do not know whether your experiment-powered-to-80% will be one of the 80 or one of the 20.
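For a rough sense of the numbers involved, the standard two-proportion sample-size formula can be sketched as follows. This is an illustration only – work with a Product Analytics specialist for a real power analysis – and the baseline rate in the example is made up:

```javascript
// Illustrative sample-size arithmetic for comparing two proportions
// (e.g. clickthrough rates). Not a substitute for a proper power analysis.
const Z_ALPHA = 1.96; // two-sided significance level of 5%
const Z_BETA = 0.84;  // statistical power of 80%

function samplePerGroup( baselineRate, relativeMde ) {
    // relativeMde is relative to the baseline, e.g. 0.05 for a 5% lift.
    const p1 = baselineRate;
    const p2 = baselineRate * ( 1 + relativeMde );
    // Sum of the Bernoulli variances of the two groups.
    const variance = p1 * ( 1 - p1 ) + p2 * ( 1 - p2 );
    return Math.ceil(
        Math.pow( Z_ALPHA + Z_BETA, 2 ) * variance / Math.pow( p2 - p1, 2 )
    );
}

// Hypothetical example: a 2% baseline clickthrough rate with a 10% MDE
// needs roughly 80K subjects per group.
samplePerGroup( 0.02, 0.10 );
```

Note how quickly the required sample grows as the MDE shrinks: halving the MDE roughly quadruples the sample size.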

Our recommendation is to run the experiment for 1 week, or 2 weeks if you need more data. A week-long experiment will account for weekday and weekend seasonalities. The users who use your feature on weekdays may have very different profiles from those who use it on weekends, so an experiment conducted on a Tuesday may give very different results than one conducted on a Friday or a Saturday. Conducting the experiment for a full week includes both profiles of user, yielding more trustworthy results.

Note for all user traffic experiments (using edge uniques for enrollment): If your experiment's primary metric relies on the edge unique-derived subject identifier, that metric (e.g. reader retention rate) can only be measured within an experiment. This means that one of the key ingredients for power analysis – a baseline measurement of the primary metric – is not readily available. To address this challenge we recommend conducting a pilot study: a small scale A/A test (an experiment where the treatment group is a second control group and no change is tested) on the same set of wikis you are planning to target with your experiment. This will yield a baseline (with its variance) and a rate of user traffic, which you can then use to determine the duration your experiment should be to obtain a desired sample size.

Instrumentation specification

Once you have a measurement plan, the next step is to create an instrumentation specification ("spec") (template). The instrumentation spec defines all the data you'll collect with your instrument. The spec is also a useful tool for engineers to ensure that all events are being produced and received correctly. For a template and examples of instrumentation specs, see the folder on Google Drive.

Key pieces of information:

  • machine-readable experiment name
    • Instrumentation will use this to determine if the user is a subject enrolled in the experiment
    • This will be recorded under experiment.enrolled in event data
    • This is needed when registering experiments for automated analysis
  • machine-readable treatment group name
    • Instrumentation will use this to determine if the subject is assigned to the group
    • This will be recorded under experiment.assigned in event data

Data collection guidelines

We've designed the contextual attributes that are set for the product_metrics.web_base stream to result in a low-risk data collection activity. If your experiment is using the product_metrics.web_base stream, please select "low risk".

If you are using a different stream for your experiment, please assess the risk level as per https://foundation.wikimedia.org/wiki/Legal:Data_Collection_Guidelines

Code

You can write your experiment code in the WikimediaEvents extension or in your product codebase. See the API Docs for how Test Kitchen can help you, and this guide for how to use a standard clickthrough rate instrument to measure a clickthrough rate and run a simple A/B test in MediaWiki using our experimentation platform.

Scenario for code examples:

  • The experiment is named "Larger default font size", with machine-readable name "larger-default-font-size"
  • The treatment group is named "X Large", with machine-readable name "x-large"

CSS classes

This is for changing the appearance of elements based on the user's enrollment and assignment. You will still need to instrument analytics so you can assess the effect of appearance changes on user experience and behavior, as measured by your metrics.

To make it easy to test appearance changes we automatically add the following classes to the <body> element, no matter which skin is in use:

  • xlab-experiment-{experiment_name}
  • xlab-experiment-{experiment_name}-{group_name}

Where {experiment_name} and {group_name} are machine-readable names of the experiment (that the user is enrolled in) and group (that the user is assigned to), respectively.
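The class names are a simple string composition of those two machine-readable names, sketched here with a hypothetical helper (xlabBodyClasses is not part of any SDK):

```javascript
// Illustrative composition of the body classes added by the platform.
// xlabBodyClasses() is a hypothetical helper for demonstration only.
function xlabBodyClasses( experimentName, groupName ) {
    return [
        'xlab-experiment-' + experimentName,
        'xlab-experiment-' + experimentName + '-' + groupName
    ];
}

xlabBodyClasses( 'larger-default-font-size', 'x-large' );
// -> [ 'xlab-experiment-larger-default-font-size',
//      'xlab-experiment-larger-default-font-size-x-large' ]
```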

Based on the scenario, you can then have CSS that uses Codex's font size design tokens and looks something like:

.xlab-experiment-larger-default-font-size-x-large {
    font-size: var( --font-size-x-large );
}

Users assigned to the treatment group would automatically have that style applied to <body> on every page they view.

We reserve an explicit control group for all experiments. The control group should have the same experience as users who are not enrolled in the experiment. So in this scenario, users in the control group would have a <body> with the class xlab-experiment-larger-default-font-size-control.

Server-side instrumentation

Server-side analytics instrumentation (data collection) is not supported for experiments that use edge-unique identifier type for enrollment due to the privacy design of the edge unique cookie (it is not accessible by MediaWiki). For experiments conducted on all user traffic you can only collect data using client-side instrumentation.

To retrieve enrollment and assignment information:

use MediaWiki\MediaWikiServices;

const EXPERIMENT_NAME = 'larger-default-font-size';
const TREATMENT_GROUP_NAME = 'x-large';

$experimentManager = MediaWikiServices::getInstance()
    ->getService( 'MetricsPlatform.XLab.ExperimentManager' );

$experiment = $experimentManager->getExperiment( EXPERIMENT_NAME );

Server-side feature toggling

When you implement the change to your feature server-side, you condition it on the user being enrolled in the experiment and assigned to the treatment group which will receive the change:

if ( $experiment->isAssignedGroup( TREATMENT_GROUP_NAME ) ) {
    // Code that only runs for subjects in the treatment group
} else {
    // Code that runs for subjects in the control group and
    // users who are not enrolled in experiment
}

Server-side analytics

Server-side analytics instrumentation (data collection) is only supported for logged-in user experiments. For privacy reasons, experiments on all user traffic (which use edge unique cookie for enrollment) must use client-side analytics instrumentation to collect data.

To collect data for the experiment:

// Example of logging a page-visited action
$experiment->send(
    'page-visited', // action
    [
        'instrument_name' => 'PageVisit'
    ]
);

If the user is enrolled in the experiment, this will produce an event to the product_metrics.web_base stream using the latest version of the web base schema (1.4.2 at the time of writing). This event will have all the experiment membership information, such as the name of the experiment and the group they are assigned to. If the user is not enrolled in the experiment, this will do nothing.

Client-side instrumentation

Client-side instrumentation is the only way to collect data for an experiment conducted on all user traffic (using edge-unique identifier type for enrollment) due to the privacy design of the cookie (it is not accessible within MediaWiki).

To retrieve enrollment and assignment information:

const EXPERIMENT_NAME = 'larger-default-font-size';
const TREATMENT_GROUP_NAME = 'x-large'; // only needed for feature toggling, not analytics

const experiment = mw.xLab.getExperiment( EXPERIMENT_NAME );

Client-side feature toggling

When you implement the change to your feature client-side, you condition it on the user being enrolled in the experiment and assigned to the treatment group which will receive the change:

if ( experiment.isAssignedGroup( TREATMENT_GROUP_NAME ) ) {
    // Code that only runs for subjects in the treatment group
} else {
    // Code that runs for subjects in the control group and
    // users who are not enrolled in experiment
}

Client-side analytics

To collect data for the experiment:

// Example of logging a page-visited action
experiment.send(
    'page-visited', // action
    {
        instrument_name: 'PageVisit'
    }
);

If the user is enrolled in the experiment, this will produce an event to the product_metrics.web_base stream using the latest version of the web base schema (1.4.2 at the time of writing). This event will have all the experiment membership information, such as the name of the experiment and the group they are assigned to. If the user is not enrolled in the experiment, this will do nothing.
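For reference, here is a hedged sketch of the experiment-related fields such an event carries, using only field names mentioned elsewhere in this guide (experiment.enrolled, experiment.assigned, experiment.sampling_unit, experiment.subject_id). The real event includes many more attributes from the web base schema, and the subject_id shown is a placeholder:

```javascript
// Illustrative shape of the experiment membership data attached to an event.
// Field names come from this guide; the full event has many more attributes.
const exampleEvent = {
    action: 'page-visited',
    instrument_name: 'PageVisit',
    experiment: {
        enrolled: 'larger-default-font-size', // experiment the user is enrolled in
        assigned: 'x-large',                  // group the user is assigned to
        sampling_unit: 'edge-unique',
        subject_id: '(platform-assigned identifier)' // placeholder value
    }
};
```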

Using custom schema and/or stream

Using custom schemas and streams is currently only supported by the JS SDK for client-side instrumentation. Base schema/stream cannot be overridden when using the PHP SDK for server-side instrumentation.

By default, events produced by experiments use the analytics/product_metrics/web/base schema and flow into the product_metrics.web_base stream. The base stream's current configuration can be viewed at any time via the Stream Configs API and its entry in Event Streams config.

If your experiment requires a different set of contextual attributes to be collected than the ones collected by the base stream, you would need to configure a custom stream – but can still use the base schema. If your experiment requires collecting data that is not supported by the base schema, you will need to create a custom schema and a custom stream; refer to this guide on custom schemas for instructions.

Considerations

  • Preferably, the custom stream is opted out of User-Agent string collection (using this mechanism).
  • Preferably, the custom stream does not include performer_name contextual attribute, because we try not to associate usernames with subject IDs.
  • The custom stream may need to be configured for edge uniques (if your experiment will use that for enrollment).
  • The resulting table in Hive would not be allowlisted for event sanitization and would thus be subject to the 90-day data retention policy.

Stream configuration

If you plan to collect data from an experiment that uses edge uniques for enrollment, the stream must be configured to use_edge_uniques. Otherwise experiment.subject_id will be set to "awaiting" for all events flowing into the stream.

Backport a MediaWiki configuration change as follows.

'wgEventStreams' => [
    'default' => [
        // ...

        // Stream name.
        'mediawiki_product_metrics_translation_mint_for_readers_experiments' => [
            // Schema name.
            'schema_title' => 'analytics/product_metrics/web/translation',
            'destination_event_service' => 'eventgate-analytics-external',
            'producers' => [
                'eventgate' => [
                    'enrich_fields_from_http_headers' => [
                        // Collect the user agent.
                        // Disabled by default.
                        'http.request_headers.user-agent' => 'user-agent',
                    ],
                    // Target logged-out users.
                    'use_edge_uniques' => true,
                ],
                'metrics_platform_client' => [
                    // Contextual attributes.
                    // See https://wikitech.wikimedia.org/wiki/Experimentation_Lab/Contextual_attributes
                    // NOTE `agent_client_platform` and `agent_client_platform_family`
                    //      are automatically added.
                    'provide_values' => [
                        // We recommend collecting `performer_is_logged_in` and `performer_is_temp` at the same time.
                        // See https://phabricator.wikimedia.org/T374812#10953216
                        'performer_is_logged_in',
                        'performer_is_temp',

                        'performer_pageview_id',
                        'mediawiki_database',
                    ],
                ],
            ],
        ],
    ],
],
'wgEventLoggingStreamNames' => [
    'default' => [
        // ... 
        'mediawiki.product_metrics.translation_mint_for_readers.experiments',
    ],
],
'wgMetricsPlatformExperimentStreamNames' => [
    'default' => [
        // ...
        'mediawiki.product_metrics.translation_mint_for_readers.experiments',
    ],
],
Example scenario

We want to run experiments with the MinT for Wiki Readers feature/initiative. Since we need to collect special data such as source and target languages, we will use the analytics/product_metrics/web/translation schema. There is already a mediawiki.product_metrics.translation_mint_for_readers stream configured, so why not use it? Because:

  • It has not yet been opted out of User-Agent string collection
  • That stream is configured to collect performer_name which we would rather avoid for experiments using edge uniques for enrollment
  • The stream has not been enabled for edge uniques
  • The corresponding table mediawiki_product_metrics_translation_mint_for_readers has been allowlisted for event sanitization, so the data is retained in perpetuity, albeit in a sanitized way (with some columns cleared and some identifiers hashed).

So in this case the best thing to do would be to create a separate stream just for experiments, reassess which contextual attributes we need for experiments, and include additional configuration to make it suitable for data collection from all user traffic experiments.

Instrumentation

Next we need to override the default schema ID and stream when we initialize the experiment:

const experiment = mw.xLab.getExperiment( EXPERIMENT_NAME );
experiment.setSchema( '/analytics/product_metrics/web/translation/1.4.2' );
experiment.setStream( 'mediawiki.product_metrics.translation_mint_for_readers.experiments' );

When experiment.send() is called, the events will declare our custom schema and be produced to our custom stream.

Test

This section covers testing of experiments. Refer to Event Platform/Instrumentation How To § Setting up for local testing for general instructions for testing analytics instrumentation locally.

A complete testing workflow should target 3 environments:

  1. local development via #HTTP header
  2. Beta cluster via #Enrollment override
  3. production wikis via #Enrollment override

HTTP header

Make sure you have a Test Kitchen/Local_development_setup up and running.

The X-Experiment-Enrollments: {experiment_machine_readable_name}={assigned_group}; HTTP request header mocks enrollment in the given experiment. Example value: larger-default-font-size=x-large.

Instructions:

  • on Google Chrome, install Inssman
  • add a rule that sets the HTTP header
    • hit Create Rule, then Modify Header
    • If Request - Contain - http://localhost
    • Operator - Set - Request - X-Experiment-Enrollments - {experiment_machine_readable_name}={assigned_group}
    • hit Create
  • go to your feature URL

You can change {assigned_group} to switch between treatment and control groups.
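The header value follows a simple name=group; format. A hypothetical parser makes the format concrete – the real parsing happens inside the platform, not in your code:

```javascript
// Illustrative parser for the X-Experiment-Enrollments header value format
// described above. parseEnrollmentsHeader() is a demonstration helper only.
function parseEnrollmentsHeader( value ) {
    const enrollments = {};
    for ( const pair of value.split( ';' ) ) {
        const [ name, group ] = pair.trim().split( '=' );
        if ( name ) {
            enrollments[ name ] = group;
        }
    }
    return enrollments;
}

parseEnrollmentsHeader( 'larger-default-font-size=x-large;' );
// -> { 'larger-default-font-size': 'x-large' }
```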

The experiment.sampling_unit event field will be set to edge-unique and experiment.subject_id will be set to awaiting, but that's not important for testing variations of your feature. We use this mechanism for testing locally even if the plan/intention is to conduct the experiment on logged-in users only. Rest assured that the experiment properties will be set correctly by our system when the experiment is deployed to production and configured in Test Kitchen UI.

Enrollment override

This method won't fire events; it will only log action and interactionData (as per Experiment#send's parameters) in the browser's JavaScript console.

You have 2 options:

Browser's console

mw.xLab.overrideExperimentGroup() overrides enrollment when you are logged in. The override will persist between page views within your session. Sessions on the Web are… complicated. Usually a session ends when the browser process is terminated. See Data Platform/Sessions#Web for more details.

Invoke the method in your browser's JavaScript console. For example:

mw.xLab.overrideExperimentGroup( 'larger-default-font-size', 'x-large' );

Manually clear the override through mw.xLab.clearExperimentOverride(). For example:

mw.xLab.clearExperimentOverride( 'larger-default-font-size' );

Or clear all overrides through mw.xLab.clearExperimentOverrides():

mw.xLab.clearExperimentOverrides();

URL query parameter

The ?mpo={experiment_machine_readable_name}:{assigned_group} URL query parameter overrides enrollment when you are logged in or out. The override will not persist between page views: if you navigate to another page, you'll have to set the parameter again.

Append the query parameter to your feature URL. For example:

?mpo=larger-default-font-size:x-large

Launch

How to turn an experiment on in Test Kitchen UI (test-kitchen.wikimedia.org)
We require all experiments to be scheduled (configured) and turned on at least 24 hours in advance of their start date.

Once your experiment code has been deployed to production, configure your experiment in Test Kitchen UI to start serving changes to users and collecting data.

Note that once the experiment is scheduled, it will not start collecting data until you turn it on.

Experiments are active if they are turned on and the current time is between noon UTC on the start date and noon UTC on the end date.

For example: if an experiment is scheduled for 2025-06-02 – 2025-06-09 and has been turned on, it will only be active between 2025-06-02T12:00:00Z and 2025-06-09T12:00:00Z.
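The activity rule can be sketched as a small helper (isActive is illustrative, not part of any SDK; whether the end instant itself counts as active is an assumption here):

```javascript
// Illustrative check of the activity window: turned on, and between
// noon UTC on the start date and noon UTC on the end date.
function isActive( turnedOn, startDate, endDate, now ) {
    const start = Date.parse( startDate + 'T12:00:00Z' );
    const end = Date.parse( endDate + 'T12:00:00Z' );
    return turnedOn && now >= start && now < end;
}

isActive( true, '2025-06-02', '2025-06-09',
    Date.parse( '2025-06-05T00:00:00Z' ) ); // true: inside the window
isActive( true, '2025-06-02', '2025-06-09',
    Date.parse( '2025-06-02T09:00:00Z' ) ); // false: before noon on start date
```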

Note that once an experiment is turned on, you won't be able to perform the following:

  • Change the name, machine-readable name, user identifier type, treatment name, machine-readable treatment name, start date, or risk level/security and legal review
  • Add or remove wikis
  • Decrease the traffic allocation for any of the configured wikis (increasing it is allowed)

Emergency shutdown

If something is wrong, you can turn the experiment off in Test Kitchen UI using the same menu you used to turn the experiment on.

For all user traffic experiments (enrollment based on the edge unique cookie), the change will take about 3 minutes to fully propagate through our caching centers, after which no new clients will be enrolled.

For logged-in users experiments, the change will be almost immediate (<1min) in MediaWiki.

Evaluate

For automated analytics MVP only: Results for analyzed experiments are pre-computed and remain available in perpetuity, even after the underlying raw interaction data has been deleted per the 90 day retention policy.

Assuming your experiment has been instrumented with a Test Kitchen SDK and conducted via Test Kitchen UI, you can leverage our automated analysis of experiments. Only experiments configured in Test Kitchen and with configured streams can be registered for automated analysis.

When your experiment goes live, if it has been registered for automated analysis then our system will analyze it, and the results of that analysis will be available at https://superset.wikimedia.org/superset/dashboard/experiment-analytics/ within 3 hours of your experiment going live. (It takes about 2.5 hours for event data to be processed after it is received by EventGate.) Experiment results are then updated hourly for the duration of your experiment.

For all user traffic experiments (where enrollment is based on edge unique cookie), we automatically perform 3 analyses: all users, only logged-in users, and only logged-out users. Please ensure that performer_is_logged_in is one of the contextual attributes in your data collection.

Aftermath

Make decision

If the experiment informed a decision, please document it in Metrics Platform/Decisions informed by experiments for transparency, visibility, and because we use that log to track our success of enabling data-informed decision making through shared experimentation tools and infrastructure.

Share results

The experiment health pane on the dashboard will let you know whether the results of the experiment are suitable for publication on wiki or in another public location, depending on the number of subjects in the experiment. Please refer to the data publication guidelines for more information about publishing data.

Clean up

Remember to decommission the instrument/experiment (using this guide) when you are done.

If you decided to not ship the change based on the results of the experiment, remember to remove the change from the codebase.

Convert experiment to baseline instrument

We do not yet have automated analytics for baseline instruments which collect data for measuring product health metrics. You would need to create a Superset dashboard yourself. Please see this guide for more details.

If you decide to ship the change and collect data on an ongoing basis to measure the metric(s) as a product health metric(s), you can convert the experiment to a baseline instrument:

- const EXPERIMENT_NAME = 'larger-default-font-size';
- const experiment = mw.xLab.getExperiment( EXPERIMENT_NAME );
- experiment.send( 'page-visited', { instrument_name: 'PageVisit' } );
+ const INSTRUMENT_NAME = 'page-visits';
+ const instrument = mw.xLab.getInstrument( INSTRUMENT_NAME );
+ instrument.submitInteraction( 'page-visited' );

And then configure the instrument "Page Visits" (machine-readable name: page-visits) in Test Kitchen UI. Please refer to this guide for more information on measuring product health.

Case study: synthetic A/A test

This section collects links and information related to the Experiment Platform team's first end-to-end test of Test Kitchen – and edge unique-based enrollment of subjects – for reference. If you are conducting an experiment for the first time, it may be helpful to review all the artifacts.

Since this was an A/A test, not an A/B test, we did not vary the user experience and did not need to implement any feature toggling (making execution of some code conditional on the user being enrolled in the experiment and assigned to the treatment group).

FY24/25 SDS 2.4.11 Synthetic A/A Test artifacts

  Phabricator:     Epic
  Documentation:   Measurement Plan (WMF only), Instrumentation Spec (WMF only)
  Instrumentation: Task, Code
  Experiment:      Configuration, Results

The analytics instrument was added to WikimediaEvents in this patch. Note that in addition to the instrument code itself, it also had to be registered with ResourceLoader – see ResourceLoader/Developing with ResourceLoader § JavaScript for more details.

The "Page visits" metric was defined in this patch, where the experiment was also registered for automated analysis. It is okay to register an experiment for automated analysis in advance because the analysis will not be attempted until the experiment has actually started.

The experiment was configured in Test Kitchen UI. The machine-readable name of the experiment (specified in the instrumentation specification) is the most important identifier and had to match the experiment's entry in the registry when it was registered for automated analysis.

References

  1. mw:Product_Analytics/Glossary#Experiment
  2. https://stats.wikimedia.org/#/en.wikipedia.org/reading/unique-devices/normal%7Cline%7C2-year%7C(access-site)~mobile-site*desktop-site%7Cmonthly