Test Kitchen/Troubleshooting
This page provides troubleshooting instructions for Test Kitchen. The system is relatively new, so there haven't been many incidents to draw troubleshooting instructions from. Instead, this page provides a map of the system's components, guidance for finding experiment configuration details, and dashboards to review. Please add to the end of the page if you encounter incidents; your notes may help someone (possibly yourself) if similar things happen in the future.
See also Test Kitchen/Test Kitchen UI/Administration for additional tips for operators administering the Test Kitchen UI and Test Kitchen API (test-kitchen.wikimedia.org and test-kitchen-next.wikimedia.org).
SLOs
Troubleshooting
Below is the most recent architecture diagram for Experimentation Lab. Clicking on a component or group of components will take you to a section about that component. Each section contains a brief description of the component, notes about monitoring it, and symptoms you might see in production with remediations.

A note about 14:30 UTC
Experiment configuration for pre-planned experiments is generally set to go into force on a start date and to end on an end date, both at 14:30 UTC. Keep this in mind when you see a start date or end date in Test Kitchen UI. A recurring "Test Kitchen UI" experiment window is associated with Deployments for this purpose (heads up: daylight saving time can cause confusion in the schedule, but the TK configuration API uses 14:30 UTC).
Manual configuration can move a start date earlier, which can have the practical effect of "starting" an experiment immediately (provided the experiment is marked as On, which is what makes it eligible to be in effect in the first place, and we're inside the start-end interval). So, when looking at logs, it's typical for things to start showing up or ending around 14:30 UTC, but experiments can also seemingly start at arbitrary times. You may therefore need to triangulate using the regular SAL, the analytics SAL, and other git history.
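The in-force rules described above can be sketched as follows. This is a minimal illustration; the field names status, start, and end are assumptions, not the actual TK configuration API shape:

```python
from datetime import datetime, timezone

def in_force(experiment: dict, now: datetime) -> bool:
    """Return True if an experiment should be in effect right now.

    Assumes hypothetical fields: 'status' ('on'/'off') plus ISO 8601
    'start' and 'end' timestamps, normally at 14:30 UTC.
    """
    if experiment.get("status") != "on":
        return False  # only On experiments are eligible at all
    start = datetime.fromisoformat(experiment["start"])
    end = datetime.fromisoformat(experiment["end"])
    # In force only inside the start-end interval.
    return start <= now < end

exp = {
    "status": "on",
    "start": "2025-09-25T14:30:00+00:00",
    "end": "2025-12-15T14:30:00+00:00",
}
print(in_force(exp, datetime(2025, 10, 1, tzinfo=timezone.utc)))   # True
print(in_force(exp, datetime(2025, 9, 25, 14, 0, tzinfo=timezone.utc)))  # False: before 14:30 UTC
```

Moving the start field earlier (manual intervention) flips the second check to True without any other change, which is why experiments can appear to start at arbitrary times in logs.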
Are experiments involved and causing problems?
When the MetricsPlatform extension determines experiments may be involved, it adds a context_ab_tests data structure to Logstash entries (as context.ab_tests in code) via the BeforePageDisplay hook, which is called early in the page lifecycle; the context_ab_tests field may also have an empty value if no experiments appear to be in play.
For example, see this Logstash dashboard snapshot where a filter is applied looking for the existence of context_ab_tests (you may want to apply a similar filter while using or building other dashboards).

If you only want to see Logstash entries where a user was actually enrolled in an experiment (that is, according to the MediaWiki application server's enrollment routine in the Test Kitchen extension), set a custom filter to ensure the existence of context_ab_tests.enrolled.
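The distinction between the two filters can be sketched like this. The log entries are purely illustrative; only the context_ab_tests field layout is taken from the description above:

```python
entries = [
    {"message": "err A"},                                   # no experiments in play
    {"message": "err B", "context_ab_tests": {}},           # field present but empty
    {"message": "err C", "context_ab_tests": {"enrolled": ["some-experiment"]}},
]

# "Experiments possibly involved": the field merely exists, even if empty.
maybe_involved = [e for e in entries if "context_ab_tests" in e]

# "User actually enrolled": context_ab_tests.enrolled exists and is non-empty.
enrolled = [e for e in entries if (e.get("context_ab_tests") or {}).get("enrolled")]

print(len(maybe_involved), len(enrolled))  # 2 1
```

The narrower enrolled filter is usually what you want when correlating errors with actual experiment participation.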


If you are troubleshooting and see a correlation in Logstash with errors associated with specific A/B tests, you should take the following steps. We'll use an example here.
1. Identify where in code and configuration the experiment may be triggered. This may be obvious from a stack trace, or it may not. Note the value(s) of context_ab_tests.active_experiments and search in Codesearch. In these example screenshots you see a slug (machine-readable name) of growthexperiments-get-started-notification. Take note of the enrolled property and the assigned property, and map the pertinent experiment(s) to the Logstash entries and the corresponding code.

2. Access Test Kitchen UI (formerly known as Experimentation Lab / xLab / MPIC) and search for the slug.

3. Notice that the experiment is intended to be On and that it starts 25th September 2025. Click on the experiment hyperlink.

4. The User Identifier Type is mw-user, which means that inclusion in the experiment is based on sampling using the user ID for the wikis indicated in the Traffic section. Notice that the experiment runs through 15th December 2025 and that it is intended to run on five Wikipedias at 100% sampling. 100% sampling in this context means that when extension code (in this case, GrowthExperiments) checks for the presence of this experiment, all logged-in users can be considered for inclusion in the A/B test.
If the User Identifier Type is edge-unique, this means inclusion is based on sampling at the CDN edge (Varnish) via the edge unique cookie. The sampling rates at the CDN edge need to be much lower in order to keep the edge cache performant.
You can learn more about these "enrollment authorities".
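The general idea behind both enrollment authorities, deterministic sampling of a subject identifier at a configured rate, can be sketched as follows. This is an illustration of hash-based bucketing only, not the actual Test Kitchen enrollment algorithm:

```python
import hashlib

def sampled(subject_id: str, experiment_slug: str, rate: float) -> bool:
    """Illustrative deterministic bucketing: NOT the real Test Kitchen
    algorithm, just the general hash-into-buckets idea. The same subject
    always lands in the same bucket for a given experiment."""
    digest = hashlib.sha256(f"{experiment_slug}:{subject_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16)  # uniform-ish over [0, 2**32)
    return bucket < rate * 2**32

# At 100% sampling, every subject is considered for inclusion...
print(all(sampled(str(uid), "growthexperiments-get-started-notification", 1.0)
          for uid in range(100)))  # True
# ...whereas edge-unique experiments sample at much lower rates at the CDN.
print(sampled("edge-unique-cookie-value", "some-edge-experiment", 0.01))
```

The key property is determinism: re-checking the same subject ID (mw-user) or edge unique cookie value always yields the same enrollment decision for a given experiment and rate.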
5. You'll notice there is a link in Test Kitchen to a Phabricator ticket where you can learn more information about the experiment. You can also view the list of active and archived experiments on-wiki to look for more documentation and who to potentially contact from a team conducting an experiment.
6. If you determine that the errors require action relatively soon, for example because they trigger noticeably buggy behavior, file a task on Phabricator, tag it with the software development team and software component (usually an extension) executing the experiment, and then link to it on the #talk-to-experiment-platform channel on Slack. In your Slack message please include the task number and name for findability, at-mentioning software developers involved with the experiment. Members of the Experiment Platform Team watch this channel regularly and can assist with troubleshooting with the software development team if necessary. If you don't have access to Slack, feel free to reach out to members who are on #wikimedia-operations or #wikimedia-analytics on Libera IRC if you know their handles (including phued.x, cjmin.g, or dr0ptp4k.t, whose bouncers usually catch messages and notify them).
7. If you determine that the errors are serious and require urgent treatment, follow the previous step, and please escalate by additionally at-mentioning engineering management and product management from both the software development team and Experiment Platform Team on Slack as well (DMs and adding people to the thread on #talk-to-experiment-platform are encouraged). If necessary, escalate further on #talk-to-data-engineering and #engineering-all on Slack with a pointer to your Slack message on #talk-to-experiment-platform.
8. Deactivation of an A/B test is an option in exceptional circumstances. Deactivation can invalidate assumptions about data collection for making important decisions about product features. Ideally you will reach agreement on Slack and an ad hoc Meet if this is necessary. If you're lucky you may find a way to roll forward with a hotfix instead.
However, if there is a threat of application data corruption or a site outage and you've been unable to reach people even through escalations, you may need to take measured action. Please state clearly on Slack and on the Phabricator task that you will do so. Wait a few minutes if you can, and then go back to the experiment in Test Kitchen, click on the three dot overflow button for the pertinent experiment, and choose Turn Off. When prompted, please carefully verify that you are turning off the correct experiment, then confirm your choice; then notify people that you have done so on Slack and on the Phabricator task.

Note: The Test Kitchen UI requires authentication via CAS-SSO, which requires a Wikimedia Developer Account and membership in the wmf or nda group.
It is sometimes easier to functionally deactivate an experiment through a Gerrit change. If going this route, it's still important to over-communicate your intent to do so and to update people once you have done so.
Experiment configuration history
IRC and analytics Server Admin Log
Basic IRC logging of experiment configuration changes for experiments that are turned On or Off, or are in an On state but get a change (e.g., experiment end date), occurs on Libera IRC chat on the #wikimedia-analytics channel via the nickname wmftkbot. A continuous Toolforge tool job is run from user tk in order to poll the Test Kitchen API and log these messages.

wmftkbot logging that a change happened to an experiment configuration.
You can find an archive of this logging activity at mw:Analytics/Server_Admin_Log and the Toolforge analytics SAL search.
The bot logs experiment related changes for both mw-user and edge-unique experiments. It does so by checking the Test Kitchen API for each type in an alternating manner every ten seconds.
When it logs, it points out adds (experiments that are made to be On), removes (experiments that have passed their end date or that have been made to be Off), and the most notable changed fields (especially end date, but potentially - via manual intervention - also start date, sample rate, and variants/groups).
This bot runs on a continuous basis. It shows the polling round where it identified the change as (poll #). Because the bot can restart and it's meant to be an application without persisted state, the poll value can reset to 1 if the bot is restarted for any reason. Typically (poll 1) will show all experiments that are indicated as On as of the time of that poll because the application doesn't attempt to reconstruct history; it is starting from a fresh slate.
Similarly, structural changes to Test Kitchen API response shape can result in bot logged messages. For example, on 7th November 2025 "mdot" domains (e.g., en.m.wikipedia.org) were removed from the TK API responses for edge-unique experiments, whereas previously they were included by default (to ensure uniform treatment for small form factors and large form factors alike). This resulted in a logged message. And as there was only one active edge-unique experiment with an On disposition whose end date hadn't passed, it was flagged as having a change.
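A sketch of how the adds/removes/fields classification could be derived from two successive API snapshots keyed by experiment slug (a hypothetical structure; the real bot's code may differ):

```python
def diff_polls(prev: dict, curr: dict):
    """Classify changes between two successive TK API snapshots, each a
    mapping of experiment slug -> configuration. Sketch only; the real
    wmftkbot implementation may differ."""
    adds = sorted(curr.keys() - prev.keys())       # newly On experiments
    removes = sorted(prev.keys() - curr.keys())    # ended or turned Off
    changed = sorted(slug for slug in prev.keys() & curr.keys()
                     if prev[slug] != curr[slug])  # changed fields (e.g. end date)
    return adds, removes, changed

prev = {"exp-a": {"end": "2025-11-12T14:30:00Z"},
        "exp-b": {"end": "2025-12-01T14:30:00Z"}}
curr = {"exp-a": {"end": "2025-11-20T14:30:00Z"},   # end date extended
        "exp-c": {"end": "2026-01-05T14:30:00Z"}}   # newly turned On

print(diff_polls(prev, curr))  # (['exp-c'], ['exp-b'], ['exp-a'])
```

Note that on a fresh start there is no previous snapshot to compare against, which is why (poll 1) typically reports all currently-On experiments as adds, and why a structural change to the API response shape (such as the mdot removal) is indistinguishable from a field change.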
wmftkbot https://toolsadmin.wikimedia.org/tools/id/tk
!log Test Kitchen edge-unique experiments (poll 4689) - adds: none; removes: none; fields: fy2025-26-we3.1-image-browsing-ab-test, hcaptcha-on-french-wikipedia, xlab-mw-module-loaded-v2 - xLab/MPIC/TK tips at https://w.wiki/FwuD
For the curious, the mdot domains were removed from the API because such mdot domains are no longer in use as of the autumn (northern hemisphere) 2025, except for legacy purposes of redirects. Traffic normally now flows exclusively through what used to be termed "desktop" domains (e.g., en.wikipedia.org).
Bot logs
IRC can be lossy, and wmftkbot can miss configuration changes. If you'd like to examine what has happened in more detail, you can do the following.
ssh login.toolforge.org
become tk
ls -al mpic/toolforge/logs
cat <log>
tail -f <log of interest>
You will need privileged access to the tool in order to run these commands.
Even these application logs can be lossy, but they can still be helpful.
When there are changes the application log will show basic git unified diffs on the sorted JSON output from the Test Kitchen API to help you reason about the changes at a glance. Here, for example, manual intervention resulted in a configuration change to put the experiment in force immediately.
...varnish diff:
---
+++
@@ -1,500 +1,500 @@
{
"fy2025-26-we3.1-image-browsing-ab-test": {
"domains": {
...
},
"end": "2025-11-12T14:30:00Z",
- "start": "2025-11-08T14:30:00Z"
+ "start": "2025-11-06T14:30:00Z"
}
}
2025-11-06 23:22:48,309 - INFO - fields: xlab-mw-module-loaded-v2
2025-11-06 23:22:48,309 - INFO - IRC sending: !log Test Kitchen edge-unique experiments (poll 1536) - adds: none; removes: none; fields: xlab-mw-module-loaded-v2 - xLab/MPIC/TK tips at https://w.wiki/FwuD
2025-11-06 23:31:01,481 - INFO - Service alive with checks so far: 1560
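The diffing approach can be sketched as follows: serialize both snapshots as sorted JSON and take a unified diff. This is an illustration of the technique, not the bot's actual code:

```python
import difflib
import json

def config_diff(old: dict, new: dict) -> str:
    """Produce a git-style unified diff of two configuration snapshots by
    serializing each as sorted, indented JSON (a sketch of the approach
    the bot's logs suggest, not its exact implementation)."""
    a = json.dumps(old, indent=2, sort_keys=True).splitlines(keepends=True)
    b = json.dumps(new, indent=2, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b))

old = {"exp": {"end": "2025-11-12T14:30:00Z", "start": "2025-11-08T14:30:00Z"}}
new = {"exp": {"end": "2025-11-12T14:30:00Z", "start": "2025-11-06T14:30:00Z"}}
print(config_diff(old, new))
```

Sorting the keys before diffing ensures that only genuine value changes show up as -/+ lines, rather than incidental key reordering in the API response.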
Test Kitchen database experiment / instrument journal
The Test Kitchen database can be queried directly for those with adequate production Kubernetes access to the dse-k8s cluster configuration and the Data Lake. A nicer UI in Test Kitchen is envisioned for interrogating the changes that have occurred to experiment configuration (it could provide colorized diffs, filters for type of change, etc.). In a nutshell, the following commands can be useful for querying the journal table. You'll see that the configuration column takes on a JSON shape.
stat1010.eqiad.wmnet $ mysql -h an-mariadb1001.eqiad.wmnet -u test_kitchen_production -p
mysql> USE test_kitchen_production;
mysql> DESCRIBE instruments_history;
mysql> SELECT * FROM instruments_history WHERE <where clause here>\G
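As an illustration of working with the JSON-shaped configuration column, here is a sketch that reconstructs a change timeline from hypothetical journal rows. The column contents and field names here are assumptions, not the actual schema:

```python
import json

# Hypothetical (timestamp, configuration JSON) rows as they might be
# fetched from instruments_history; field names are illustrative only.
rows = [
    ("2025-09-01 10:00:00", '{"status": "off", "start": "2025-09-25T14:30:00Z"}'),
    ("2025-09-20 14:30:00", '{"status": "on", "start": "2025-09-25T14:30:00Z"}'),
    ("2025-11-06 23:22:00", '{"status": "on", "start": "2025-11-06T14:30:00Z"}'),
]

# Parse the JSON configuration column and order rows by timestamp to see
# how the experiment's configuration evolved over time.
timeline = [(ts, json.loads(cfg)) for ts, cfg in sorted(rows)]
for ts, config in timeline:
    print(ts, config["status"], config["start"])
```

Until the envisioned nicer UI exists (colorized diffs, filters for type of change), this kind of ad hoc post-processing of the journal table is the main way to interrogate configuration history directly.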
Components
CDN
The CDN uses the Edge Unique cookie to enroll traffic into "everyone" experiments. You can monitor this process via the SRE Traffic Team / Edge uniques Grafana dashboard, which shows:
- Edge Unique states
- Edge Unique validation failures
- The rate of requests enrolled into an experiment
If the rate of requests enrolled into experiments is zero or hasn't changed, then:
- Check that one or more experiments are present in the Test Kitchen API response for the CDN authority.
- If there are experiments in the response, then reach out to the SRE Traffic Team, as the process on the CDN nodes that fetches the experiments has stopped.
- Check that the Edge Unique states and validation failures graphs haven't changed significantly. If they have, reach out to the SRE Traffic Team.
Extension:TestKitchen
The TestKitchen MediaWiki extension:
- Forwards experiment enrollment information from the CDN to the TestKitchen SDKs
- Uses the central user ID to enroll logged-in users into logged-in experiments
You can monitor the following:
- The logs via the mediawiki OpenSearch dashboard
- The performance of Test Kitchen UI from the point of view of the app servers via the Experiment Platform / Metrics Platform Grafana dashboard
- The performance of the cache via the MediaWiki Engineering Team / WANObjectCache Key group dashboard
The Experiment Platform / Metrics Platform Grafana dashboard shows the rate of requests to the Test Kitchen API and the number of failed requests.
If the rate of requests is zero, then:
- Check that the number of failed requests is normal
- If the number of failed requests is unusually high, then reach out to DPE SRE as the extension cannot communicate with Test Kitchen UI
- Check that the extension is loaded via meta:Special:Version and that $wgMetricsPlatformEnableExperimentConfigFetching is truthy
- If the extension is loaded and configured correctly, then reach out to Experiment Platform as a bug has been introduced into the codebase, which will need to be fixed immediately
If the rate of requests is unusually high, then reach out in the #engineering-all Slack channel or in the #wikimedia-operations Slack channel as the WAN cache hit rate has decreased.
EventGate
Test Kitchen UI
Test Kitchen UI allows users to define and coordinate experiments. The CDN and Extension:TestKitchen fetch experiments via the Test Kitchen API every minute per cache node and per DC, respectively.
You can monitor the following:
- The logs via the Test Kitchen UI production OpenSearch dashboard
- The internal performance of Test Kitchen UI via the Service / Test Kitchen UI Grafana dashboard, which shows:
- The rate of all requests
- The rate of all errors
- Directly via kubectl
If the rate of all requests is zero or unusually low or unusually high, then either the CDN or Extension:TestKitchen components are misconfigured or in an error state.
Real-world Examples
(Please add examples here as you encounter them; this section serves as a recipe book for solving or triaging issues quickly.)
Discovering a Severe Data Loss Issue
FY25-26 SDS2.4.11 involved running an A/A experiment. Experiment Platform validated that events were being sent by the Test Kitchen SDKs and were arriving in the event.product_metrics_web_base Hive table. However, we didn't know what the rate of events arriving in the Hive table should be.
Part of signing off on SDS 2.4.11 was validating that EventGate didn't log any errors during the experiment. To do this, we checked:
- The EventGate validation errors OpenSearch dashboard for validation errors for events on the product_metrics.web_base event stream for the duration of the experiment
- The eventgate OpenSearch dashboard
As well as validation errors, we were interested in subject ID hoisting errors. EventGate will throw a HoistingError when:
- It can't parse the X-Experiment-Enrollment header from the CDN; or
- It can't extract the subject ID from the X-Experiment-Enrollment header for the experiment named in the experiment.assigned event property
EventGate doesn't always include the event stream name as context in its logs. However, if an error is thrown, then information about the error is included in the err.* properties. So to find the number of subject ID hoisting errors logged during the experiment, we filtered the eventgate OpenSearch dashboard for err.name: HoistingError.
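The hoisting step can be sketched as follows. The header is assumed to be JSON here purely for illustration; the real X-Experiment-Enrollment wire format may differ:

```python
import json

def extract_subject_id(header: str, experiment: str) -> str:
    """Sketch of the subject-ID hoisting step: parse a hypothetical
    X-Experiment-Enrollment header and pull out the subject ID for the
    experiment named in the event's experiment.assigned property. Both
    failure modes below correspond to EventGate's HoistingError."""
    try:
        enrollment = json.loads(header)  # failure mode 1: unparseable header
    except json.JSONDecodeError:
        raise ValueError("HoistingError: cannot parse header")
    try:
        return enrollment[experiment]["subject_id"]  # failure mode 2: no subject ID
    except (KeyError, TypeError):
        raise ValueError(f"HoistingError: no subject ID for {experiment}")

header = '{"fy2025-26-we3.1-image-browsing-ab-test": {"subject_id": "abc123"}}'
print(extract_subject_id(header, "fy2025-26-we3.1-image-browsing-ab-test"))  # abc123
```

Either failure mode produces an error surfaced under err.* in the eventgate logs, which is why filtering on err.name: HoistingError catches both.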
We found that EventGate was encountering a subject ID hoisting error for ~48.71% of events. We notified SRE Traffic and DPE immediately because the bug involved the CDN and EventGate. The bug was found to be in the X-Experiment-Enrollment header parser in EventGate. It was fixed very soon after it was reported.