Experimentation Lab/Incident reports/2025-09-17 MinT for Readers AA test missing subject IDs

From Wikitech
Status: Closed
Severity: Low
Incident coordinators: Mikhail Popov, Sam Smith
Incident response team: Experiment Platform
Date detected: 2025-09-17
Date resolved: 2025-09-17

Summary

Detection, root cause analysis, and resolution all occurred on 2025-09-17:

  • We noticed that the MinT for Readers A/A test only had 1 subject in each group. Investigation revealed that the subject IDs were not being populated.
  • We identified the cause as a misconfigured stream (custom stream created for this particular experiment).
  • We submitted and deployed the fix.

We strongly recommend validating stream configuration against a schema and requiring changes to stream configuration to pass schema validation before they are allowed to be merged. This is captured in T405516.

Background

On September 17th, 2025, we noticed that the MinT for Readers A/A test (xLab), active since September 16th, had only 1 subject in each group according to the dashboard (Superset). This was very unusual because the experiment was active on 13 Wikipedias with a 10% traffic allocation rate and the instrumentation is a simple page visit event.

We queried the raw interaction data in the event.mediawiki_product_metrics_translation_mint_for_readers_experiments table and saw that the ~265K events collected all had experiment.subject_id set to "awaiting" – which is what we set it to for edge unique-based experiments before sending the event through a special path that populates the value using the wmf-uniq cookie.
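The single-subject count on the dashboard is consistent with this finding: if subjects are counted as distinct experiment.subject_id values per group, a shared placeholder collapses every group to one subject. The following is a minimal Python sketch of that arithmetic, assuming a simplified event shape. The `group` key, the group names, and the sample events are hypothetical and do not reflect the actual schema or the dashboard's real query.

```python
from collections import defaultdict

# Placeholder set on the event before the subject ID is populated.
PLACEHOLDER = "awaiting"


def distinct_subjects_per_group(events):
    """Count distinct subject IDs per experiment group."""
    subjects = defaultdict(set)
    for event in events:
        subjects[event["group"]].add(event["subject_id"])
    return {group: len(ids) for group, ids in subjects.items()}


# Hypothetical sample: several events, none with a populated subject ID.
events = [
    {"group": "control", "subject_id": PLACEHOLDER},
    {"group": "control", "subject_id": PLACEHOLDER},
    {"group": "treatment", "subject_id": PLACEHOLDER},
    {"group": "treatment", "subject_id": PLACEHOLDER},
]

counts = distinct_subjects_per_group(events)
```

However many events arrive, each group reports a single subject as long as every event carries the same placeholder value.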

Root cause

The root cause was that the MinT for Readers A/A experiment stream (mediawiki.product_metrics.translation_mint_for_readers.experiments) was subtly misconfigured. EventGate, the event service, must be explicitly configured to hoist experiment subject IDs from the X-Experiment-Enrollments header into the experiment.subject_id property of the events flowing on the stream. That EventGate configuration must be in the $.producers.eventgate stanza. In the MinT for Readers experiment stream, however, the use_edge_uniques setting was in a top-level $.eventgate stanza instead (the last stanza in the listing below):

// https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/acad29faf2f945642ef0f45cf5a9569b41a10fba/wmf-config/ext-EventStreamConfig.php#2388
'mediawiki.product_metrics.translation_mint_for_readers.experiments' => [
	'schema_title' => 'analytics/product_metrics/web/translation',
	'destination_event_service' => 'eventgate-analytics-external',
	'producers' => [
		'eventgate' => [
			'enrich_fields_from_http_headers' => [
				'http.request_headers.user-agent' => false,
			],
		],
		'metrics_platform_client' => [
			'provide_values' => [
				'mediawiki_database',
				'mediawiki_skin',
				'mediawiki_site_content_language',
				'mediawiki_site_content_language_variant',
				'page_content_language',
				'agent_client_platform',
				'agent_client_platform_family',
				'performer_session_id',
				'performer_active_browsing_session_token',
				'performer_is_logged_in',
				'performer_is_temp',
				'performer_language',
				'performer_language_variant',
				'performer_pageview_id',
			],
		],
	],
	'eventgate' => [
		'enrich_fields_from_http_headers' => [
			// Don't collect the user agent
			'http.request_headers.user-agent' => false,
		],
		'use_edge_uniques' => true,
	],
],

Identifying the root cause hinged on the understanding that (invalid) experiment-related analytics events were in the correct Hive table. For this to be occurring, the following must be true:

  1. The Varnish Experiment Enrollment Sampling Authority is enrolling devices and assigning them to groups;
  2. The MetricsPlatform MediaWiki extension is forwarding those assignments to the xLab JS SDK;
  3. The xLab JS SDK is sending experiment-related analytics events to the /evt-103e/v1/events path; and
  4. The /evt-103e/v1/events path is checking whether the device is assigned a group and forwarding the experiment-related analytics events to EventGate.

With the above in mind, we checked the EventGate logs and saw the following:

Stream config setting 'producers.eventgate.use_edge_uniques' is disabled for event of schema at /analytics/product_metrics/web/translation/1.4.2 destined to stream mediawiki.product_metrics.translation_mint_for_readers.experiments, but x-experiment-enrollments request header is set. Ignoring x-experiment-enrollments.

This message is logged when an experiment-related analytics event is sent to EventGate but is destined for a stream for which experiment subject ID hoisting is disabled.
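The behaviour that this log message describes can be modelled in a few lines. The sketch below is a simplified Python illustration, not EventGate's actual code: the function name is hypothetical, the header value is a placeholder, and only the setting path and header name are taken from the log message. It assumes the setting is treated as disabled when it is absent from $.producers.eventgate.

```python
def should_hoist_subject_id(stream_config, request_headers):
    """Return True if the subject ID would be populated from the header.

    Simplified model: the setting is read only from
    producers.eventgate.use_edge_uniques and counts as disabled when absent.
    """
    enabled = (
        stream_config.get("producers", {})
        .get("eventgate", {})
        .get("use_edge_uniques", False)
    )
    has_header = "x-experiment-enrollments" in request_headers
    return bool(enabled) and has_header


# Mirrors the misconfigured stream: the flag sits in a top-level stanza.
misconfigured = {
    "producers": {"eventgate": {}},
    "eventgate": {"use_edge_uniques": True},
}

# Assumed shape of the corrected configuration.
fixed = {"producers": {"eventgate": {"use_edge_uniques": True}}}

# Placeholder header value; the real format is not shown in this report.
headers = {"x-experiment-enrollments": "placeholder"}

hoisted_before_fix = should_hoist_subject_id(misconfigured, headers)
hoisted_after_fix = should_hoist_subject_id(fixed, headers)
```

Under this model the flag in the top-level stanza is never consulted, so the header is ignored until the setting moves under $.producers.eventgate.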

Resolution

We deactivated the experiment in xLab and wrote a patch to fix the stream configuration. We deployed the patch in a backport on the same day, then moved the experiment's start date to September 19th and shifted the end date accordingly.

Recommendations

Schema validation

To safeguard against similar incidents, we strongly recommend that wgEventStreams (source) have a schema and that a test validate the stream configuration against it, so that a patch cannot be merged if the resulting stream configuration fails schema validation. This is captured in T405516.
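As a rough illustration of the kind of check this recommendation describes, the Python sketch below flags unknown keys in a stream's configuration. The allowed-key sets are an illustrative subset taken from the configuration shown in this report, not the real list of supported settings, and the function name is hypothetical. An actual test would more likely live alongside the PHP configuration and validate against a proper schema.

```python
# Illustrative subsets only; the real set of supported settings is larger.
ALLOWED_TOP_LEVEL = {"schema_title", "destination_event_service", "producers"}
ALLOWED_PRODUCERS = {"eventgate", "metrics_platform_client"}


def find_unknown_keys(stream_name, config):
    """Return one error message per key that the (assumed) schema rejects."""
    errors = []
    for key in config:
        if key not in ALLOWED_TOP_LEVEL:
            errors.append(f"{stream_name}: unknown key $.{key}")
    for key in config.get("producers", {}):
        if key not in ALLOWED_PRODUCERS:
            errors.append(f"{stream_name}: unknown key $.producers.{key}")
    return errors


stream = "mediawiki.product_metrics.translation_mint_for_readers.experiments"

# Python mirror of the relevant parts of the misconfigured stream.
misconfigured = {
    "schema_title": "analytics/product_metrics/web/translation",
    "destination_event_service": "eventgate-analytics-external",
    "producers": {"eventgate": {}, "metrics_platform_client": {}},
    "eventgate": {"use_edge_uniques": True},
}

errors = find_unknown_keys(stream, misconfigured)
```

A test that fails whenever `errors` is non-empty would have blocked the misplaced stanza before it was merged, under the assumption that a top-level eventgate key is not part of the schema.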

Documentation improvements

We also recommend updating the documentation, because the Edge Uniques part of the stream configuration is only mentioned at

and not at all at