Event Platform/Stream Configuration
Stream configuration refers to configuration that distributed producers or consumers of a stream might want, e.g. the sampling rate or the schema title of the events that are allowed in the stream. Stream configuration was originally a feature requested of Event Platform by Product engineers, so they could more easily vary event stream producer settings without having to do code deploys. It has since become a critical part of Event Platform, used by multiple services.
EventStreamConfig is a MediaWiki extension that implements a PHP and HTTP API for requesting stream configuration. Stream configuration entries are declared in the $wgEventStreams global list in mediawiki-config.
This centralized EventStreamConfig is used by several services to automate discovery and configuration of stream producer and consumer clients:
- EventGate service clusters use stream config to restrict which types of events are allowed in which streams via the schema_title setting.
- The MediaWiki EventLogging extension uses stream config to vary things like event stream sampling rate.
- The Analytics Cluster uses stream config to automate ingestion of streams into Hive.
- EventStreams uses stream config to discover streams and auto-generate OpenAPI docs.
Because this API is a MediaWiki extension deployed to all (most?) Wikimedia Foundation wikis, it can be requested from any wiki. Because the configuration of the streams is in mediawiki-config, specific per-wiki settings can be provided.
It is expected that 'global' configuration be requested from meta.wikimedia.org in production. You can then override things like the sample rate per wiki by configuring the override for that wiki, and then requesting the config you need from that wiki's action API URL.
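As a rough sketch of the request pattern described above, the following builds action API URLs for the global config and for one wiki's overrides. The action=streamconfigs module name, the streams parameter, and the example wiki and stream name are assumptions made for illustration, not taken from this page; check the EventStreamConfig README for the actual parameters.

```python
from urllib.parse import urlencode


def stream_config_url(wiki_host, streams=None):
    """Build an action API URL for requesting stream config from a wiki.

    Assumption: the API module is exposed as action=streamconfigs and
    accepts a pipe-separated `streams` parameter.
    """
    params = {"action": "streamconfigs", "format": "json"}
    if streams:
        # Action API multi-value parameters are pipe-separated.
        params["streams"] = "|".join(streams)
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}"


# 'Global' configuration comes from meta.wikimedia.org.
global_url = stream_config_url("meta.wikimedia.org")

# Per-wiki overrides are requested from that wiki's own action API URL.
# Both the wiki and the stream name here are hypothetical examples.
per_wiki_url = stream_config_url("de.wikipedia.org", ["my.example.stream"])
```

Requesting the same stream from two hosts and comparing the results is one way to see which settings a wiki overrides.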
See the EventStreamConfig README.
Common Settings Documentation
In lieu of a better place, we'll try to document some of the common stream config settings here.
$wgEventStreams is keyed by stream name. The stream name is also available as the stream setting in API results.
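To illustrate the relationship between the config key and the stream setting, here is a minimal sketch. The to_api_results helper and the example entry are hypothetical, and the real API response may wrap or shape the results differently.

```python
def to_api_results(event_streams):
    """Copy each stream's key into a `stream` setting on its entry.

    `event_streams` stands in for a map keyed by stream name; the exact
    response shape of the real API is not specified here.
    """
    return {
        name: {**settings, "stream": name}
        for name, settings in event_streams.items()
    }


# Hypothetical stream entry, for illustration only.
results = to_api_results({"my.example.stream": {"schema_title": "my/example"}})
```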
schema_title must match exactly the title of the event JSONSchema that is allowed in this stream.
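A minimal sketch of that exact-match check, assuming the stream config and the JSONSchema are both available as plain dicts. The function name and the example values are hypothetical; this is not how EventGate itself is implemented.

```python
def schema_title_allowed(stream_config, schema):
    """Return True only if the schema's title exactly equals schema_title."""
    expected = stream_config.get("schema_title")
    # A stream with no schema_title set allows nothing in this sketch.
    return expected is not None and schema.get("title") == expected


# Hypothetical values, for illustration only.
config = {"schema_title": "my/example"}
ok = schema_title_allowed(config, {"title": "my/example"})
rejected = schema_title_allowed(config, {"title": "My/Example"})
```

The comparison is deliberately case-sensitive and does no normalization, since the titles have to match exactly.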
This refers to the name of the EventGate HTTP event intake service the stream should be produced through. Producer clients use this to figure out where to send the stream. The EventGate services also use this to determine if a stream is allowed to be produced through them.
NOTE: This should one day be moved into a producers config subobject.
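The two uses of this setting (routing on the producer side, allow-listing on the service side) can be sketched as follows. The setting key destination_event_service, the service names, and the URLs are all assumptions made for illustration; none of them are given on this page.

```python
# Assumed key name for the setting described above.
SETTING = "destination_event_service"


def intake_url(stream_config, service_urls):
    """Producer side: look up where events for this stream should be sent."""
    return service_urls[stream_config[SETTING]]


def may_produce(stream_config, service_name):
    """Service side: is this stream allowed to be produced through us?"""
    return stream_config.get(SETTING) == service_name


# Hypothetical service names and URLs.
service_urls = {
    "example-intake-a": "https://intake-a.example.org/v1/events",
    "example-intake-b": "https://intake-b.example.org/v1/events",
}
config = {SETTING: "example-intake-a"}
url = intake_url(config, service_urls)
```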
The canary_events_enabled setting aids in monitoring ingestion pipelines for event streams. If this is true (the default if not set), artificial canary events will periodically be produced into the stream. The canary events are created from the first event example in the schema, but with meta.dt set to the current timestamp and with meta.domain: "canary". Consumers of streams with canary_events_enabled: true should filter out all events where meta.domain == "canary".
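The canary behaviour described above can be sketched from both sides: how a canary event might be derived from a schema, and how a consumer might drop canaries. This assumes events and schemas are plain dicts and that event examples live under the JSONSchema examples keyword; the function names and the timestamp format are illustrative, not the actual producer implementation.

```python
import copy
from datetime import datetime, timezone


def make_canary_event(schema, now=None):
    """Build a canary event from the first example in the schema."""
    event = copy.deepcopy(schema["examples"][0])
    now = now or datetime.now(timezone.utc)
    meta = event.setdefault("meta", {})
    # Overwrite the example's timestamp with the current time in UTC.
    meta["dt"] = now.isoformat(timespec="seconds").replace("+00:00", "Z")
    meta["domain"] = "canary"
    return event


def is_canary(event):
    """True for events where meta.domain == "canary"."""
    return event.get("meta", {}).get("domain") == "canary"


def drop_canaries(events):
    """Consumer side: keep only real (non-canary) events."""
    return [event for event in events if not is_canary(event)]
```

A consumer would typically apply drop_canaries as early as possible, so that canary events never reach counts or aggregates.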
These subobject config settings should be used to configure specific clients that produce or consume this stream. The keys in each subobject should be the names of the clients. Clients look up their configuration from the API by this name.
As of 2021-09, this is only used for the Analytics Hadoop ingestion pipeline. See also https://phabricator.wikimedia.org/T273235.
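As a sketch of the lookup-by-name pattern, a client could read its own settings like this. The consumers and producers key names, the client name, and the settings inside it are assumptions for illustration; the actual names used by the Hadoop ingestion pipeline are not given here.

```python
def client_settings(stream_config, role, client_name):
    """Return the settings for one named client, or {} if none are set.

    `role` is the subobject to look in, assumed to be "consumers" or
    "producers".
    """
    return stream_config.get(role, {}).get(client_name, {})


# Hypothetical stream config with one configured consumer client.
config = {
    "consumers": {
        "example_ingestion_client": {"enabled": True},
    },
}
settings = client_settings(config, "consumers", "example_ingestion_client")
```

Returning an empty dict for unknown clients lets each client fall back to its own defaults when a stream does not configure it.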