Metrics Platform/How to/Creating a Stream Configuration

The development of the Metrics Platform, organized in three phases, is ongoing. You can learn what MP components are currently available for use.
If you want to observe a Beta Cluster stream using the Beta Cluster eventstreams-ui tool, you must ensure that the stream is defined for the Beta Cluster Meta-Wiki, as determined by the rules explained in Configuration files. The tool fetches its stream configurations from that Meta-Wiki.
If you want to create a stream on the Beta Cluster that is not configured on Production (or is configured differently on Production), you need to declare that stream in InitialiseSettings-labs.php. In some cases, you may need to prefix your InitialiseSettings-labs.php configuration key with '-', as explained in Configuration files.

Metrics Platform Stream Declarations

Metrics Platform streams are declared in $wgEventStreams, like other Event Platform streams, but with

  • the stream's schema_title property set to analytics/product_metrics/web/base (the Metrics Platform schema), and with
  • a metrics_platform_client element included in the stream's producers property. (Note that additional producers may be included, so long as their output conforms to the Metrics Platform schema.)

The metrics_platform_client element, in turn, may include the following optional properties:

  • provide_values: the contextual values that should be added into the event before it is submitted to this stream. See metrics_platform_client.schema.json for a complete list of contextual values. If provide_values is not present, none of the contextual values will be provided. (The stream's events will still contain $stream, name, dt, and possibly custom data, as described in the documentation for submit, at Metrics Platform/Implementations#API.)
  • curation: a list of curation rules, as discussed in a later section.

These elements (except for curation) are illustrated in Example 1.

General documentation for stream configuration is at Event Platform/Stream Configuration and Wikimedia_Product/Analytics_Infrastructure/Stream_configuration.

Example 1

This example shows the declaration of a default Metrics Platform stream my.stream, as it might appear in operations/mediawiki-config/wmf-config/ext-EventStreamConfig.php. It illustrates the schema_title and metrics_platform_client declaration elements discussed above.

In addition, this example shows how you can set the default sampling rate to 0 and the default sampling unit to pageview, and then give foowiki its own sampling rate of 0.2. Sampling configuration for Metrics Platform streams is no different from that of other Event Platform streams: Metrics Platform code samples events in accordance with the relevant stream configuration. Additional details about sampling units are available at Metrics Platform/Sampling Units. To learn more about sampling configuration, see Wikimedia_Product/Analytics_Infrastructure/Stream_configuration#Sampling_settings.

Additional information about configuration formats is available in Configuration files.

<?php

// …

'wgEventStreams' => [

    // Define the stream for all wikis in production and on the Beta Cluster
    // including the Meta-Wiki, which means that you can observe events flowing
    // on it using the eventstreams-ui tool.
    'default' => [
        'my.stream' => [
        
            // The Metrics Platform web base schema.
            'schema_title' => 'analytics/product_metrics/web/base',

            'destination_event_service' => 'eventgate-analytics-external',

            'producers' => [
                'metrics_platform_client' => [

                    // The contextual values that should be mixed into the event
                    // before it is submitted to this stream.
                    'provide_values' => [
                        'agent_client_platform',
                        'agent_client_platform_family',
                        'mediawiki_database',
                        'mediawiki_is_production',
                    ],
                ],
            ],
        
            // Do not submit events to this stream by default.  Sampling
            // rates are set below, as needed, for each wiki.
            'sample' => [
                'unit' => 'pageview',
                'rate' => 0,
            ],
        ],
    ],
    
    // …
    
    // Use a sampling rate of 0.2 for my.stream on foowiki.  (Instead of a wiki,
    // this could also be a dblist, e.g. group0, group1, etc.)
    '+foowiki' => [
        'my.stream' => [
            'sample' => [
                'rate' => 0.2,
            ],
        ],
    ],
],

// …
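
As a rough illustration of the override in Example 1, and assuming that the +foowiki values are merged recursively over the default values (Configuration files has the authoritative rules), the configuration that applies to my.stream on foowiki would be equivalent to the fragment below. It is a sketch of the merged result, not a declaration you would write yourself.

```php
<?php

// Illustration only: the hypothetical merged configuration for my.stream on
// foowiki, assuming '+foowiki' is merged recursively over 'default'.
'my.stream' => [
    'schema_title' => 'analytics/product_metrics/web/base',
    'destination_event_service' => 'eventgate-analytics-external',
    'producers' => [
        'metrics_platform_client' => [
            'provide_values' => [
                'agent_client_platform',
                'agent_client_platform_family',
                'mediawiki_database',
                'mediawiki_is_production',
            ],
        ],
    ],
    'sample' => [
        'unit' => 'pageview', // inherited from 'default'
        'rate' => 0.2,        // overridden by '+foowiki'
    ],
],
```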

Example 2

This example shows how you could set a sampling rate of 1 for foowiki on the Beta Cluster, using a +foowiki element in operations/mediawiki-config/wmf-config/InitialiseSettings-labs.php (which would override the +foowiki element declared above):

<?php

// …

'wgEventStreams' => [

    // …

    // Submit all events to this stream on foowiki on the Beta Cluster,
    // overriding the rate of 0.2 set for foowiki in Example 1.
    '+foowiki' => [
        'my.stream' => [
            'sample' => [
                'rate' => 1,
            ],
        ],
    ],

    // …

],

// …

Curation Rules

The Metrics Platform supports the specification of curation rules, which provide conditional filtering of events. Curation rules are specified using the (optional) curation property of the metrics_platform_client producer, for a particular stream. Each curation rule specifies a simple condition that must be met by an event for that event to be submitted to the stream. An event will only be submitted to a stream if all curation rules evaluate to true for that event.

Each curation rule is associated with a contextual attribute, and has two parts: an operator and an operand. When the value of the contextual attribute (for a particular event) is combined with the operator and operand of a curation rule, it forms a simple Boolean expression to be evaluated by Metrics Platform code. For example, the curation element shown below associates one rule with the contextual attribute page_namespace_name. The operator of the rule is equals, and its operand is 'Talk'. When this rule is evaluated for a particular event, Metrics Platform code first obtains the value of the contextual attribute. If that value is 'Talk', the rule evaluates to true; otherwise it evaluates to false.

As another example, the curation element below associates two rules with the contextual attribute page_id. The first rule uses the operator less_than and the operand 500. The second rule uses the operator not_equals and the operand 42. Taken together, these two rules mean that an event will only be submitted if its page_id is less than 500 and not equal to 42.

Example 3

For this example, we have copied Example 1, omitted some comments and details, and added a curation element.

<?php

// …

'wgEventStreams' => [
    'default' => [
        'my.stream' => [
            'schema_title' => 'analytics/product_metrics/web/base',
            'destination_event_service' => 'eventgate-analytics-external',

            'producers' => [
                'metrics_platform_client' => [
                    'provide_values' => [  ],

                    'curation' => [
                        'page_namespace_name' => [
                            'equals' => 'Talk'
                        ],
                        'performer_is_logged_in' => [
                            'equals' => true
                        ],
                        'page_id' => [
                            'less_than' => 500,
                            'not_equals' => 42
                        ],
                        'performer_edit_count_bucket' => [
                            'in' => [ '100-999 edits', '1000+ edits' ]
                        ],
                        'performer_groups' => [
                            'contains_all' => [ 'user', 'autoconfirmed' ],
                            'does_not_contain' => 'sysop'
                        ],
                    ],
                ],
            ],
        ],
    ],

    // …

],

// …

The operator of a rule can be any one of these: equals, not_equals, less_than, greater_than, greater_than_or_equals, less_than_or_equals, in, not_in, contains, does_not_contain, contains_all, contains_any.

Operands can be primitive values (strings, numbers, Boolean values, or null), or, in some cases, an array of primitive values. For each rule, the appropriate operand type(s) depends primarily on the operator, but sometimes also on the contextual attribute the rule is associated with. For example, if operator equals is used with contextual attribute page_id it only makes sense for the operand to be a number, but if equals is used with page_namespace_name it only makes sense for the operand to be a string. Arrays of primitive values are appropriate for use with in, not_in, contains_all, and contains_any.

See metrics_platform_client.schema.json#61 for the formal declaration of the available operators, and the (syntactically) allowed operand types for each operator. (Note that the word operator is not used in the schema file; property is used instead.)
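
To make the evaluation semantics concrete, here is a minimal PHP sketch of how the rules from Example 3 could be applied to an event. It is not the Metrics Platform implementation: the function names evaluateRule and passesCuration are hypothetical, the contextual attributes are represented as a flat array for simplicity, the meaning of each operator is inferred from its name, and treating a missing attribute as a failed rule is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: evaluates one curation rule against the value of a
 * contextual attribute. Operator semantics are inferred from their names.
 */
function evaluateRule( $value, string $operator, $operand ): bool {
    switch ( $operator ) {
        case 'equals':
            return $value === $operand;
        case 'not_equals':
            return $value !== $operand;
        case 'less_than':
            return $value < $operand;
        case 'greater_than':
            return $value > $operand;
        case 'less_than_or_equals':
            return $value <= $operand;
        case 'greater_than_or_equals':
            return $value >= $operand;
        case 'in':
            // The attribute value must be one of the listed operands.
            return in_array( $value, $operand, true );
        case 'not_in':
            return !in_array( $value, $operand, true );
        case 'contains':
            // The attribute value is a list that must include the operand.
            return in_array( $operand, $value, true );
        case 'does_not_contain':
            return !in_array( $operand, $value, true );
        case 'contains_all':
            return array_diff( $operand, $value ) === [];
        case 'contains_any':
            return array_intersect( $operand, $value ) !== [];
        default:
            throw new InvalidArgumentException( "Unknown operator: $operator" );
    }
}

/**
 * Hypothetical helper: an event passes only if every rule for every
 * attribute evaluates to true.
 */
function passesCuration( array $attributes, array $curation ): bool {
    foreach ( $curation as $attribute => $rules ) {
        // Assumption: an attribute that is absent from the event fails its rules.
        if ( !array_key_exists( $attribute, $attributes ) ) {
            return false;
        }
        foreach ( $rules as $operator => $operand ) {
            if ( !evaluateRule( $attributes[$attribute], $operator, $operand ) ) {
                return false;
            }
        }
    }
    return true;
}

// The curation element from Example 3.
$curation = [
    'page_namespace_name' => [ 'equals' => 'Talk' ],
    'performer_is_logged_in' => [ 'equals' => true ],
    'page_id' => [ 'less_than' => 500, 'not_equals' => 42 ],
    'performer_edit_count_bucket' => [ 'in' => [ '100-999 edits', '1000+ edits' ] ],
    'performer_groups' => [
        'contains_all' => [ 'user', 'autoconfirmed' ],
        'does_not_contain' => 'sysop',
    ],
];

// Made-up contextual attribute values for a single event.
$event = [
    'page_namespace_name' => 'Talk',
    'performer_is_logged_in' => true,
    'page_id' => 41,
    'performer_edit_count_bucket' => '100-999 edits',
    'performer_groups' => [ 'user', 'autoconfirmed' ],
];

var_dump( passesCuration( $event, $curation ) );                       // bool(true)
var_dump( passesCuration( [ 'page_id' => 42 ] + $event, $curation ) ); // bool(false)
```

The second call fails only the not_equals rule on page_id, which is enough to reject the event because all rules must hold.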