Metrics Platform/Client

From Wikitech
This page is currently in archived status. It is not currently maintained, and some content may be out of date. After some of its relevant content has been moved to other pages, this page will be deleted.

An Event Platform Client is one of a family of software libraries that produce events from a client application, such as the MediaWiki browser environment, the Android Wikipedia app, or the iOS Wikipedia app. The libraries conform to a common specification, so that event production is portable across platforms.

Event production

An event is a JSON string containing a stream name, event data, and event metadata.

Production of an event begins when the client application signals the occurrence of a software event to the library by calling the public method submit(streamName, eventData). The application code that does this is called an instrument, or instrumentation. The combination of stream name and event data passed to the library is called the pre-event.

Once submit() is called, the library uses the stream configuration to determine whether to process the pre-event based on any applicable sampling and filtering rules. If it will be processed, the library then computes the event metadata and formats the event data and metadata together into a JSON string called the event object, or simply event. The library then schedules the event for transmission to a destination EventGate server. Production of an event ends when the event is transmitted to the network.
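The production steps described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the stream names, schema URIs, the "sample_rate" key, the OUTPUT_QUEUE list, and the all-or-nothing in_sample() stub are assumptions made for the example.

```python
import json
from datetime import datetime, timezone

# Hypothetical stream configuration; the keys and values are illustrative
# assumptions, not the real stream configuration format.
STREAM_CONFIGS = {
    "test.click": {"$schema": "/analytics/test/click/1.0.0", "sample_rate": 1.0},
    "test.muted": {"$schema": "/analytics/test/muted/1.0.0", "sample_rate": 0.0},
}

# Stands in for the scheduler that would transmit events to EventGate.
OUTPUT_QUEUE = []


def in_sample(stream_config):
    # Stub: treats the rate as all-or-nothing. A real client would apply
    # the stream's sampling and filtering rules here.
    return stream_config.get("sample_rate", 1.0) >= 1.0


def submit(stream_name, event_data):
    """Turn a pre-event into an event and schedule it, or drop it."""
    config = STREAM_CONFIGS.get(stream_name)
    if config is None or not in_sample(config):
        return None  # unknown stream or out-of-sample: nothing is produced

    event = dict(event_data)  # event data is copied through unmodified
    event["$schema"] = config["$schema"]
    event["meta"] = {
        "dt": datetime.now(timezone.utc).isoformat(),
        "stream": stream_name,
    }
    serialized = json.dumps(event)  # the event object is a JSON string
    OUTPUT_QUEUE.append(serialized)  # scheduling stands in for transmission
    return serialized
```

Calling submit("test.click", {"action": "open"}) returns the JSON string and appends it to OUTPUT_QUEUE, while a call on "test.muted" or on an unconfigured stream returns None and schedules nothing.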

What is an event?

Data

Event data refers to properties which are computed by the instrument at the time the software event is observed, and passed to the library in submit().

Compare event metadata, which are properties set internally by the Event Platform Client Library during production of the event object.

The properties that comprise event data will be different for different streams. Their structure, types, and format are defined by the event schema associated with the stream. A stream identifies which schema its events will conform to using the $schema field of its stream configuration block.

Event data is not modified in any way by the library, which passes on exactly what it is given. It is therefore the responsibility of the instrumentation code (not the library) to test and ensure that the data provided conforms to the event schema.

All values must map to JSON-compatible types. Depending on the platform's type system, type conversion may take place.
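As an illustration of what such a conversion might look like on one platform, the sketch below maps a few Python values onto JSON-compatible types at serialization time. The field names and the choice of conversions (timestamps to ISO 8601 strings, sets to sorted lists) are assumptions for the example and are not taken from the specification.

```python
import json
from datetime import datetime, timezone


def to_json_compatible(value):
    """Fallback used by json.dumps for values it cannot serialize natively."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    # Anything else has no obvious JSON mapping, so fail loudly.
    raise TypeError(f"{type(value).__name__} has no JSON-compatible mapping")


# Hypothetical event data containing platform-specific types.
event_data = {
    "action": "open",
    "position": (3, 4),  # tuples are serialized as JSON arrays
    "tags": {"b", "a"},  # sets need an explicit conversion
    "opened_at": datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
}

serialized = json.dumps(event_data, default=to_json_compatible)
```

Raising on unknown types, rather than silently stringifying them, keeps the instrument responsible for sending only data the schema can describe.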

Metadata

Event metadata refers to properties which are computed automatically by the library and added to the event during production.

Compare event data, which are properties set by the instrumentation code, and which vary from stream to stream depending on the event schema associated with the stream.

Event metadata properties are defined by a common schema fragment which is inherited by all event schemas representing events produced by this client library. Therefore the structure and type of these properties will not vary from stream to stream. However, the presence and value of certain properties may depend on how the stream is configured.


$schema:     <uri> reference to versioned schema,
meta: {
    dt:      <string> ISO 8601 timestamp,
    stream:  <string> stream name,
},
id_pageview: ?<string> 20-character random hexadecimal string,
id_session:  ?<string> 20-character random hexadecimal string,
id_sequence: ?<string> 20-character random hexadecimal string,
id_device:   ?<string> 20-character random hexadecimal string,
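A minimal sketch of how a client might populate these properties is shown below. The helper names new_id() and build_metadata(), and the idea of passing the identifiers in as a dictionary, are assumptions for illustration. Only the property names and the 20-character hexadecimal format come from the fragment above.

```python
import secrets
from datetime import datetime, timezone

# Optional identifier properties from the common schema fragment.
OPTIONAL_IDS = ("id_pageview", "id_session", "id_sequence", "id_device")


def new_id():
    """Return a 20-character random hexadecimal string (10 random bytes)."""
    return secrets.token_hex(10)


def build_metadata(stream_name, schema_uri, ids):
    """Assemble the metadata properties for one event.

    `ids` maps identifier names to values; an identifier that is absent
    or None is omitted, since these properties are optional.
    """
    metadata = {
        "$schema": schema_uri,
        "meta": {
            "dt": datetime.now(timezone.utc).isoformat(),
            "stream": stream_name,
        },
    }
    for name in OPTIONAL_IDS:
        if ids.get(name) is not None:
            metadata[name] = ids[name]
    return metadata
```

For example, build_metadata("test.click", "/analytics/test/click/1.0.0", {"id_session": new_id()}) yields metadata with id_session set and the other identifiers left out.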

Vocabulary

software event
the thing that happened in the software to trigger the instrumentation.
pre-event
the tuple of stream name and event data that the instrument provides to the library.
event (object)
the formatted JSON string combining the event data and event metadata.

Sampling

Consequences of using deterministic identifiers for sampling
  • On a given client, if a stream is in-sample (out-sample), then any stream with identical scope (e.g., pageview, session) and sampling rate will also be in-sample (out-sample).
  • On a given client, if a stream with scope s is in-sample (out-sample), then it is in-sample (out-sample) for the duration of s.
    • So on a given client, a stream 'foo' with sample rate 0.5 does not simply send events 50% of the time. If it has scope e.g. 'pageview', then during 50% of pageviews on that client, the stream will receive all 'foo' events sent to it, and during the other 50% of pageviews, it will receive none.
    • A stream 'foo' having scope 'session' and sample rate 0.5 will receive every 'foo' event generated by 50% of sessions, not 50% of the 'foo' events generated in every session.
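The consequences listed above follow from making the in-sample decision a pure function of the scope identifier and the sampling rate. The sketch below assumes one possible mapping (the first 8 hexadecimal digits of the identifier scaled into the interval [0, 1)). The function names and the example identifiers are hypothetical.

```python
def in_sample(identifier: str, rate: float) -> bool:
    """Deterministic in-sample decision for a scope identifier.

    Assumption: the first 8 hex digits are read as an integer and scaled
    into [0, 1); the identifier is in-sample when that value is below
    the sampling rate.
    """
    unit_value = int(identifier[:8], 16) / 0x100000000
    return unit_value < rate


def events_received(scope_id: str, rate: float, submitted: list) -> list:
    """Events a stream receives during one scope (e.g. one pageview)."""
    return list(submitted) if in_sample(scope_id, rate) else []


# Hypothetical pageview identifiers (20 hex characters each).
pageview_a = "1a2b3c4d5e6f70819203"  # maps to roughly 0.10
pageview_b = "e4d3c2b1a09f8e7d6c5b"  # maps to roughly 0.89

foo_events = ["foo-1", "foo-2", "foo-3"]
during_a = events_received(pageview_a, 0.5, foo_events)  # all three events
during_b = events_received(pageview_b, 0.5, foo_events)  # no events
```

Because the decision depends only on the identifier and the rate, any other stream with the same scope and rate reaches the same decision on that client, and the decision cannot change until the identifier does.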

Two models. Model one is to view a session or device as a proxy for an individual user. Sampling on a session or device ID is then a proxy for sampling an individual from the universe of individuals. Pageviews/screens are not adequate proxies for individuals.

Per-pageview/screen sampling

  • Consider every pageview as an independent event
    • A new pageview ID is generated on every navigation transition, including:
      • Refreshing the page
      • Returning to a page via a back or forward button
      • Opening the page again in the same window or tab
      • Opening the page again in a different window or tab
  • Actions which occur over multiple pageviews cannot be correlated.
  • Actions which occur together within a single pageview can be located by aggregating by pageview id
    • When aggregating by pageview id, it can be assumed that these actions were performed by a single user.

Per-session sampling

  • Consider every session as an independent event
    • Depending on session duration, allows a more complete picture of the user
  • Actions which occur over multiple pageviews can be correlated
  • Questions about the number and duration of pageviews can be answered
  • Actions which occur together within a single session can be located by aggregating by session id
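As a sketch of the kind of analysis per-session sampling makes possible, the function below groups events by session ID and estimates the number and duration of pageviews within each session. The function name and the sample events are hypothetical, and the durations rely on client-reported timestamps, so they should be treated as approximate.

```python
from collections import defaultdict
from datetime import datetime


def pageview_durations_by_session(events):
    """Map each session ID to {pageview ID: seconds between first and last event}."""
    spans = defaultdict(dict)  # session -> pageview -> (first_seen, last_seen)
    for event in events:
        seen_at = datetime.fromisoformat(event["meta"]["dt"])
        pageviews = spans[event["id_session"]]
        first, last = pageviews.get(event["id_pageview"], (seen_at, seen_at))
        pageviews[event["id_pageview"]] = (min(first, seen_at), max(last, seen_at))
    return {
        session: {
            pageview: (last - first).total_seconds()
            for pageview, (first, last) in pageviews.items()
        }
        for session, pageviews in spans.items()
    }


# Hypothetical events; real identifiers are 20-character hex strings.
sample_events = [
    {"id_session": "s1", "id_pageview": "p1", "meta": {"dt": "2020-01-01T00:00:00+00:00"}},
    {"id_session": "s1", "id_pageview": "p1", "meta": {"dt": "2020-01-01T00:00:30+00:00"}},
    {"id_session": "s1", "id_pageview": "p2", "meta": {"dt": "2020-01-01T00:01:00+00:00"}},
    {"id_session": "s2", "id_pageview": "p3", "meta": {"dt": "2020-01-01T00:05:00+00:00"}},
]

durations = pageview_durations_by_session(sample_events)
```

The number of pageviews in a session is then len(durations[session_id]). A pageview with a single event has a duration of zero, which reflects the limits of the data rather than the time the user actually spent.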

Per-(pseudo-)device sampling

In both the iOS and Android apps, the app install ID is created if and when the user opts in to tracking, and deleted if the user opts out again. If the user opts in again after previously opting out, a new app install ID is created. The app install ID should not be interpreted as reflecting the entire lifetime of an app installation.
  • Typically used for mobile apps, not web
  • Consider every (pseudo-)device as an independent event
    • Depending on the lifetime of the identifier, allows a more complete picture of the user
  • Actions which occur over multiple sessions can be correlated
  • Questions about the number and duration of sessions can be answered
  • Actions which occur together within a single device's lifetime can be located by aggregating by device id
  • We are not really able to talk about 'devices' because identifiers are too transient for technical and privacy reasons. It is better to think of this as a 'super-session', 'long-session', or 'multi-session'.
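The opt-in and opt-out lifecycle described above can be sketched as a small state holder. The class and method names are hypothetical, and the 20-character hexadecimal format is an assumption borrowed from the id_device property. The apps may generate and store their install IDs differently.

```python
import secrets


class AppInstallId:
    """Holds an install identifier that exists only while the user is opted in."""

    def __init__(self):
        self._value = None  # no identifier until the user opts in

    @property
    def value(self):
        return self._value

    def opt_in(self):
        # Create the identifier if and when the user opts in; keep the
        # existing one if the user is already opted in.
        if self._value is None:
            self._value = secrets.token_hex(10)
        return self._value

    def opt_out(self):
        # Opting out deletes the identifier; a later opt-in creates a new one.
        self._value = None
```

Because each opt-in after an opt-out produces a fresh identifier, two identifiers seen in the data may belong to the same installation, which is why the text above prefers terms like 'multi-session' over 'device'.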

Instrumentation Guidelines (Draft Outline)

The value of uniformity

Tokens and identifiers

The concepts of a user session, a pageview, and longer-term tokens, and the various ways they have been defined and implemented. Why to prefer a common definition; evidence for what that definition should be. What to avoid when rolling your own token outside of EventLogging.

Persistence

  • Cookies, sessionStorage, localStorage.
  • Eviction policies, early eviction, LRU etc. Differences across browsers. Lack of guarantees for long TTL identifiers. Privacy concerns with long TTL identifiers. Pros/cons for using these versus session-length identifiers.
  • Without a longer-term identifier above the session level, it is not possible to see how many sessions a particular user has.

Duplicated or unfired events (don't trust the browser)

  • Some browsers will fire multiple events at certain times, especially onunload
  • The transition into onunload is not consistent across browsers
  • Different browsers and systems have different transition models and you need to capture that

Time, entropy are not in your control

You cannot guarantee control over the client's notion of its local time, or over the way in which it generates pseudo-randomness.

For browser instrumentation

  • Do not execute code on page load if you can help it.

OS/platform/UA

  • Do you try to do processing / categorization of these fields on your own on the client?
  • Or do you just send the UA to the server as-is?