Analytics/Data Lake/Traffic/SessionLength

From Wikitech

Session Length is a standard web metric that measures the time users engage with a site. In some instances long session lengths might be indicative of "successful" user interactions (games, social networks). In others, quite the opposite (search engines). For Wikipedia and other Wikimedia free knowledge wikis, session length is important to understand and monitor how readers are interacting with our sites and to help us evaluate the impact of new features and other initiatives. The objective of this document is to describe all aspects related to this metric, including requirements, methodology, implementation, etc.


Figure 1: Illustration of how sessions are counted differently in our session length calculation process, depending on their length and start time.
Figure 2: Illustration of the tick event-based session length calculation process described on this page.

Our commitments to privacy and anonymity inform many of our technical decisions. For example, we don't uniquely identify devices to count them – preferring instead to use a solution based on the notion of "last access" (cf. unique devices research). Our solution for session length metrics follows a similar privacy-focused pattern of sending a minimal amount of data without any identifying information.

The source of the data is the Session Tick instrument, which manages locally-stored session information – e.g. resets the session after a long enough span of inactivity – and sends "I'm alive" events at a regular interval. These are the ticks. This is illustrated in Fig 1 (which highlights 4 sessions that exemplify 4 different scenarios) and the first panel of Fig 2 (which shows 100 simulated sessions).

Example sessions

Session | Scenario | Notes
A | Session started before the current day | Session counted as 2 separate sessions (short session day before, long session current day)
B | Session exceeds the end of current day | Session counted as 2 separate sessions (short session current day, long session next day)
C | Session started and ended within current day | Session counted as 1 session
D | Session started before and exceeded the current day | Session counted as 3 separate sessions (short session day before, day-long session current day, and short session next day)

The only information transferred is the current value of the counter – no session IDs and no user IDs. On the receiving end all we see is a mix of 0s, 1s, 2s, etc. – indicating how many ticks each session has lasted so far – from millions of sessions across the world. All tick events from all the sessions are mixed together, and the raw dataset resembles the second panel of Fig 2.

Once all of the day's events have been processed and stored, we can calculate the session length metrics for that day in the form of percentiles. First, we count how many instances of each tick we received: how many "0" ticks, how many "1" ticks, how many "2" ticks, etc. – illustrated in the third panel of Fig 2.

However, while that panel differentiates between last ticks and all the intermediate ticks, in practice we don't actually know which ticks were their sessions' last. Fortunately, we can employ a clever trick to deduce how many sessions lasted 20 ticks, how many lasted 12 ticks, and how many did not even last long enough for the first tick (beyond the initial 0 tick). By starting with the lowest observed tick value and working upwards, we are able to estimate the number of sessions of each length. From those counts of sessions and how long they lasted we are able to calculate percentiles (% of sessions lasting up to some length, % of sessions lasting at least some length) – illustrated in the remainder of Fig 2. Refer to § Methodology for a walkthrough with example data.

This process depends on some assumptions, namely that tick events won't be lost in the transmission (or at least the number of lost events is negligibly small) and that the volume of interrupted sessions (sessions on the boundary between two days) is sufficiently small against the volume of sessions contained within the 24-hour aggregation window. Fig 1 illustrates this latter, literal edge case: some sessions start before the window, so when we're counting ticks within the window we do not count those sessions' early ticks (session A); and when sessions last beyond the window, those sessions end up being counted as shorter than they actually were (sessions B and D). Refer to § Caveats for additional details.

The remainder of this document describes the requirements, methodology, and implementation in greater detail.



The instrumentation for this metric is likely to send a large number of events. Sending events every N seconds might significantly drain the battery on mobile devices (the radio gets woken up if the phone is idle, which might be costly). We should take this into account and try to reduce network usage, e.g. by queuing events and sending them in batches.

Data size

As this metric might produce large amounts of data, we should make sure that the pipeline can sustain it:

  • EventGate must be able to absorb the event throughput.
  • Jobs that pull data from Kafka (Camus or Goblin) and process it (Refine) must be able to deal with the data size.
  • The data stored permanently in Hive must be of a sustainable size.

We should consider instrumentation sampling and raw data purging after a given time period.

Tab browsing

Tab browsing (where you move across tabs while browsing Wikipedia articles) is a common browsing pattern on Wikipedia and we want to make sure the instrumentation takes that into account. Using the Visibility API we can know which tab is active but in order to send valid data we need to synchronize across tabs. We need the browser to support the Visibility API for this methodology to be viable. To communicate across tabs we can use cookies or LocalStorage.

Session definition

We should use the "universal analytics definition" of a session as a reference for this calculation. See: How a web session is defined in Universal Analytics


The session length metric is commonly calculated using session IDs, which make it possible to group the collected data by session and thus easily determine each session's length. However, at the Wikimedia Foundation we try to follow the privacy-by-design principle, so we should avoid adding yet another identifier to our collected data and try to calculate this metric without identifiers, if possible.

Methodology



To determine the length of a given session, it makes sense to calculate the elapsed time, session_end - session_start. session_start is easy to determine; session_end, however, is not. There are many ways a session can end (user closes tab, user closes browser, user shuts down device, long inactivity period, window becomes hidden, etc.), and it's technically challenging to monitor all these possibilities and to report them once, say, the device has been shut down.

Therefore, our approach is to use heartbeats. Whenever a user starts a session by visiting a wiki page, we set up a heartbeat clock in their browser that will tick at regular intervals, e.g. every 60 seconds. At each tick (or heartbeat), we check how much time has passed since the start of the session, and send an event with that information. If the user has been inactive (their tabs are not visible) for more than 30 minutes, we reset their session clock to 0. Here's an example of what an event would look like:

  timestamp: "2021-02-01T15:23:04Z",
  wiki: "",
  ping: 4

Where 'timestamp' is an ISO timestamp with second precision that indicates the time the event was sent; 'wiki' is the domain of the wiki that the user was visiting; and 'ping' is the ordinal number of the heartbeat registered on the session clock at the time the event was sent. As you can see, there are no session identifiers in the event as discussed.
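
The heartbeat logic described above can be sketched as a small simulation. This is only an illustration in Python, not the instrument's actual code (which runs in the browser). The class name SessionClock, its methods, the numeric timestamps, the choice to start counting at 0, and the domain "example.wiki" are all assumptions made for the example.

```python
from dataclasses import dataclass, field

TICK_INTERVAL_S = 60          # heartbeat interval from the text (60 seconds)
INACTIVITY_RESET_S = 30 * 60  # reset the session after 30 minutes of inactivity


@dataclass
class SessionClock:
    """Hypothetical model of the per-browser session clock."""
    ping: int = 0               # ordinal number of the next heartbeat
    last_active: float = 0.0    # time (in seconds) of the last heartbeat sent
    sent: list = field(default_factory=list)

    def heartbeat(self, now: float, wiki: str) -> dict:
        """Called on every tick while a tab is visible; returns the event sent."""
        if now - self.last_active > INACTIVITY_RESET_S:
            self.ping = 0       # long inactivity: a new session starts
        # The event carries no session or user identifier.
        event = {"timestamp": now, "wiki": wiki, "ping": self.ping}
        self.sent.append(event)
        self.ping += 1
        self.last_active = now
        return event


clock = SessionClock()
# Three regular heartbeats, then one after a gap of 31 minutes.
times = (0, TICK_INTERVAL_S, 2 * TICK_INTERVAL_S, 2 * TICK_INTERVAL_S + 31 * 60)
pings = [clock.heartbeat(t, "example.wiki")["ping"] for t in times]
print(pings)  # [0, 1, 2, 0]
```

The last heartbeat arrives after more than 30 minutes of inactivity, so the counter restarts at 0 and the receiving end sees it as the first tick of a new session.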


With this structure, this is what all the events sent by the same session might look like (leaving out the wiki dimension to simplify the explanation):

(2019-05-18T01:24:08Z, 1)
(2019-05-18T01:25:08Z, 2)
...
(2019-05-18T01:53:08Z, 30)

Once we have this, it's easy to calculate that session's length: max(ping) * interval; where 'interval' is the interval length of the clock tick. For example, if the maximum ping for the session is 30 and the interval length is 1 minute, the corresponding session length would be 30 minutes.

Now, how do we group the events by session, given that we don't have session identifiers? The answer is that we don't really need to. The instrumentation guarantees that if an event with tick=N exists, then there must also exist N earlier events from the same session with tick=0, tick=1, tick=2, ..., tick=N-1; because to reach the point of sending an event with, e.g., tick=3, the clock must have previously sent events with tick=0, tick=1 and tick=2. This means that the events collected for all sessions will always be distributed in a pyramid form:

count(tick=0) >= count(tick=1) >= count(tick=2) >= ...

And with that, we can calculate how many sessions of length N are there with the formula:

sessions_of_length_N = count(tick=N) - count(tick=N+1)

Where count(tick=N) is the number of sessions that reached the Nth tick (could be of length N or longer), and count(tick=N+1) is the number of sessions that surpassed the Nth tick. The result is the number of sessions that reached the Nth tick but did not surpass it; that is, the number of sessions of length N.
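
As a minimal sketch of this formula, the following Python snippet derives the number of sessions per length from a flat, anonymous list of tick values. The function name sessions_by_length and the sample data are hypothetical, and the snippet assumes no events were lost.

```python
from collections import Counter


def sessions_by_length(ticks):
    """Estimate how many sessions ended at each tick value.

    `ticks` is an iterable of tick numbers taken from anonymous events,
    with all sessions mixed together.
    """
    counts = Counter(ticks)
    result = {}
    for n in sorted(counts):
        # Sessions that reached tick n minus sessions that went past it.
        result[n] = counts[n] - counts.get(n + 1, 0)
    return result


# Three made-up sessions whose last ticks are 0, 2 and 2, shuffled together.
mixed_ticks = [0, 1, 0, 2, 0, 1, 2]
histogram = sessions_by_length(mixed_ticks)
print(histogram)  # {0: 1, 1: 0, 2: 2}
```

Note that the function never needs to know which tick belongs to which session; only the per-value counts matter.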


Let's walk through an example: see below the clock tick events for 4 distinct sessions.

Time | Session 1 ticks | Session 2 ticks | Session 3 ticks | Session 4 ticks
2019-01-01 19:05 | 1 | 1 | - | -
2019-01-01 19:25 | 2 | 2 | - | 1
2019-01-01 19:45 | 3 | 3 | 1 | 2
2019-01-01 20:05 | 4 | - | 2 | 3
2019-01-01 20:45 | 5 | - | - | 4

From the sessions above we compute the following table:

N (ping) | count(ping=N) | count(ping=N) - count(ping=N+1)
1 | 4 | 0
2 | 4 | 1
3 | 3 | 1
4 | 2 | 1
5 | 1 | 1

So, with this calculation we determine that there are: 0 sessions of length 1, 1 session of length 2, 1 session of length 3, 1 session of length 4 and 1 session of length 5, which matches the 4 sessions in the example. We can use a table such as this to calculate the session length average, median, percentiles, etc., which can be easily transformed into visualizations.
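
To illustrate that last step, here is a short Python sketch that computes the mean, median and 90th percentile from the last column of the table above. The percentile definition used here (the smallest length L such that at least a share p of sessions lasted up to L) is one reasonable choice made for the example, not necessarily the one used in production; the function name is hypothetical. Lengths are expressed in ticks and would need to be multiplied by the tick interval to obtain time.

```python
def percentile(histogram, p):
    """Smallest session length L such that at least a share p of sessions lasted up to L."""
    total = sum(histogram.values())
    cumulative = 0
    for length in sorted(histogram):
        cumulative += histogram[length]
        if total > 0 and cumulative >= p * total:
            return length
    raise ValueError("histogram contains no sessions")


# count(ping=N) - count(ping=N+1) column from the walkthrough table.
histogram = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

total_sessions = sum(histogram.values())
mean_length = sum(n * c for n, c in histogram.items()) / total_sessions
median_length = percentile(histogram, 0.5)
p90_length = percentile(histogram, 0.9)

print(total_sessions, mean_length, median_length, p90_length)  # 4 3.5 3 5
```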



The instrument has been developed within the Event Platform system. Here is the corresponding Phabricator task. And here you can find the current code under the mediawiki-extensions-wikimediaEvents repository.

It implements a session definition based on activity, where a session is a set of consecutive user interactions not separated by more than 30 minutes. A page is considered inactive if it is:

  1. Hidden, as determined by the Page Visibility API.
  2. Idle: no events such as click, keyUp, scroll and/or visibility change occur.

Please see the code comments for more details.


The session length schema is called session_tick and you can find it here, under the schemas-event-secondary repository. It collects 3 main fields: the wiki domain, under the meta.domain field; the timestamp, under the dt/meta.dt fields; and the tick number, under the tick field. It also collects a couple of extra fields that can hold configuration values and test group information.


We studied how sampling would affect the accuracy of the resulting metric; you can find the study here. We concluded that low sampling rates would drastically reduce the metric's accuracy, especially for smaller wikis. On the other hand, we projected the data throughput and size (see this comment) and found that collecting data without sampling would be potentially problematic/heavy. So we concluded that the best sampling rate should be around 1/10, and that we would start collecting data at 1/100 as a test before transitioning to the final rate.

In parallel, the ways in which the Event Platform client was able to sample at the time were not sufficient: we needed to sample based on our current definition of session, to guarantee that all ticks belonging to a session (and no more) are collected. In this task you'll find more discussions and the corresponding code.
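
The idea of sampling per session rather than per event can be sketched as follows. This is a hypothetical Python illustration, not the Event Platform client's implementation; the class name SampledSession and its methods are assumptions made for the example.

```python
import random

SAMPLING_RATE = 1 / 10  # target rate discussed in the text


class SampledSession:
    """Hypothetical sketch: the sampling decision is made once per session."""

    def __init__(self, rng=random):
        self.rng = rng
        self.start_new_session()

    def start_new_session(self):
        self.tick = 0
        # Decided at session start and reused for every tick of that session.
        self.in_sample = self.rng.random() < SAMPLING_RATE

    def next_tick(self):
        """Return the tick event to send, or None if the session is not sampled."""
        event = {"tick": self.tick} if self.in_sample else None
        self.tick += 1
        return event


session = SampledSession(random.Random(0))
events = [session.next_tick() for _ in range(5)]
# Either every tick of the session is sent or none is; never a partial session.
all_or_nothing = all(e is None for e in events) or all(e is not None for e in events)
```

Keeping whole sessions in or out of the sample preserves the property that every tick=N event is accompanied by that session's earlier ticks, which the calculation relies on.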

Raw data

The raw data is placed by the Refine system in the Analytics Hadoop cluster (HDFS) under the /wmf/data/event/mediawiki_client_session_tick directory. It is queryable through Hive or Presto under the table name event.mediawiki_client_session_tick. The data is private, so your user needs to belong to the analytics-privatedata-users group.

The data contains 1 event per row. The schema is the same as the corresponding Event Platform schema mentioned above. Finally, data older than 90 days is purged on a daily basis, for privacy and space consumption reasons.

Intermediate data

The raw data contains all the information that we need, but it is not appropriate for analysis, and especially not for powering a dashboard (e.g. in Superset). The same information can be stored in a much more efficient way that allows us to perform analytical queries, like percentile approximation over long time series, interactively (in dashboards). Thus, the raw data is processed and replicated in a more efficient representation that we call the intermediate table. Each row in this intermediate data has a field that stores the session length in ticks (calculated with the formula described in the methodology, see queries) and the aggregated session count for that session length. It is queryable through Hive or Presto under the table name wmf.session_length_daily.

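Sample queries

-- Number of ja.wikipedia sessions that lasted at most 5 ticks on 2021-03-20
SELECT
  SUM(session_count) AS session_count
FROM wmf.session_length_daily
WHERE
  year = 2021
  AND month = 3
  AND day = 20
  AND project = 'ja.wikipedia'
  AND session_length <= 5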


A session length dashboard is available in Superset, built on the intermediate data in session_length_daily. In this dashboard we can explore percentiles of session length, proportions and counts of sessions in session length buckets, as well as estimated counts of sessions.

Caveats


Interrupted sessions

To compute the session length metric on a regular basis, we need to establish a querying period, for example, daily. Sessions that cross the day boundary (00:00h) are counted imprecisely; whether this is a problem for data precision remains to be seen. A session whose pings (5, 6, 7) fall on the "other side" of the boundary is artificially counted as a session of length "4", while a session of length "4" in the next interval will be artificially counted as a session of length "7". Note that session length counts will still be correct, but some of them may be attributed to the adjacent day. Over an extended period of time, this issue probably has little influence on the precision of the overall data, given the round-the-clock nature of Wikipedia's traffic. We could prune data that is out of sequence (e.g. by discarding sessions that start or finish in the middle of our interval), but that might be an intensive step, as with this type of calculation we would essentially need to look at every single data point to prune effectively. We probably need to quantify experimentally how big the problem of "sessions going across boundaries" is. When we say "boundary" we mean a day boundary, because we are calculating a histogram of sessions for a given UTC day.
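
The boundary effect described above can be reproduced with a few lines of Python. This is an illustrative sketch with made-up tick data; the helper sessions_by_length is a hypothetical name that applies the count(tick=N) - count(tick=N+1) formula from the methodology to one day's ticks.

```python
from collections import Counter


def sessions_by_length(ticks):
    """Apply count(tick=N) - count(tick=N+1) to the ticks received in one day."""
    counts = Counter(ticks)
    return {n: counts[n] - counts.get(n + 1, 0) for n in sorted(counts)}


# Session X really lasts 7 ticks, but ticks 5, 6 and 7 arrive after midnight.
session_x_day1 = [0, 1, 2, 3, 4]
session_x_day2 = [5, 6, 7]
# Session Y lasts 4 ticks and happens entirely on day 2.
session_y_day2 = [0, 1, 2, 3, 4]

day1_hist = sessions_by_length(session_x_day1)
day2_hist = sessions_by_length(session_x_day2 + session_y_day2)

print(day1_hist)  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}
print(day2_hist)  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 1}
```

Across the two days the result still contains one session of length 4 and one of length 7, but each is attributed to the other day.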

Lost events

In the calculation of the metric, we assume that there are no lost events. However, that's not necessarily true. Events can get lost due to network issues (or other issues). So we cannot ensure that 100% of the events will land, and thus cannot ensure total accuracy in the session length calculation. That said, a few missing events will not break the calculations; they will only proportionally alter the counts of sessions of a given length.

JS disabled

For clients that are not served JavaScript, we will not obtain any data. This includes clients that do not support JavaScript (a few) but also clients to whom we do not serve JavaScript (older versions of IE, for example).

Events by bots

Bots that crawl the site using JavaScript are going to be counted. Whether their effect on sessions is significant remains to be seen, because it could just skew the very end of the "short session" tail, meaning that we would see many more very short sessions than we really have.

Tab Browsing

Both the methods described have an issue with tab browsing. Tab browsing (where you move across tabs while browsing Wikipedia articles) is a common browsing pattern on Wikipedia and we want to make sure only one tab is sending "valid" pings. Using the Visibility API we can know which tab is active, but in order to send valid pings from that one tab we need to synchronize across tabs. We need the browser to support the Visibility API for this methodology to be viable.

To "catch" tabbed browsing, our only recourse for communicating across tabs appears to be LocalStorage, which is why we need to persist the value of the "ping" (as well as the session-length-identifier, should there be one) to LocalStorage every time we send a heartbeat.

Workflow when user opens a new tab:

  • Reader has been looking at page X for 20 seconds; there is a record in LocalStorage that looks like (00020, 4), where 00020 is the session-length-identifier and 4 is the number of pings that have happened. If there is no session-length-identifier, we will just be storing the value of the ping.
  • Reader opens a new tab for Wikipedia; the Page Visibility API tells the heartbeat loop that it needs to halt.
  • The new tab checks whether there is an ongoing session by verifying that the (session-length-identifier, ping) tuple in LocalStorage has not expired.
  • The loop executes and sends a heartbeat with incremented values, then persists the new values to LocalStorage, resetting the TTL.
  • And so on ...
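
The workflow above can be sketched as a small Python simulation. This is not the instrument's code: ExpiringStore is a hypothetical stand-in for LocalStorage with an expiry time stored next to each value, Tab is a hypothetical model of a browser tab, and the 30-minute TTL is an assumption borrowed from the session definition.

```python
class ExpiringStore:
    """Stand-in for LocalStorage; the expiry time is stored next to the value."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, now, ttl):
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        value, expires_at = self._data.get(key, (None, 0))
        return value if now < expires_at else None


class Tab:
    def __init__(self, store):
        self.store = store
        self.visible = False

    def heartbeat(self, now, ttl=30 * 60):
        """Send the next ping if this tab is visible; hidden tabs stay silent."""
        if not self.visible:
            return None
        last_ping = self.store.get("ping", now)
        ping = 0 if last_ping is None else last_ping + 1
        self.store.set("ping", ping, now, ttl)  # persist and reset the TTL
        return ping


store = ExpiringStore()
tab_a, tab_b = Tab(store), Tab(store)
tab_a.visible = True
sent = [tab_a.heartbeat(0), tab_a.heartbeat(60)]      # tab A sends pings 0 and 1
tab_a.visible, tab_b.visible = False, True            # reader switches to tab B
sent += [tab_a.heartbeat(120), tab_b.heartbeat(120)]  # A halts, B continues with 2
sent.append(tab_b.heartbeat(120 + 3600))              # record expired: new session
print(sent)  # [0, 1, None, 2, 0]
```

Because the ping value lives in the shared store rather than in each tab, the second tab continues the same session instead of starting a new one.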

Passive event listener support

To avoid imposing a performance penalty on older clients, we exclude clients without passive event listener support from the instrument. Before finalizing this exclusion, we instrumented all clients and included a flag indicating whether the client supported passive event listeners, in order to determine the percentage of clients lacking support and to compare the session length distributions with and without support. The data collected showed that only 0.77% of clients lack passive event listener support. Although the dropoff between ticks 0 and 1 was steeper in our sample for clients without passive event listener support, we decided to discontinue the experiment and disable session tick collection from this group in order to avoid further negatively impacting their user experience.[1]

Ad blockers

Currently, ad blockers affect the collection of the session_tick events. EventGate, our event intake service, is set up at a URL that contains the string analytics. Some ad blockers block URLs containing that string, because it's a common name used in third-party data collection. Thus, devices using ad blockers might not be sending session_tick events. It's difficult to determine what percentage of events we're losing, but depending on the country the percentage of devices that use ad blockers can be around 10%-40%. Probably not all ad blockers intercept session_tick events, but we might be losing a significant share of the total events.

Now, the session_length metric is not an absolute count, but rather a percentile on top of a time measurement. So, it should not be greatly affected by these missing events. However, if ad-blocker users have a different session_length behavior than non-ad-blocker users (which is possible!), then the session_length metric will be biased towards the behavior of non-ad-blocker users. We don't consider this a blocker for the announcement of the session_length metric, provided it's known, well documented, and a fix for the caveat is planned for the near future. On the other hand, this caveat can more directly affect the session_count metric, because it's an absolute count. Until we fix this caveat, we might be under-reporting session counts.

Alternative ideas

The following are ideas that we discarded, but we're leaving them here as reference.

Device-aware method

This method differs from the prior one in that it requires a device identifier. The identifier makes our code less privacy-friendly, more complex and more brittle, and its only advantage is that it mitigates the issue of sessions going across boundaries.

We would send a three-element tuple every N seconds (each tuple is referred to below as a "heartbeat"): (session-length-identifier, ping, timestamp)

  • session-length-identifier: a random token that mw can generate from a large space (2^80 at the time of this writing).
  • ping: identifies how many tuples have been sent.
  • timestamp: ISO timestamp with second precision, like 2019-05-19T01:24:08Z.

This would leave a set of records like the following for a session with identifier 010:

(010, 1, 2019-05-18T01:24:08Z)
(010, 2, 2019-05-18T01:24:13Z)
...
(010, 30, 2019-05-18T01:26:33Z)

In order to calculate session length, if we assume session-length-identifier is unique per session, we just need to do:

select max(ping)*N from SessionLength group by session-length-identifier

So, the difficult part is to ensure that session length identifier is unique per session.
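
For illustration, the same grouping can be written in Python. This is a sketch under the assumption that heartbeats arrive as (session-length-identifier, ping, timestamp) tuples; the function name session_lengths, the 5-second interval (the value used in the workflow on this page) and the second identifier "011" are made up for the example.

```python
from collections import defaultdict

PING_INTERVAL_S = 5  # N in the text; assumed to be 5 seconds here


def session_lengths(heartbeats, interval_s=PING_INTERVAL_S):
    """Map each session-length-identifier to max(ping) * interval, in seconds."""
    max_ping = defaultdict(int)
    for identifier, ping, _timestamp in heartbeats:
        max_ping[identifier] = max(max_ping[identifier], ping)
    return {identifier: ping * interval_s for identifier, ping in max_ping.items()}


heartbeats = [
    ("010", 1, "2019-05-18T01:24:08Z"),
    ("010", 2, "2019-05-18T01:24:13Z"),
    ("011", 1, "2019-05-18T01:24:10Z"),  # "011" is a made-up second session
]
lengths = session_lengths(heartbeats)
print(lengths)  # {'010': 10, '011': 5}
```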

The workflow would be (simplified; see § Tab Browsing):

  • Reader opens a Wikipedia page.
  • We create a session-length token using mw.getRandom().
  • We persist the session-length token and ping value to LocalStorage with a TTL of, say, 10 seconds (let's assume we are going to send pings every 5 seconds and that we have a way to "expire" records in LocalStorage, which could be a fake way like: [2]).
  • We send the initial tuple and set up a timer loop that checks every 5 seconds whether there is a valid token in storage and sends a ping.
  • The loop executes the first time, checks whether there is an ongoing session in LocalStorage and sends the second ping. It resets the expiration of the token to 10 seconds.
  • The loop executes the second time, and so on ...

Stretch-based collection

Overview: We can consider a session as a sequence of user activity actions (click, scroll, hover, ..?). Let's say that the interval between two consecutive user activity actions is a session stretch. The stretch can last up to 30 minutes, otherwise we consider the session ended. We can define a session stretch with two integers (a, b). a is the start of the stretch, expressed in seconds since the start of the session (a >= 0). b is the end of the stretch, expressed in seconds since the start of the session (b >= a).

Cookies: The client stores 2 cookies: sessionStartTs and lastActivityTs. sessionStartTs will contain the timestamp of the session's start. It is immutable within a session. lastActivityTs will contain the timestamp of the latest activity registered for the session. It is updated with each new user activity. They both have a TTL of 30 minutes, which is renewed with each new user activity.

Schema: The event schema is called sessionStretch and has 2 fields: startSeconds and endSeconds. startSeconds and endSeconds correspond to values a and b defined in the overview.

Front-end code: We should subscribe to different user events that determine session activity. Whenever any of those fires, we call process_stretch(), which does the following:

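import time

SESSION_TTL_SECONDS = 30 * 60  # both cookies expire after 30 minutes without activity

def process_stretch():
    # get_cookie, set_cookie and send_session_stretch_event are assumed helpers;
    # get_cookie is assumed to return None for a missing or expired cookie.
    current_activity_ts = int(time.time())
    session_start_ts = get_cookie('sessionStartTs')
    last_activity_ts = get_cookie('lastActivityTs')
    if session_start_ts is not None and last_activity_ts is not None:
        # Ongoing session: report the stretch between the last activity and now,
        # both expressed in seconds since the start of the session.
        stretch_start_seconds = last_activity_ts - session_start_ts
        stretch_end_seconds = current_activity_ts - session_start_ts
        send_session_stretch_event(stretch_start_seconds, stretch_end_seconds)
        set_cookie('sessionStartTs', session_start_ts, ttl=SESSION_TTL_SECONDS)  # just to reset TTL
        set_cookie('lastActivityTs', current_activity_ts, ttl=SESSION_TTL_SECONDS)
    else:
        # No session yet, or the previous one expired: start a new session.
        set_cookie('sessionStartTs', current_activity_ts, ttl=SESSION_TTL_SECONDS)
        set_cookie('lastActivityTs', current_activity_ts, ttl=SESSION_TTL_SECONDS)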

Possible improvement: To prevent very rapid user actions from sending many contiguous events to our back-end, process_stretch() can no-op if the last event was sent less than, e.g., 5 seconds ago.
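
One way to sketch this improvement is a generic throttling wrapper. The names throttled, min_gap_s and clock are hypothetical, and the injectable clock exists only so that the example can be run deterministically; it is an illustration rather than a proposed implementation.

```python
import time

MIN_EVENT_GAP_S = 5  # example threshold from the text


def throttled(func, min_gap_s=MIN_EVENT_GAP_S, clock=time.monotonic):
    """Wrap func so that calls made less than min_gap_s after the last accepted call are no-ops."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        now = clock()
        if last_call is not None and now - last_call < min_gap_s:
            return None  # too soon after the previous event: skip
        last_call = now
        return func(*args, **kwargs)

    return wrapper


# Simulate user actions at t = 0, 1, 4.9, 5 and 12 seconds with a fake clock.
fake_now = [0.0]
calls = []
process = throttled(lambda: calls.append(fake_now[0]), clock=lambda: fake_now[0])
for t in (0.0, 1.0, 4.9, 5.0, 12.0):
    fake_now[0] = t
    process()
print(calls)  # [0.0, 5.0, 12.0]
```

Skipped calls do not update the last-activity state, so the next accepted call still reports a stretch that covers the skipped activity.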

Back-end code: The session stretch events can be easily translated to heartbeats:

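from datetime import timedelta

def get_heartbeats(session_stretch):
    # session_stretch.dt is assumed to be the datetime at which the event was sent,
    # i.e. the moment the stretch ended (end_seconds after the session start).
    for i in range(session_stretch.start_seconds, session_stretch.end_seconds):
        # Second i of the session happened (end_seconds - i) seconds before dt.
        heartbeat_ts = session_stretch.dt - timedelta(seconds=session_stretch.end_seconds - i)
        yield (i, heartbeat_ts)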

Once we have the heartbeats corresponding to each session stretch, we can proceed to calculate the session length as described in the main algorithm. Storing raw(er) values and calculating heartbeats in the back-end is an advantage, because we can change the heartbeat interval (percentiles?) without needing to change instrumentation. And we can make the changes retroactive. It is more flexible overall.

  1. T274264