Analytics/Data Lake/Traffic/ReaderCounts

Reader Counts is a proposed metric for understanding reader behaviour. It is intended to guide the future development of features for readers, the vast majority of whom are anonymous (not logged in).

Overview

We are beginning to focus more on features that help funnel casual users into informed users, and informed users into contributors and editors. A handful of teams at the Foundation are working on features and goals aimed at streamlining the path through this funnel. However, to ensure that these have the desired effect on users and communities, we need some means of measuring how many users are in each stage of the funnel, and whether our efforts to encourage them to engage more with the content and eventually contribute to it are working as intended. To this end, we propose defining some cohorts of readers and adding the cohort as a dimension to various metrics.

Reader Cohorts

These are subject to change and are being discussed.

Let’s define some cohorts of users based on their activity over a window of W days.

  • One-off Readers: Rz are users who have read a Wikipedia article exactly once over W.
  • Occasional Readers: Ro are users who have read a Wikipedia article more than once but at most 5 times over W.
  • Regular Readers: Rr are users who have read a Wikipedia article more than 5 but at most 15 times over W.
  • Informed Readers: Ri are users who have read more than 5 articles and have visited a Talk page more than 2 times over W.
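
As a rough illustration, the definitions above can be written as a single function. This is a minimal sketch, not an agreed implementation. The function name and the string codes are hypothetical (they simply mirror the subscripts Rz, Ro, Rr and Ri). Checking the Informed cohort before the others is an assumption, as is returning None for counts that none of the listed definitions cover, such as more than 15 article reads with few Talk page visits.

```python
from typing import Optional


def classify_reader(article_views: int, talk_views: int = 0) -> Optional[str]:
    """Map view counts over the window W to a cohort code.

    Returns None when none of the listed definitions applies.
    """
    # Assumption: Informed takes precedence, since its article threshold
    # overlaps with the Regular cohort.
    if article_views > 5 and talk_views > 2:
        return "i"  # Informed Readers (Ri)
    if article_views == 1:
        return "z"  # One-off Readers (Rz)
    if 1 < article_views <= 5:
        return "o"  # Occasional Readers (Ro)
    if 5 < article_views <= 15:
        return "r"  # Regular Readers (Rr)
    return None  # not covered by the definitions listed so far
```

Keeping the thresholds in one small function makes it easy to revise them while the definitions are still under discussion.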

Analytics Questions

Given below is a representative, non-exhaustive list of questions that the addition of reader cohorts will allow us to answer:

  • In Africa on French Wikipedia on mobile devices, during the month of August, how many of our readers were each of "one-off", "occasional", "regular", "heavy", and "informed" readers?
  • Across all Wikipedias, what percentage of our pageviews came from each reader type?
  • During the month of August, what is the breakdown of articles by topic visited by each reader type?

Implementation Details

Calculation of the Cohort

  • Upon the first anonymous view from a given browser, a piece of JavaScript code creates a list of pages viewed in LocalStorage. Let’s call this the Recent Pageview List (RPL). At first this contains only the current page view.
  • A cookie, which we'll call "R-Cookie" for now, is set based on what's in the RPL. What exactly is in the R-Cookie is discussed in the next subsection, but suffice it to say that there is no PII or unique identifier in this cookie.
  • This R-Cookie is now sent with every subsequent request to the server.
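
The RPL bookkeeping described above can be modelled as follows. In practice this logic would live in client-side JavaScript backed by LocalStorage; the sketch uses Python only to show the data flow. The class name, the 30-day default for W, the "a" and "t" page kinds, and the decision to prune old entries on every write are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


@dataclass
class RecentPageviewList:
    """Models the RPL that client-side JavaScript would keep in LocalStorage."""

    window_days: int = 30  # assumed value for W
    # Each entry is (timestamp, page kind, page title).
    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, title: str, kind: str, now: datetime) -> None:
        """Drop views older than the window, then append the current view."""
        cutoff = now - timedelta(days=self.window_days)
        self.entries = [entry for entry in self.entries if entry[0] >= cutoff]
        self.entries.append((now, kind, title))

    def counts(self) -> Dict[str, int]:
        """Summarise by page kind only; titles and dates stay in the RPL."""
        totals: Dict[str, int] = {}
        for _, kind, _ in self.entries:
            totals[kind] = totals.get(kind, 0) + 1
        return totals
```

Only the output of counts() would be used to set the R-Cookie, which is what keeps page titles and visit dates on the client.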

There are two options for what to store in the R-Cookie:

Calculate Cohorts Client-Side

This approach uses client-side JavaScript to determine which cohort the contents of the RPL put the user in, and sets the R-Cookie value to something like "z" for one-off readers, "o" for occasional readers, and so on. While this makes downstream use of the cookie very straightforward, it presents two problems: a) it is difficult to determine upfront what the various thresholds and windows for determining cohorts should be, and b) the cohort calculation cannot be changed purely on the data pipeline side, so any change to it requires a push to the Wikis.

Send Raw Counts to Server

This approach involves sending a summary of all the counts from the RPL that a cohort might be calculated from, without transmitting anything about the specific pages involved or the specific dates on which they were visited. For instance, instead of setting the R-Cookie to a value such as "z", we would send a map that looks like “a=10,t=3”, which indicates that the user read 10 articles and 3 talk pages. Based on this map, we could calculate the cohort in the data pipelines. Given the use cases being discussed, this is the preferred approach.
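
A minimal sketch of how such a map could be written and read back is shown below. The function names are hypothetical, and the exact wire format (comma-separated key=value pairs, keys sorted, malformed pairs silently skipped) is an assumption extrapolated from the “a=10,t=3” example.

```python
from typing import Dict


def encode_counts(counts: Dict[str, int]) -> str:
    """Serialise counts as comma-separated key=value pairs with sorted keys."""
    return ",".join(f"{key}={value}" for key, value in sorted(counts.items()))


def decode_counts(cookie_value: str) -> Dict[str, int]:
    """Parse a cookie value back into counts, skipping malformed pairs."""
    counts: Dict[str, int] = {}
    for part in cookie_value.split(","):
        key, sep, raw = part.partition("=")
        key, raw = key.strip(), raw.strip()
        # Cookies are client-controlled, so anything unexpected is ignored.
        if sep and key and raw.isdigit():
            counts[key] = int(raw)
    return counts
```

Because the cookie can be edited by the client, the decoding side treats it as untrusted input and ignores anything that is not a non-negative integer count.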

Adding the Reader Cohort Dimension to Pageviews

  • Every pageview results in the RPL being updated, the cohort being recalculated from its current contents, and the R-Cookie being updated with the recalculated value.
  • The R-Cookie will always be “one pageview behind”, in that the value of the R-Cookie in a given request does not factor in the pageview that was returned through that request. The value of the R-Cookie will be part of webrequest and can be pulled into pageviews hourly (or other tables) as a dimension.
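
The “one pageview behind” behaviour can be seen in a small simulation. This is an illustrative model only: the function name is hypothetical, the raw-counts cookie format is assumed, and the very first request is modelled as carrying no cookie at all.

```python
from typing import Dict, List, Optional


def simulate_requests(page_kinds: List[str]) -> List[Optional[str]]:
    """Return the R-Cookie value attached to each request, in order."""
    counts: Dict[str, int] = {}
    cookie: Optional[str] = None  # no cookie exists before the first view
    sent: List[Optional[str]] = []
    for kind in page_kinds:
        # The request goes out with whatever the previous views produced.
        sent.append(cookie)
        # Only after the page is served does the client update the RPL
        # and rewrite the cookie.
        counts[kind] = counts.get(kind, 0) + 1
        cookie = ",".join(f"{k}={v}" for k, v in sorted(counts.items()))
    return sent
```

For two article views followed by a Talk page view, simulate_requests(["a", "a", "t"]) returns [None, "a=1", "a=2"], so the third request does not yet reflect the Talk page visit.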

Adding a Cohort to SessionLength

  • The SessionLength instrument is calculated by ticks sent through the SessionTick instrument.
  • Sending a cohort along with the session tick instrument would work. However, if the cohort changes in the middle of a session, the session would be broken into two. To avoid this, an acceptable solution might be to send no more than one distinct cohort value per UTC-bounded day, given that a session is limited to a UTC day.
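
One possible way to send at most one distinct cohort value per UTC day is to pin the first value seen on that day and reuse it for every later tick. The sketch below assumes a small key-value store (a plain dict standing in for client-side storage) and a hypothetical function name; it is not how the SessionTick instrument currently works.

```python
from datetime import datetime, timezone
from typing import Dict


def cohort_for_tick(pinned: Dict[str, str], current_cohort: str, now: datetime) -> str:
    """Return the cohort to attach to a session tick.

    `now` must be timezone-aware. The first cohort seen on a UTC day is
    pinned and reused until the UTC date changes.
    """
    day = now.astimezone(timezone.utc).date().isoformat()
    if pinned.get("day") != day:
        # New UTC day: pin whatever cohort the reader is in right now.
        pinned["day"] = day
        pinned["cohort"] = current_cohort
    return pinned["cohort"]
```

With this rule a reader who moves from "o" to "r" during the day keeps reporting "o" until the next UTC day begins, so the session is not split.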

Generalising

The two cases above are known metrics to which the cohort dimension needs to be added. The general pattern that we would need to follow for each such metric is:

  • Pick up the cohort information from the R-Cookie.
  • Decide how often the cohort information should be refreshed from the R-Cookie.
  • Ensure that the cohort is added as an additional dimension in various data pipelines (i.e. add another GROUP BY as needed). We would need to offer a UDF or equivalent that will map a set of counts to a cohort in the data pipelines.
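
The pattern above can be sketched in plain Python as a stand-in for the real pipeline. The row fields "project" and "r_cookie", the function names, the "unknown" bucket, and the thresholds are all assumptions for illustration; an actual pipeline would express the same idea as a UDF plus an extra GROUP BY column.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def cohort_udf(cookie_value: Optional[str]) -> str:
    """Map a raw-counts cookie such as "a=10,t=3" to a cohort label."""
    counts: Dict[str, int] = {}
    for part in (cookie_value or "").split(","):
        key, sep, raw = part.partition("=")
        if sep and raw.isdigit():
            counts[key] = int(raw)
    articles, talk = counts.get("a", 0), counts.get("t", 0)
    if articles > 5 and talk > 2:
        return "informed"
    if articles == 1:
        return "one-off"
    if 1 < articles <= 5:
        return "occasional"
    if 5 < articles <= 15:
        return "regular"
    return "unknown"  # missing cookie, or counts no definition covers


def pageviews_by_cohort(rows: Iterable[dict]) -> Counter:
    """Count pageviews per (project, cohort), like GROUP BY project, cohort."""
    keys: Iterable[Tuple[str, str]] = (
        (row["project"], cohort_udf(row.get("r_cookie"))) for row in rows
    )
    return Counter(keys)
```

Rows without a cookie, such as the first pageview from a browser, fall into the "unknown" bucket rather than being dropped, which keeps the per-cohort totals consistent with overall pageview counts.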

Considerations

  1. While LocalStorage as a feature is supported on all of our Grade A and Grade C browsers,[1] it is still not universal.
  2. LocalStorage is scoped per-origin. This means that each Wikipedia would have its own independent reader cohort computation. In other words, LocalStorage written through jawiki cannot be read from enwiki. This problem does not exist in the Cookie approach where setting a cookie on wikipedia.org would essentially make it available to all the Wikipedias.
  3. Particularly intensive reading sessions involving multiple tabs might cause race conditions when several tabs update the RPL at the same time.

Alternative Approach

An alternative approach that was considered involved setting a unique cookie for each reading session and grouping all reading activity by that cookie to find out which cohort a user belongs to. This was not pursued on account of its privacy implications: it generates far too much new data and leaves us with a single cookie that identifies a user and ties them to everything they have read so far.

References

  1. https://www.mediawiki.org/wiki/Compatibility#Browsers