Talk:Analytics/Data Lake/Traffic/ReaderCounts

Cohorts and thresholds

The initial thresholds are (a rough sketch of the mapping follows the list):

  • Occasional: 1-5 article views
  • Regular: 6-15 article views
  • Informed: 5+ article views AND 2+ talk page views
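
For concreteness, here is a minimal sketch of how these thresholds could map counts to a cohort; the function shape and its input names are assumptions, not part of the proposal:

    # Strawman thresholds from the list above. "Informed" (5+ article views
    # AND 2+ talk page views) overlaps the other two ranges, so it is
    # checked first; the ordering of checks resolves that ambiguity.
    def assign_cohort(article_views, talk_views):
        if article_views >= 5 and talk_views >= 2:
            return "informed"
        if 6 <= article_views <= 15:
            return "regular"
        if 1 <= article_views <= 5:
            return "occasional"
        return None  # zero views, or 16+ article views with fewer than 2 talk page views

Note that because "Informed" starts at 5+ article views it overlaps both other ranges, so any refinement of the thresholds should probably also resolve that ambiguity.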

This paper analyzed "Wikipedia rabbit holes" and found that on English Wikipedia:

  • "Number of articles loaded in a session is skewed with average of 24 pageloads per session and a median of 18 pageloads (Q1 = 14, Q3 = 28)."
  • "Median depth of the trees is 13, meaning that half of the rabbit hole sessions do not extend beyond 12 clicks away from the first page."

We may want to consider different thresholds, or perhaps use something like the average number of articles per session? — Bearloga (talk) 18:56, 6 January 2023 (UTC)

Thank you for sharing the paper. The initial thresholds were chosen entirely arbitrarily and are intended as a strawman. Could you please update them with your best understanding? From the product side it would be very helpful if the ordering of the thresholds represents a gradient of sophistication and/or familiarity with the Wikis. SCherukuwada (talk) 13:25, 18 January 2023 (UTC)
"gradient of sophistication and/or familiarity with the Wikis" Oooh I like that a lot! Might need to collaborate with Design Research on that (and PdM's, of course). Bearloga (talk) 16:35, 19 January 2023 (UTC)Reply

Multimedia viewers and Wikimedia Commons

If the cohort definitions can include views of talk pages, we can also track views of file pages (including with the multimedia viewer on wiki pages) and include that. Maybe "engaged" readers, since they're engaging with our content beyond the text.
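
As a purely hypothetical sketch of what that could look like (the "file_views" key, the 1+ threshold, and the rule itself are all invented here for illustration):

    # Hypothetical "engaged" rule layered on top of the strawman counts: a
    # reader who views file pages (e.g. via the multimedia viewer) in
    # addition to articles would qualify.
    def is_engaged(counts):
        return counts.get("article_views", 0) >= 1 and counts.get("file_views", 0) >= 1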

Hm… We'd need to have a completely different set of cohorts for commonswiki. — Bearloga (talk) 19:02, 6 January 2023 (UTC)

This sounds very reasonable to me, but I defer to (talk) for whether this matches what the product manager is looking for. SCherukuwada (talk) 13:45, 18 January 2023 (UTC)

How the client-side part could be done (alternatively)

From the client-side implementation:

While this makes downstream usage of the contents of the cookie very straightforward, it presents two problems, in that a) it is difficult to determine upfront what the various thresholds and windows for determining cohorts will be, and b) the cohort calculation cannot be done purely on the data pipeline side and will require a push to the Wikis.

What if we did something similar to the EventStreamConfig extension? Stream configs are available client-side and can be updated rather quickly, without an MW train deployment. And just like stream configs, we could expose these cohort specs via an API, meaning we could implement this on the Android & iOS Wikipedia apps. (They download the stream config via https://meta.wikimedia.org/w/api.php?action=streamconfigs&constraints=destination_event_service=eventgate-analytics-external&all_settings=true at the start of each session.)

We could come up with a system of describing cohorts flexibly, allowing us to push out new cohort definitions without redeploying the logic.
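
As a sketch of that idea (only the pattern of the streamconfigs call above is real; the "readercohortconfigs" action, the spec schema, and its field names are invented for illustration):

    import requests

    API = "https://meta.wikimedia.org/w/api.php"

    def fetch_cohort_specs():
        # Analogous to action=streamconfigs: apps would fetch this once per
        # session, so new cohort definitions roll out without a redeploy.
        resp = requests.get(API, params={"action": "readercohortconfigs",
                                         "format": "json"})
        resp.raise_for_status()
        return resp.json()["cohorts"]

    def assign_cohort(specs, counts):
        # First matching spec wins, so ordering in the spec resolves
        # overlapping ranges. Example entry:
        #   {"name": "informed",
        #    "bounds": {"article_views": [5, None], "talk_views": [2, None]}}
        for spec in specs:
            if all(lo <= counts.get(key, 0) and
                   (hi is None or counts.get(key, 0) <= hi)
                   for key, (lo, hi) in spec["bounds"].items()):
                return spec["name"]
        return None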

This would be a very, very costly approach.

Okay, yeah, sending a map of counts is clearly the better option. — Bearloga (talk) 19:59, 6 January 2023 (UTC)
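
For illustration, the "map of counts" option amounts to the client persisting and sending nothing but a few counters (the key names here are assumed), with the cohort label computed downstream:

    # All the client would keep and send; thresholds can then change in the
    # ETL pipeline without touching clients at all.
    counts = {"article_views": 7, "talk_views": 2}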

This might well just fall out of how the instrument is built. Client-side instrumentation and its config – e.g. whether it is enabled, and its sampling unit and rate – are both served via ResourceLoader. This eliminates the need for a distinct API call to fetch the config for the instrument while retaining the kinds of behaviour that we would like: code and config are cached at the edge for up to 30 days, and each can be updated, with its cache entries invalidated, independently.
Creating an API to serve the config is a small amount of work. However, it does come with an ongoing maintenance cost. If we were to create such an API, we should be very clear that it is unstable and could be deprecated without notice. – Phuedx (talk) 19:40, 9 January 2023 (UTC)
I defer to your wisdom on this. However, in the proposal as it is, the cohort calculation is entirely outside of the serving path. The client only sends counts that the ETL pipelines can map to a cohort. The only hard-to-change parameter is the window of time over which we keep page history on the client. Is that your understanding as well? – SCherukuwada (talk) 14:55, 18 January 2023 (UTC)
+1 to just having ETL pipelines take care of cohort mapping from the counts. And yes, the window of time parameter does seem like it would have to be baked in and require instrumentation re-deployment if we needed to change it. Bearloga (talk) 16:32, 19 January 2023 (UTC)
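
To illustrate why the window is the one baked-in parameter, here is a sketch of the client-side bookkeeping (the function, its names, and the 30-day value are assumptions, not the actual instrument):

    import time

    WINDOW_SECONDS = 30 * 24 * 3600  # example value; changing it means redeploying

    def record_view(history, now=None):
        # Record one page view and prune anything older than the window.
        # The in-window count is all the client ever sends; mapping counts
        # to a cohort stays in the ETL pipeline.
        now = time.time() if now is None else now
        history.append(now)
        history[:] = [t for t in history if now - t <= WINDOW_SECONDS]
        return len(history)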