Talk:Analytics/Data Lake/Traffic/Pageview hourly/Sanitization

From Wikitech

Public release

The intro to this page discusses a potential public release of pageview data, but the rest of the page doesn't differentiate between explorations that are intended to lead to a public release and those that are intended for long-term internal storage. Is a public release planned or is this really just an analysis for internal sanitization? --halfak (talk) 15:15, 9 February 2016 (UTC)

We have not yet officially decided which precise datasets will be released publicly, but releasing new datasets (geo-oriented or UA-oriented, for instance) is certainly in the plan. The sanitization process is primarily intended for long-term internal storage, and depending on which datasets we want to release, it might be reused. --Joal (talk) 17:00, 9 February 2016 (UTC)
Makes sense. Thanks. --halfak (talk contribs) 22:21, 15 March 2016 (UTC)

Request: Add examples to "The good Ks"

It's hard to follow what is being discussed re. groups and sub-groups without examples. These examples don't need to be real data, but they should help the reader (me!) see what you mean by "can almost always be identified as single sessions" and "can almost never be identified as single session". --halfak (talk contribs) 15:30, 9 February 2016 (UTC)

There are some examples after the code on the dedicated page for the analysis -- Joal (talk) 18:00, 9 February 2016 (UTC)
I made some edits there to make browsing the examples easier. It would be great to have your notes on the analysis.
E.g. en:Barkha_Dutt, en:Arnab_Goswami, en:Sagarika_Ghose and en:Rajdeep_Sardesai look like they could be a subgroup, but it's hard to say that en:Rosacea belongs.
But I think I'm following much better after looking here. It seems like the 10 or so examples you have listed aren't enough for a robust analysis. How many did you analyze before settling on the conclusions you report in the write-up? Looking at 20 random examples would be enough for a relatively powerful analysis. --halfak (talk contribs) 22:58, 15 March 2016 (UTC)
We looked at more results than the 10 presented here (the queries are limited to 100 :). However, it's better to provide you with tools for a good analysis! I have updated the examples to another set of 20 random lists and commented them (this edit), and actually found a very different result than previously: even with only two distinct pages, I have not been able to find single sessions. I have updated the core page accordingly (this edit). --Joal (talk) 14:05, 24 March 2016 (UTC)
Great! This looks good qualitatively. If we were going to release the data publicly, I think we'd want a statistically sound assessment, but I think this looks great for internal sanitization. --halfak (talk contribs) 19:48, 24 March 2016 (UTC)
One more thought. It looks like you settle on 3 IPs and 5 distinct pages, but do you really mean 3 IPs or 5 distinct pages? --halfak (talk contribs) 23:00, 15 March 2016 (UTC)
Actually we really mean 3 distinct IPs AND 5 distinct pages. This criterion not only prevents reattaching sessions to IPs (because there are 3 of them), but also makes it difficult to build sessions, given the number of pages viewed together. Put together, those two criteria give us more safety against the data being easily attackable. --Joal (talk) 14:05, 24 March 2016 (UTC)
I didn't see this assumption checked in the analysis. It seems like you look at distinct IPs and pages independently. Maybe 2 IPs and 4 distinct pages works pretty well in combination. Maybe it does not. Maybe we're just being conservative. That's OK too. --halfak (talk contribs) 19:48, 24 March 2016 (UTC)
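
To make the AND vs. OR distinction in this thread concrete, here is a minimal sketch, not the code used in the analysis. It assumes rows of (group key, IP, page) tuples, where the group key stands in for whatever dimensions define a group, and it reads "3 IPs and 5 pages" as "at least 3 distinct IPs and at least 5 distinct pages". The function name, the row layout and the sample data are all hypothetical.

```python
from collections import defaultdict


def safe_groups(rows, min_ips=3, min_pages=5, require_both=True):
    """Return the group keys that pass the distinct-IP / distinct-page thresholds.

    rows: iterable of (group_key, ip, page) tuples (hypothetical layout).
    require_both=True applies the AND criterion, False applies OR.
    """
    ips = defaultdict(set)
    pages = defaultdict(set)
    for key, ip, page in rows:
        ips[key].add(ip)
        pages[key].add(page)

    kept = set()
    for key in ips:
        enough_ips = len(ips[key]) >= min_ips
        enough_pages = len(pages[key]) >= min_pages
        if require_both:
            passes = enough_ips and enough_pages
        else:
            passes = enough_ips or enough_pages
        if passes:
            kept.add(key)
    return kept


# Made-up sample data, not real pageviews.
rows = []
rows += [("a", f"ip{i % 3}", f"page{i}") for i in range(5)]  # 3 IPs, 5 pages
rows += [("b", f"ip{i}", f"page{i % 2}") for i in range(3)]  # 3 IPs, 2 pages
rows += [("c", "ip0", f"page{i}") for i in range(6)]         # 1 IP, 6 pages

and_result = safe_groups(rows)                      # only "a" passes both thresholds
or_result = safe_groups(rows, require_both=False)   # "a", "b" and "c" each pass at least one
```

Changing min_ips and min_pages (for example to 2 and 4) is one way to check thresholds in combination rather than independently, as suggested above.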

Re. use of "hourly statistics"

The document compares alternatives of "hourly" and "monthly" statistics and then plainly states that hourly will be used because the team decided upon it. It would help me understand why this is desirable if some clear justification could be added. (e.g. "It came down to runtime. A month would just take too long to be viable.") --halfak (talk contribs) 15:32, 9 February 2016 (UTC)

Added some justification in that diff. Hopefully enough? -- Joal (talk) 18:17, 9 February 2016 (UTC)
Yup. Makes sense. Thanks --halfak (talk contribs) 23:02, 15 March 2016 (UTC)

Legends!

What do these colors mean?

Please add a legend to the final, animated graph. --halfak (talk contribs) 23:04, 15 March 2016 (UTC)

Done (this version)! --Joal (talk) 14:05, 24 March 2016 (UTC)
Much better. Thank you! --halfak (talk contribs) 19:54, 24 March 2016 (UTC)

Lead paragraph!

Please add a en:lead paragraph that gives a high-level overview of the problem/approach/results described in the doc. --halfak (talk contribs) 23:06, 15 March 2016 (UTC)

Done (this edit)! --Joal (talk) 14:05, 24 March 2016 (UTC)
Looks good. I made some edits to re-order and simplify links. --halfak (talk contribs) 19:54, 24 March 2016 (UTC)