Talk:Analytics/Data Lake/Traffic/Pageview hourly/Sanitization


Public release

The intro to this page discusses a potential public release of pageview data, but the rest of the page doesn't differentiate between explorations that are intended to lead to a public release and those that are intended for long-term internal storage. Is a public release planned or is this really just an analysis for internal sanitization? --halfak (talk) 15:15, 9 February 2016 (UTC)

We have not yet officially decided exactly which datasets will be released publicly, but releasing new datasets is certainly in the plan (geo-oriented or UA-oriented, for instance). The sanitization process is primarily intended for long-term internal storage, and depending on which datasets we want to release, it might be reused. --Joal (talk) 17:00, 9 February 2016 (UTC)
Makes sense. Thanks. --halfak (talkcontribs) 22:21, 15 March 2016 (UTC)

Request: Add examples to "The good Ks"

It's hard to follow what is being discussed re. groups and sub-groups without examples. These examples don't need to be real data, but they should help the reader (me!) see what you mean by "can almost always be identified as single sessions" and "can almost never be identified as single session". --halfak (talkcontribs) 15:30, 9 February 2016 (UTC)

There are some examples after the code on the dedicated page for the analysis -- Joal (talk) 18:00, 9 February 2016 (UTC)
I made some edits there to make browsing the examples easier. It would be great to have your notes on the analysis.
E.g. en:Barkha_Dutt, en:Arnab_Goswami, en:Sagarika_Ghose and en:Rajdeep_Sardesai look like they could be a subgroup, but it's hard to say that en:Rosacea belongs.
But I think I'm following much better after looking here. It seems like the 10 or so examples you have listed aren't enough for a robust analysis. How many did you analyze before settling on the conclusions you report in the write-up? Looking at 20 random examples would be enough for a relatively powerful analysis. --halfak (talkcontribs) 22:58, 15 March 2016 (UTC)
We looked at more results than the 10 presented here (the limit on the queries says 100 :). However, it's better to give you the tools for a good analysis! I have updated the examples to another set of 20 random lists and commented them (this edit), and actually found a very different result than previously: even with only two distinct pages, I have not been able to find single sessions. I have updated the core page accordingly (this edit). --Joal (talk) 14:05, 24 March 2016 (UTC)
Great! This looks good qualitatively. If we were going to release the data publicly, I think we'd want a statistically sound assessment, but I think this looks great for internal sanitization. --halfak (talkcontribs) 19:48, 24 March 2016 (UTC)
One more thought. It looks like you settle on 3 IPs and 5 distinct pages, but do you really mean 3 IPs or 5 distinct pages? --halfak (talkcontribs) 23:00, 15 March 2016 (UTC)
Actually we really mean 3 distinct IPs AND 5 distinct pages. This criterion makes it hard not only to reattach sessions to IPs, since there are 3 of them, but also to build sessions, given the number of pages viewed together. Put together, these two conditions give us more confidence that the data is not easily attackable. --Joal (talk) 14:05, 24 March 2016 (UTC)
I didn't see this assumption checked in the analysis. It seems like you look at distinct IPs and pages independently. Maybe 2 IPs and 4 distinct pages works pretty well in combination. Maybe it does not. Maybe we're just being conservative. That's OK too. --halfak (talkcontribs) 19:48, 24 March 2016 (UTC)
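For readers following the AND-versus-OR question above, here is a minimal sketch of what the combined criterion could look like. It is not the code used in the analysis: the row layout `(group_key, ip, page)`, the function name `groups_passing`, and the use of `>=` for the thresholds are all assumptions made for illustration.

```python
from collections import defaultdict


def groups_passing(rows, min_ips=3, min_pages=5, require_both=True):
    """Return the keys of groups that meet the distinct-IP / distinct-page thresholds.

    rows: iterable of (group_key, ip, page) tuples (hypothetical layout).
    require_both=True applies the AND criterion, False applies OR.
    """
    ips = defaultdict(set)
    pages = defaultdict(set)
    for group_key, ip, page in rows:
        ips[group_key].add(ip)
        pages[group_key].add(page)

    kept = set()
    for key in ips:
        enough_ips = len(ips[key]) >= min_ips
        enough_pages = len(pages[key]) >= min_pages
        if require_both:
            ok = enough_ips and enough_pages
        else:
            ok = enough_ips or enough_pages
        if ok:
            kept.add(key)
    return kept


# Made-up rows, not real data.
rows = [
    # group "a": 3 distinct IPs, 5 distinct pages
    ("a", "ip1", "p1"), ("a", "ip2", "p2"), ("a", "ip3", "p3"),
    ("a", "ip1", "p4"), ("a", "ip2", "p5"),
    # group "b": 3 distinct IPs, 2 distinct pages
    ("b", "ip1", "p1"), ("b", "ip2", "p2"), ("b", "ip3", "p1"),
    # group "c": 1 distinct IP, 1 distinct page
    ("c", "ip1", "p1"),
]

and_groups = groups_passing(rows)                     # AND: 3 IPs and 5 pages
or_groups = groups_passing(rows, require_both=False)  # OR: 3 IPs or 5 pages
```

Calling the same function with other values, for example `min_ips=2, min_pages=4`, would be one way to compare threshold pairs jointly rather than independently.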

Re. use of "hourly statistics"

The document compares alternatives of "hourly" and "monthly" statistics and then plainly states that hourly will be used because the team decided upon it. It would help me understand why this is desirable if some clear justification could be added. (e.g. "It came down to runtime. A month would just take too long to be viable.") --halfak (talkcontribs) 15:32, 9 February 2016 (UTC)

Added some justification in that diff. Hopefully enough? -- Joal (talk) 18:17, 9 February 2016 (UTC)
Yup. Makes sense. Thanks --halfak (talkcontribs) 23:02, 15 March 2016 (UTC)


What do these colors mean?

Please add a legend to the final, animated graph. --halfak (talkcontribs) 23:04, 15 March 2016 (UTC)

Done (this version)! --Joal (talk) 14:05, 24 March 2016 (UTC)
Much better. Thank you! --halfak (talkcontribs) 19:54, 24 March 2016 (UTC)

Lead paragraph!

Please add a en:lead paragraph that gives a high-level overview of the problem/approach/results described in the doc. --halfak (talkcontribs) 23:06, 15 March 2016 (UTC)

Done (this edit)! --Joal (talk) 14:05, 24 March 2016 (UTC)
Looks good. I made some edits to re-order and simplify links. --halfak (talkcontribs) 19:54, 24 March 2016 (UTC)