Talk:Data Platform/Systems/EventLogging/Sanitization vs Aggregation

From Wikitech

To be honest, I'm pretty agnostic about the proposed solutions, aside from the amount of work involved to implement them. Having said that, it seems the 2nd option is a little less intensive, so I lean that way.

Update from Analytics: You're right, option 2 has fewer to-dos, but that's mainly for the Analytics team. The quantity of work for the Mobile team would be similar in both cases, and our preference would be option 1 for several reasons:

In option 1, the Mobile team needs to modify the instrumentation, but this doesn't need to happen right away; Analytics can carry on with the EL audit without the new instrumentation. We just want to make sure that new schemas created in the future possess the new field. In addition, option 1 is less aggressive with the data: it bucketizes the editCount field instead of deleting it.

In option 2, the Mobile team needs to determine and implement the set of metrics they want to persist in the reports before the auto-purging starts. Otherwise the new metrics won't be able to have historical data on the editCount field. So the Analytics EL audit would be blocked on this task for the Mobile team.

I saw the comment about how it wouldn't be query-able via SQL… but there isn't anything stopping us from storing the results in a database rather than a log file… which may or may not be helpful in this case. Either way, I'd like to see people more affected by this than me chime in, as I think their preferences are more important.

Update from Analytics: You're right here as well, there's no technological reason stopping us from storing that data in another DB. We'd just prefer not to have various DBs storing potentially sensitive EventLogging data. We'd like to have a single source of that data for easier control. We consider the EL report pipeline part of the EL system, and we'd prefer to use it to create the report files rather than a new replication-like feature.

[Above from Corey Floyd]

Does this affect the Limn graphs / Mobile report card?

Just wondering if this would affect http://mobile-reportcard.wmflabs.org/#apps-graphs-tab. There we query for much older data than just 90 days.

My guess is no because those SQL queries don't seem to be using the editCount field. Instead, those reports calculate number of events using the COALESCE and SUM functions. Just checking. BearND (talk) 20:25, 29 July 2015 (UTC)

Update from Analytics: Exactly, there are 2 ways of querying the EL database for reports: 1) incrementally, day by day, which is not a problem because it uses only recent data; and 2) globally, in which case the query cannot use auto-purged fields. But in short, the mobile-reportcard will continue working normally. In fact, we at Analytics are working on it right now to unbreak several reports that were stuck.

I like Option 2 better

Seems to be much less work, seems like no data will be deleted, and we can dump the TSV reports to a database or use something like this to query it if we need to. Jhernandez (talk) 12:42, 30 July 2015 (UTC)

Update from Analytics: In fact, with option 2 more data will be deleted: option 1 means deleting userId and bucketizing editCount, while option 2 means deleting both userId and editCount. Sorry if this was not clear. Please also read the other comments on both options in the first question of the page.
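To make the difference between the two options concrete, here is a rough sketch in Python. It is not the actual EventLogging sanitization code: the bucket boundaries, the labels, the function names, and the `action` field in the usage note are all assumptions made up for illustration. Only the userId and editCount field names come from the discussion above.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries and labels; the discussion does not say
# which buckets would actually be used.
BUCKET_EDGES = [1, 5, 100, 1000]
BUCKET_LABELS = ["0 edits", "1-4 edits", "5-99 edits", "100-999 edits", "1000+ edits"]


def bucketize_edit_count(edit_count):
    """Map an exact edit count to a coarse bucket label."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, edit_count)]


def sanitize_option_1(event):
    """Option 1: drop userId, keep editCount only as a bucket."""
    sanitized = {key: value for key, value in event.items() if key != "userId"}
    if "editCount" in sanitized:
        sanitized["editCount"] = bucketize_edit_count(sanitized["editCount"])
    return sanitized


def sanitize_option_2(event):
    """Option 2: drop both userId and editCount."""
    return {
        key: value
        for key, value in event.items()
        if key not in ("userId", "editCount")
    }
```

For a made-up event such as `{"userId": 42, "editCount": 7, "action": "click"}`, option 1 would keep a coarse editCount bucket alongside the other fields, while option 2 would keep only the other fields.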

Then I guess Option 1 is the simplest (only one place to go for data) instead of having separate stored reports for the edit buckets. Whatever you guys think is better. Jhernandez (talk) 10:51, 31 July 2015 (UTC)

Looks like Option 1 turns out to be easier

Based on the discussion here and further talking with Joseph and Kevin, it looks like Option 1 is actually less work for Reading and also keeps the audit and necessary purging on track, all the while without breaking Limn graphs. --Dr0ptp4kt (talk) 23:29, 4 August 2015 (UTC)

Quick comment: I don't know what's better, but option 1 is easier to understand. Nemo 21:36, 31 January 2016 (UTC)