Analytics/Data Lake/Traffic/Banner activity
Banner Activity is a private data set served through Analytics' Druid datastore and the Pivot analysis tool. It contains data about banner impressions, campaigns, status codes, and other dimensions across all wikis and languages, at minutely resolution. Because it might contain privacy-sensitive data, it is accessible only to employees and NDA holders, and its sensitive fields are automatically removed after 90 days. The data starts on November 28, 2016.
DIMENSIONS
|Name||Type||Description|
|Time||time||Highest resolution: minutely. Starting 2016-11-28.|
|Anonymous||boolean||Whether the user receiving the banner was anonymous.|
|Banner||string||Id of the shown banner.|
|Bucket||int (0, 1, 2, or 3)||Bucket number for testing purposes.|
|Campaign||string||Id of the campaign the shown banner belongs to.|
|Country||string||Country code where the banner was shown.|
|Country Matches Geocode||boolean||Whether the country passed from the client matches the country extracted by geocoding.|
|Device||string (desktop, android, iphone, ipad, unknown)||Device where the banner was shown.|
|Project||string||Project, as in Wikipedia, Commons, Wiktionary, etc.|
|Region||string||Region where the banner was shown.|
|Sample Rate||float||Rate at which the banner events are sampled, in the interval (0, 1].|
|Status Code||string||Code describing the execution path of the banner logic.|
|Uselang||string||Language code of the language selected by the user.|
MEASURES
|Name||Type||Description|
|Request Count||int||Number of requests concerning banner activity (sampled).|
|Normalized Request Count||int||Normalized request count (Request Count / Sample Rate).|
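As a minimal sketch of how the Normalized Request Count measure relates to the other two values, the following Python function divides the sampled count by the sample rate. The function name and the rounding to an integer are assumptions made for illustration; the actual loading jobs may compute the value differently.

```python
def normalized_request_count(request_count: int, sample_rate: float) -> int:
    """Estimate the unsampled number of requests.

    Hypothetical helper: applies Request Count / Sample Rate as described
    in the measures table. Rounding to int is an assumption, chosen because
    the measure is documented as an int.
    """
    # Sample Rate is documented as lying in the interval (0, 1].
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in the interval (0, 1]")
    return round(request_count / sample_rate)


# 12 sampled requests at a 10% sample rate stand for roughly 120 real requests.
estimate = normalized_request_count(12, 0.1)
```

With a sample rate of 1, every request is recorded and the normalized count equals the request count.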
Loading of the data
The Banner Activity data set is loaded using a collection of Oozie jobs and a Spark-Scala streaming job.
Nearly real-time (streaming) loading
The Spark-Scala streaming job continuously loads banner data into the data set at minutely resolution. It cannot geocode, so it leaves two fields unpopulated: "Region" and "Country Matches Geocode". This job is experimental for now, and the Analytics team does not give high priority to fixing it if it fails.
An Oozie job runs every day and loads a full day of data, overwriting the real-time data if present. The advantages of this daily job are better (daily) compaction in the Druid datastore and the addition of the two geocode fields, "Region" and "Country Matches Geocode". Note that the data is still stored at minutely resolution.
Another Oozie job runs every month and reloads a full month of data, overwriting the daily data. The advantage is purely efficiency: when the data is organized in monthly blocks, Druid can compact it better and work more efficiently. Note that the data is still stored at minutely resolution. Also, this job does not recompute the data; it reuses the already processed daily data stored in Druid format.
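The overwrite order of the three loading jobs can be sketched in Python as follows. This is only an illustration of the precedence described in this section, not how Druid or the Oozie jobs are implemented; the load labels and function names are hypothetical.

```python
from typing import Optional, Set

# Most authoritative first: each load overwrites the ones listed after it.
# The labels are hypothetical names for the three jobs described above.
PRECEDENCE = ("monthly", "daily", "streaming")


def effective_load(completed_loads: Set[str]) -> Optional[str]:
    """Return which load currently serves a time period, or None if no load ran."""
    for load in PRECEDENCE:
        if load in completed_loads:
            return load
    return None


def has_geocode_fields(completed_loads: Set[str]) -> bool:
    """"Region" and "Country Matches Geocode" exist only after the daily load."""
    return effective_load(completed_loads) in ("daily", "monthly")


# A period covered only by the streaming job lacks the geocode fields.
recent = {"streaming"}
print(effective_load(recent), has_geocode_fields(recent))
```

Once the daily job has run for the same period, `effective_load({"streaming", "daily"})` returns `"daily"` and the geocode fields become available.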
Sanitization after 90 days
Once the data is 90 days old, the last Oozie job repeats what the monthly loading job did, but removes the privacy-sensitive fields: "Region" and "Device". The data resolution remains minutely. This job does not recompute the data either; it reuses the already processed monthly data stored in Druid format.
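The effect of the sanitization step on a single record can be sketched in Python as below. This is a simplified illustration under assumed names: the real job operates on Druid-format data rather than Python dictionaries, and the `sanitize` function and the record layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fields removed by sanitization, as listed above.
SENSITIVE_FIELDS = ("Region", "Device")
RETENTION = timedelta(days=90)


def sanitize(event: dict, now: datetime) -> dict:
    """Return a copy of the event, without sensitive fields if it is older than 90 days.

    Hypothetical helper: `event` is assumed to be a dict keyed by dimension
    name, with a timezone-aware datetime under the "Time" key.
    """
    if now - event["Time"] <= RETENTION:
        return dict(event)  # still within the retention window: keep everything
    return {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}


now = datetime(2017, 6, 1, tzinfo=timezone.utc)
old_event = {
    "Time": datetime(2017, 1, 1, tzinfo=timezone.utc),
    "Country": "NL",
    "Region": "NH",
    "Device": "desktop",
    "Request Count": 3,
}
sanitized = sanitize(old_event, now)  # "Region" and "Device" are dropped
```

All other dimensions and measures, including the minutely timestamp, pass through unchanged.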
Examples of use
Changes and known problems since 2017-05-01
|Date from||Date until||Task||Details|