Analytics/Data Lake/Traffic/Banner activity

Banner Activity is a private data set served through Analytics' Druid datastore and the Pivot analysis tool. It contains data about banner impressions, campaigns, status codes and other dimensions across all wikis and languages, at minutely resolution. Because it may contain privacy-sensitive data, it is accessible only to employees and NDA holders, and its sensitive data is automatically deleted after 90 days. The data starts on November 28, 2016.

Current Schema

Time                      time       Highest resolution: minutely. Starting 2016-11-28.
Anonymous                 boolean    Whether the user receiving the banner was anonymous.
Banner                    string     Id of the shown banner.
Bucket                    int        (0, 1, 2, or 3) Bucket number for testing purposes.
Campaign                  string     Id of the campaign the shown banner belongs to.
Country                   string     Country code where the banner was shown.
Country Matches Geocode   boolean    Whether the country passed from the client matches
                                     the country extracted by geocoding.
Device                    string     (desktop, android, iphone, ipad, unknown) Device where
                                     the banner was shown.
Project                   string     Project as in: Wikipedia, Commons, Wiktionary, etc.
Region                    string     Region where the banner was shown.
Sample Rate               float      Rate at which banner events are sampled, in (0, 1].
Status Code               string     Code describing the execution path of the banner logic.
Uselang                   string     Language code of the language selected by the user.

Request Count             int        Number of requests concerning banner activity (sampled).
Normalized Request Count  int        Normalized request count (Request Count / Sample Rate).
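
The relationship between the two measures can be illustrated with a minimal sketch. Only the formula (Request Count / Sample Rate) and the (0, 1] range come from the schema above; the function name, the rounding to an integer and the example values are assumptions made for illustration.

```python
def normalized_request_count(request_count, sample_rate):
    """Estimate the unsampled number of requests from a sampled count."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in the interval (0, 1]")
    # Each sampled request stands for 1 / sample_rate real requests.
    # Rounding to an int is an assumption; the schema only says the field is an int.
    return round(request_count / sample_rate)


# Hypothetical example: 25 sampled requests at a 1% sample rate.
print(normalized_request_count(25, 0.01))  # prints 2500
```

With a sample rate of 1 the two measures coincide, since every request is recorded.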

Loading of the data

The Banner Activity data set is loaded using a collection of Oozie jobs and a Spark-Scala streaming job.

Nearly real-time (streaming) loading

The Spark-Scala streaming job continuously loads banner data into the data set at minutely resolution. It cannot geocode, so it does not populate two fields: "Region" and "Country Matches Geocode". This job is experimental for now, and the Analytics team does not give high priority to fixing it if it fails.

Daily loading

An Oozie job runs every day and loads a full day of data (overwriting the real-time data if present). The daily job has two advantages: better (daily) compaction in the Druid datastore, and the addition of the two geocode fields, "Region" and "Country Matches Geocode". Note that the data is still stored at minutely resolution.
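
The overwrite behaviour described above can be pictured with a toy model. This is a sketch under stated assumptions, not the actual Oozie or Spark code: the in-memory `store`, the function names and the row fields are all hypothetical stand-ins for the Druid data set.

```python
from datetime import date

# Hypothetical in-memory stand-in for the Druid data set: rows grouped by day.
store = {}


def stream_row(day, row):
    """Append a near-real-time row; the two geocode fields stay empty."""
    row = {**row, "region": None, "country_matches_geocode": None}
    store.setdefault(day, []).append(row)


def load_day(day, rows):
    """Daily load: replace everything stored for `day`, streamed rows included."""
    store[day] = list(rows)


day = date(2017, 5, 1)
stream_row(day, {"banner": "B1", "request_count": 3})
load_day(day, [{"banner": "B1", "request_count": 3,
                "region": "R1", "country_matches_geocode": True}])
print(store[day])  # only the geocoded daily row remains
```

The point of the model is that the daily load replaces the day's rows rather than merging with them, so the streamed rows without geocode fields do not survive.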

Monthly loading

Another Oozie job runs every month and reloads a full month of data (overwriting the daily data). The advantage is purely efficiency: when the data is organized in monthly blocks, Druid can compact it better and is more efficient. Note that the data is still stored at minutely resolution. This job does not recompute the data; it uses the already processed daily data stored in Druid format.

Sanitization after 90 days

The last Oozie job repeats what the monthly loading job did, but removes the privacy-sensitive fields: "Region" and "Device". The data resolution remains minutely. This job does not recompute the data either; it uses the already processed monthly data stored in Druid format.
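
The effect of sanitization on individual rows can be sketched as follows. The real job reloads whole blocks of data in Druid; this row-level version is a simplification, and the function name, the row layout and the exact cutoff comparison (strictly older than 90 days) are assumptions.

```python
from datetime import date, timedelta

SENSITIVE_FIELDS = ("region", "device")  # the fields named in the text
RETENTION = timedelta(days=90)


def sanitize(rows, today):
    """Return the rows, dropping sensitive fields from those older than 90 days."""
    cutoff = today - RETENTION
    result = []
    for row in rows:
        if row["day"] < cutoff:
            # Keep every dimension and measure except the sensitive ones.
            row = {k: v for k, v in row.items() if k not in SENSITIVE_FIELDS}
        result.append(row)
    return result


# Hypothetical rows: one older than 90 days, one recent.
rows = [
    {"day": date(2017, 1, 1), "region": "R1", "device": "desktop", "request_count": 3},
    {"day": date(2017, 5, 1), "region": "R2", "device": "ipad", "request_count": 5},
]
clean = sanitize(rows, today=date(2017, 5, 15))
print(clean)
```

In this example the first row loses "region" and "device" but keeps its request count, while the second row is returned unchanged.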

Examples of use

Changes and known problems since 2017-05-01

Date from Date until Task Details
... ... ... ...