
Webrequest Tagging

We are in the process of adding a "tags" column to webrequest. The column is an array that can hold values such as "portal", "wikidata", and "pageview". The pageview refinement process will be extended with a tagging step, in which some requests (pageviews or not) are marked with one or more tags.
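
The tagging step can be pictured as a set of small tagger functions applied to each request. The Python sketch below is only an illustration, not the refinery implementation (which is written in Java). The request fields, the host and path rules, and the names portal_tagger, wdqs_tagger and tag_request are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical request shape: a dict with "uri_host" and "uri_path" keys.
Request = Dict[str, str]
# A tagger looks at one request and returns zero or more tags for it.
Tagger = Callable[[Request], List[str]]


def portal_tagger(request: Request) -> List[str]:
    # Assumed rule for illustration: portal traffic is identified by its host.
    if request.get("uri_host") == "www.wikipedia.org":
        return ["portal"]
    return []


def wdqs_tagger(request: Request) -> List[str]:
    # Assumed rule for illustration: query service traffic is identified by
    # its host, and SPARQL requests additionally by their path.
    if request.get("uri_host") != "query.wikidata.org":
        return []
    tags = ["wikidata-query"]
    if request.get("uri_path", "").startswith("/sparql"):
        tags.append("sparql")
    return tags


def tag_request(request: Request, taggers: List[Tagger]) -> List[str]:
    """Run every tagger on the request and collect the tags without duplicates."""
    tags: List[str] = []
    for tagger in taggers:
        for tag in tagger(request):
            if tag not in tags:
                tags.append(tag)
    return tags


# Hypothetical tagger chain; a real chain would hold many more taggers.
TAGGERS: List[Tagger] = [portal_tagger, wdqs_tagger]
```

In this sketch a request that no tagger matches ends up with an empty list, and a single request can carry several tags at once.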

Once the tagging phase is complete, a second process will read the tags column. A small number of tags will be used to split the webrequest dataset into smaller datasets using Hive dynamic partitioning. Many of our regular data-generation jobs read every record in webrequest when they actually need only a portion of it. Splitting the data into pre-filtered datasets will let those jobs read only the pertinent data.

Not all tags will be used for partitioning, only a small subset. Other tags might be short-lived and used to select records from the webrequest table more efficiently.
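
The difference between partition tags and other tags can be sketched as follows. This is a Python illustration of the idea only, not the Hive dynamic partitioning job itself. The set PARTITION_TAGS, the record layout and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set

# Hypothetical record shape: {"uri_host": str, "tags": List[str]}.
Record = Dict[str, Any]

# Assumed choice of tags that each get their own pre-filtered dataset.
PARTITION_TAGS: Set[str] = {"portal", "wikidata-query"}


def split_by_partition_tags(
    records: List[Record], partition_tags: Set[str]
) -> Dict[str, List[Record]]:
    """Group records into one dataset per partition tag they carry."""
    datasets: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        for tag in record["tags"]:
            if tag in partition_tags:
                datasets[tag].append(record)
    return dict(datasets)


def select_by_tag(records: List[Record], tag: str) -> List[Record]:
    """Select records carrying a tag, whether or not it is a partition tag.

    Comparable in spirit to filtering with Hive's array_contains(tags, tag).
    """
    return [record for record in records if tag in record["tags"]]
```

Under these assumptions a record carrying two partition tags would appear in both resulting datasets, while a tag such as "sparql" that is not in PARTITION_TAGS can still be used to filter records without producing a dataset of its own.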

Usage of the tags column in SQL

The tags column is an array<string>, a Hive complex type. A query that selects the tags can look like:

SELECT tags FROM webrequest WHERE year=2017 AND month=9 AND day=9 AND hour=9 LIMIT 5;

This might return something like:

["wikidata-query","sparql"]
["portal"]
["wikidata-query","sparql"]
[]
[]

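Individual elements can be selected by index:

SELECT tags[0] FROM webrequest WHERE year=2017 AND month=9 AND day=9 AND hour=9 LIMIT 5;

This returns the first element of each array. To filter on a tag rather than select it, Hive's array_contains function can be used in the WHERE clause, for example array_contains(tags, 'portal').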

Code example

For an example of how a tagger is implemented, see the one that tags WDQS (Wikidata Query Service) requests: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/WDQSTagger.java