We are adding a "tags" column to webrequest. The column is an array that can hold values such as "portal", "wikidata", and "pageview". The pageview refinement process will be enhanced with a tagging step, in which some requests (pageviews or not) will be marked with one or more tags.
Once the tagging phase is complete, a second process will read the tags column. A small number of tags will be used to split the webrequest dataset into smaller datasets using Hive dynamic partitioning. Many of our regular data-generation jobs read every record in webrequest when they actually need only a portion of it. Splitting the data into pre-filtered datasets will make these jobs more efficient, since they will be able to read only the pertinent data.
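As a rough illustration of the splitting step, the sketch below writes one hour of tagged requests into a table partitioned by tag. It is only a sketch under stated assumptions, not the actual refinery job. The target table name webrequest_split, the selected columns, and the choice of "portal" and "wikidata-query" as partitioning tags are all hypothetical.

```sql
-- Hypothetical target table: webrequest_split, assumed to be
-- partitioned by (year, month, day, hour, tag).
SET hive.exec.dynamic.partition = true;

-- The time partitions are static; the tag partition is dynamic,
-- so Hive creates one partition per distinct tag value it sees.
INSERT OVERWRITE TABLE webrequest_split
PARTITION (year = 2017, month = 9, day = 9, hour = 9, tag)
SELECT
    uri_host,     -- assumed column names, for illustration only
    uri_path,
    uri_query,
    t.tag AS tag  -- the dynamic partition column must come last
FROM webrequest
LATERAL VIEW explode(tags) t AS tag
WHERE year = 2017 AND month = 9 AND day = 9 AND hour = 9
  AND t.tag IN ('portal', 'wikidata-query');  -- hypothetical partitioning tags
```

Because explode() emits one row per tag, a request carrying several partitioning tags would be written to several partitions in this sketch.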
Not all tags will be used for partitioning, only a small subset. Other tags might be short-lived and used simply to select records from the webrequest table more efficiently.
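For tags that are not used for partitioning, records can still be selected by tag. A minimal sketch, assuming the column is named tags and holds an array of strings, uses Hive's built-in array_contains function. The selected columns are assumptions made for illustration.

```sql
-- Select requests from one hour that carry the "portal" tag.
SELECT uri_host, uri_path   -- assumed column names
FROM webrequest
WHERE year = 2017 AND month = 9 AND day = 9 AND hour = 9
  AND array_contains(tags, 'portal')
LIMIT 5;
```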
The tags column is an array<string>, a Hive complex type. A query to retrieve the tags can look like:
SELECT tags FROM webrequest WHERE year=2017 AND month=9 AND day=9 AND hour=9 LIMIT 5;
This might return something like:
["wikidata-query","sparql"]
["portal"]
["wikidata-query","sparql"]
To get a single element, index into the array: selecting tags[0] instead of tags returns the first element of each array.
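The queries below sketch two common ways of reading an array column in Hive: indexing a single element, and flattening the array so that each tag becomes its own row. They assume the same webrequest table and partition values used in the example above. The aliases first_tag and requests are illustrative names.

```sql
-- First tag of each request (Hive arrays are zero-indexed).
SELECT tags[0] AS first_tag
FROM webrequest
WHERE year = 2017 AND month = 9 AND day = 9 AND hour = 9
LIMIT 5;

-- One row per (request, tag) pair, then a count of requests per tag.
SELECT t.tag, COUNT(*) AS requests
FROM webrequest
LATERAL VIEW explode(tags) t AS tag
WHERE year = 2017 AND month = 9 AND day = 9 AND hour = 9
GROUP BY t.tag;
```

Note that LATERAL VIEW explode() without the OUTER keyword skips requests whose tags array is empty or NULL, so untagged requests do not appear in the second result.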
For an example of how a tagger is implemented, see the one that tags WDQS (Wikidata Query Service) requests: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/WDQSTagger.java