Analytics/Data Lake/Traffic/Browser general

From Wikitech
Jump to navigation Jump to search

wmf.browser_general is an intermediate table that makes it easier and faster to get pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major, etc. In addition to ad-hoc queries, it is also used to generate the Browser Reports.

Current Schema

$ hive --database wmf

hive (wmf)> describe browser_general;
OK
col_name  data_type comment
access_method            string                   (desktop|mobile web|mobile app)
os_family                string                   OS family: Windows, Android, etc.
os_major                 string                   OS major version: 8, 10, etc.
browser_family           string                   Browser family: Chrome, Safari, etc.
browser_major            string                   Browser major version: 47, 11, etc.
view_count               bigint                   Number of pageviews.
year                     int                      Unpadded year of request.
month                    int                      Unpadded month of request.
day                      int                      Unpadded day of request.
           
# Partition Information        
# col_name               data_type                comment             
           
year                     int                      Unpadded year of request.
month                    int                      Unpadded month of request.
day                      int                      Unpadded day of request.

Notice the year, month and day fields. These are Hive partitions, and are explicit mappings to daily files in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns your results. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.

Loading of anonymized data

wmf.browser_general is loaded through this oozie job. It is a basic query that extracts data from wmf.pageview_hourly and wmf.projectview_hourly, except for the fact that it applies a threshold-based anonymization to it. For privacy and readability reasons, all the rows that initially represent less than K% of the total view count are grouped into a single row where all dimension columns (os_family, os_major, browser_family, browser_major) are assigned an unknown value (see job config for exact K and unknown values).

Example query

SELECT
   browser_family,
   browser_major,
   SUM(view_count) as view_count
FROM wmf.browser_general
WHERE
   access_method IN ('desktop', 'mobile web') AND
   CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2016-01-01' AND
   CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '2016-02-01'
GROUP BY
   browser_family,
   browser_major
ORDER BY view_count DESC
LIMIT 1000;

Changes and known problems since 2016-03-21

Date from Date until Task Details
beginning now Task T148461 Inconsistent data in #all-sites-by-os-and-browser for IE7
... ... ... ...