Analytics/Data Lake/Traffic/referrer daily

From Wikitech

The table referrer_daily contains pre-aggregated counts of how many Wikipedia pageviews were referred from common search engines on a given day. The data is split by country, language edition, browser family, and OS family. Because this table contains sensitive geographic information, a privacy threshold of 500 is enforced: any set of facets (search engine, country, language, OS family, browser family) that did not refer at least 500 pageviews is represented by other. This retains accurate, complete counts of search engine referrals while reducing privacy risks.
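The thresholding described above can be illustrated with a minimal Python sketch. It is not the production job: the function name apply_privacy_threshold is hypothetical, and it assumes that the search engine is kept while the remaining facets of a below-threshold combination are replaced with other, which is one reading of how complete per-engine counts are retained.

```python
from collections import Counter

THRESHOLD = 500  # privacy threshold stated in the table description


def apply_privacy_threshold(counts, threshold=THRESHOLD):
    """Fold facet combinations below the threshold into an "other" bucket.

    counts maps (search_engine, country, lang, browser_family, os_family)
    to a pageview count. ASSUMPTION: search_engine is preserved and the
    other four facets are replaced by "other"; the real job may differ.
    """
    result = Counter()
    for facets, n in counts.items():
        if n >= threshold:
            result[facets] += n
        else:
            search_engine = facets[0]
            result[(search_engine, "other", "other", "other", "other")] += n
    return result


# Made-up sample counts, for illustration only.
sample = {
    ("Google", "US", "en", "Chrome", "Windows"): 1200,
    ("Google", "IS", "is", "Firefox", "Linux"): 40,
    ("Google", "MT", "mt", "Safari", "iOS"): 15,
    ("Bing", "US", "en", "Edge", "Windows"): 700,
}
thresholded = apply_privacy_threshold(sample)
```

Under this assumption the total per search engine is unchanged by the thresholding; only the geographic and user-agent detail of small groups is lost.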

The data is stored as an Iceberg table in the wmf_traffic database.

Schema

spark-sql (default)> describe wmf_traffic.referrer_daily;
col_name	data_type	comment
country	string	Reader country per IP geolocation
lang	string	Wikipedia language -- e.g., en for English
browser_family	string	Browser family from user-agent
os_family	string	OS family from user-agent
search_engine	string	One of ~20 standard search engines (e.g., Google)
num_referrals	int	Number of pageviews from the referral source
day	date	The date of the request
		
# Partitioning		
Part 0	months(day)

This Iceberg version lets you filter directly on the day column, which is of type DATE. Example: SELECT * FROM wmf_traffic.referrer_daily WHERE day = '2020-05-01'; There is no need to specify partitioning details in your query; they are inferred from any day clauses you include.
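A typical use is to aggregate referrals per search engine over a date range. The sketch below shows such a query and, so that it can be tried without cluster access, runs it against an in-memory SQLite stand-in that mimics the schema above. The helper demo_connection and the sample rows are invented for illustration; on the cluster the same SQL text would be submitted through spark-sql or a similar client instead.

```python
import sqlite3

# Weekly totals per search engine; filtering on day alone is enough.
QUERY = """
SELECT search_engine, SUM(num_referrals) AS total_referrals
FROM wmf_traffic.referrer_daily
WHERE day >= '2020-05-01' AND day < '2020-05-08'
GROUP BY search_engine
ORDER BY total_referrals DESC
"""

# Made-up rows in schema order, for illustration only.
SAMPLE_ROWS = [
    ("US", "en", "Chrome", "Windows", "Google", 1000, "2020-05-01"),
    ("DE", "de", "Firefox", "Linux", "Google", 600, "2020-05-03"),
    ("US", "en", "Edge", "Windows", "Bing", 700, "2020-05-02"),
    ("US", "en", "Chrome", "Windows", "Google", 900, "2020-05-09"),
]


def demo_connection():
    """Build a local SQLite stand-in for wmf_traffic.referrer_daily."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the qualified name resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS wmf_traffic")
    conn.execute(
        "CREATE TABLE wmf_traffic.referrer_daily ("
        "country TEXT, lang TEXT, browser_family TEXT, os_family TEXT, "
        "search_engine TEXT, num_referrals INTEGER, day TEXT)"
    )
    conn.executemany(
        "INSERT INTO wmf_traffic.referrer_daily VALUES (?, ?, ?, ?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    return conn


weekly_totals = demo_connection().execute(QUERY).fetchall()
```

SQLite stores the day values as ISO-formatted text here, so the range comparison behaves like a date comparison only because of that format; in Spark the column is a true DATE.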

Search Engines

As of April 2023, the following search engines are tracked by this dataset: Google, Google Translate, Yahoo, Bing, Yandex, Baidu, DuckDuckGo, Ecosia, Startpage, Naver, Docomo, Qwant, Daum, MyWay, Seznam, AU, Ask, Lilo, Coc Coc, AOL, Rakuten, Brave, Petal, and VK. You can see the regexes that are used for each search engine. Periodically, externally referred traffic is checked to identify any new search engines that should be captured (example task), but the above search engines appear to capture the vast majority of search-engine-based traffic. Changes to the list can be found in the general changes / issues table for the webrequest table from which this data is derived. Note that Wikimedia does not have any data on search queries that come via voice assistants such as Amazon Alexa or Apple Siri, and thus they are not part of this dataset.

Availability

Beyond this table, the data is available in various other places.

Stat machines

Any stat machine with access to Hadoop can access daily TSV dumps of the data at /mnt/hdfs/wmf/data/archive/referrer/daily.
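The dumps can be summarized with the Python standard library alone. The sketch below totals referrals per search engine from one TSV file. The file layout is not documented here, so the column order (taken from the table schema) and the absence of a header row are assumptions to verify against an actual dump; referrals_by_engine is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: columns appear in the same order as the table schema
# and the file has no header row. Check a real dump before relying on this.
COLUMNS = [
    "country", "lang", "browser_family", "os_family",
    "search_engine", "num_referrals", "day",
]


def referrals_by_engine(tsv_file):
    """Sum num_referrals per search engine from an open TSV file object."""
    totals = Counter()
    reader = csv.DictReader(tsv_file, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        totals[row["search_engine"]] += int(row["num_referrals"])
    return totals


# Made-up sample content standing in for one daily dump.
sample_dump = io.StringIO(
    "US\ten\tChrome\tWindows\tGoogle\t1000\t2020-05-01\n"
    "US\ten\tEdge\tWindows\tBing\t700\t2020-05-01\n"
    "other\tother\tother\tother\tGoogle\t55\t2020-05-01\n"
)
engine_totals = referrals_by_engine(sample_dump)
```

For a real file, pass the result of open(path, newline="") for a file under the archive directory in place of the in-memory sample.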

Dashboard

Given the many, orthogonal facets to this data -- e.g., one person may want to aggregate by country while another might want to aggregate by language -- this data is also made available via a prototype public Turnilo instance. See Dashboard main page for more information.

Privacy

For more details and discussion around the privacy risks of this dataset, see task T270140.

See also