Analytics/Data Lake/Traffic/referrer daily
referrer_daily (available in the wmf database on Hive) contains pre-aggregated counts of how many Wikipedia pageviews were referred from common search engines on a given day. The data is split by country, language edition, browser family, and OS family. Because this table contains sensitive geographic content, a privacy threshold of 500 is enforced: any set of facets (search engine, country, language, OS family, browser family) that did not refer at least 500 pageviews is represented by other. This retains accurate, complete counts of search engine referrals while reducing privacy risks.
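As a rough illustration of the threshold described above, the sketch below collapses low-count facet combinations into other while leaving the overall totals unchanged. It is not the production job: the function name, the sample rows, and the choice to keep search_engine while replacing only the remaining facets are all assumptions made for the example.

```python
from collections import defaultdict

THRESHOLD = 500  # privacy threshold stated in the text
FACETS = ("country", "lang", "browser_family", "os_family")


def apply_threshold(rows, threshold=THRESHOLD):
    """Collapse facet combinations below the threshold into 'other'.

    Assumption (not confirmed by the source): search_engine is kept and
    the remaining facets are replaced with 'other'.
    """
    totals = defaultdict(int)
    for row in rows:
        if row["num_referrals"] >= threshold:
            key = (row["search_engine"],) + tuple(row[f] for f in FACETS)
        else:
            key = (row["search_engine"],) + ("other",) * len(FACETS)
        totals[key] += row["num_referrals"]
    return dict(totals)


# Made-up sample rows, for illustration only.
sample = [
    {"search_engine": "Google", "country": "France", "lang": "fr",
     "browser_family": "Chrome", "os_family": "Android", "num_referrals": 1200},
    {"search_engine": "Google", "country": "France", "lang": "br",
     "browser_family": "Firefox", "os_family": "Linux", "num_referrals": 240},
    {"search_engine": "Google", "country": "Iceland", "lang": "is",
     "browser_family": "Safari", "os_family": "iOS", "num_referrals": 310},
]

result = apply_threshold(sample)
for key, count in result.items():
    print(key, count)
```

The two small rows are merged into a single other row, so the per-engine total is the same before and after thresholding.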
hive (default)> DESCRIBE wmf.referrer_daily;
# col_name        data_type  comment
country           string     Reader country per IP geolocation
lang              string     Wikipedia language -- e.g., en for English
browser_family    string     Browser family from user-agent
os_family         string     OS family from user-agent
search_engine     string     One of ~20 standard search engines (e.g., Google)
num_referrals     int        Number of pageviews from the referral source
year              int        Unpadded year of request
month             int        Unpadded month of request
day               int        Unpadded day of request

# Partition Information
# col_name        data_type  comment
year              int        Unpadded year of request
month             int        Unpadded month of request
day               int        Unpadded day of request
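Because the table is partitioned by unpadded year, month, and day, queries should filter on all three as integers. The following is a minimal sketch of a helper that builds such a query; the function name is hypothetical, and how the resulting string is submitted to Hive is left to whatever client you already use.

```python
import datetime


def referrals_by_engine_query(date):
    """Return a HiveQL string totalling referrals per search engine for one day.

    The partition columns are unpadded integers, so date.month == 5 is
    rendered as 5, not 05.
    """
    return (
        "SELECT search_engine, SUM(num_referrals) AS total_referrals\n"
        "FROM wmf.referrer_daily\n"
        f"WHERE year = {date.year} AND month = {date.month} AND day = {date.day}\n"
        "GROUP BY search_engine\n"
        "ORDER BY total_referrals DESC"
    )


query = referrals_by_engine_query(datetime.date(2021, 5, 3))
print(query)
```

Filtering on the partition columns keeps Hive from scanning every day of data.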
As of May 2021, the following search engines are tracked by this dataset: Google, Yahoo, Bing, Yandex, Baidu, DuckDuckGo, Ecosia, Startpage, Naver, Docomo, Qwant, Daum, MyWay, Seznam, AU, Ask, Lilo, Coc Coc, AOL, and Rakuten. You can see the regexes that are used for each search engine here. Periodically, externally referred traffic is checked to identify any new search engines that should be captured (example), but the search engines above appear to capture the vast majority of search-engine-based traffic. Note that Wikimedia does not have any data on search queries that come via voice assistants such as Amazon Alexa or Apple Siri, so they are not part of this dataset.
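To give a sense of how regex-based classification of referrers can work, here is a small sketch. The patterns below are hypothetical and written only for this example; they are not the regexes the dataset actually uses, which are maintained separately.

```python
import re

# Hypothetical patterns for illustration only; the dataset's real regexes differ.
PATTERNS = {
    "Google": re.compile(r"(^|\.)google\."),
    "DuckDuckGo": re.compile(r"(^|\.)duckduckgo\.com$"),
    "Bing": re.compile(r"(^|\.)bing\.com$"),
}


def classify_referrer(host):
    """Return the first search engine whose pattern matches the referrer host, else None."""
    host = host.lower()
    for engine, pattern in PATTERNS.items():
        if pattern.search(host):
            return engine
    return None


for host in ["www.google.co.uk", "duckduckgo.com", "example.org"]:
    print(host, classify_referrer(host))
```

Anchoring each pattern on a domain boundary (start of string or a dot) avoids matching unrelated hosts that merely contain an engine's name.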
Beyond Hive, the data is available in various other places.
Any stat machine with access to Hadoop can access daily TSV dumps of the data at
Given the many orthogonal facets of this data (e.g., one person may want to aggregate by country while another might want to aggregate by language), it is also made available via a prototype public Turnilo instance. See Dashboard main page for more information.
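The same kind of per-facet aggregation can be done offline on a TSV export. The sketch below assumes a headerless file whose columns follow the order of the Hive schema (without the partition columns); that layout, the function name, and the sample rows are assumptions, so check an actual dump before relying on it.

```python
import csv
import io
from collections import Counter

# Assumed column order, mirroring the Hive schema; verify against a real dump.
COLUMNS = ["country", "lang", "browser_family", "os_family",
           "search_engine", "num_referrals"]


def aggregate(tsv_text, facet):
    """Sum num_referrals for each distinct value of the given facet."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    totals = Counter()
    for row in reader:
        totals[row[facet]] += int(row["num_referrals"])
    return totals


# Made-up sample rows, for illustration only.
sample_tsv = (
    "France\tfr\tChrome\tAndroid\tGoogle\t1200\n"
    "France\ten\tFirefox\tWindows\tBing\t650\n"
    "Japan\tja\tSafari\tiOS\tGoogle\t900\n"
)

by_country = aggregate(sample_tsv, "country")
by_lang = aggregate(sample_tsv, "lang")
print(by_country.most_common())
print(by_lang.most_common())
```

Passing a different facet name (for example "search_engine") gives another view of the same rows without changing the code.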
For more details and discussion around the privacy risks of this dataset, see task T270140.
- The code that generates it:
- Related datasets and dashboards:
  - Aggregate search engine traffic: https://discovery.wmflabs.org/external/#traffic_by_engine
  - Browser and OS pageview breakdowns: https://analytics.wikimedia.org/dashboards/browsers/
  - Clickstream dataset: https://dumps.wikimedia.org/other/clickstream/readme.html