Analytics/Data Lake/Traffic/Pageview actor

The wmf.pageview_actor table (available on Hive) contains webrequest data filtered to keep only pageviews and redirects to pageviews. It keeps most dimensions from webrequest, has an updated agent_type value flagging traffic estimated to be automated, and offers the actor_signature field to facilitate in-project session fingerprinting. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). As with webrequest, the data is deleted after 90 days.
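
For instance, actor_signature can be used directly for approximate distinct-actor counts per project. The query below is only a minimal sketch (the partition values and the LIMIT are arbitrary); note that pageview_info is populated only when is_pageview = TRUE.

SELECT
  pageview_info['project'] AS project,
  count(DISTINCT actor_signature) AS distinct_actors
FROM
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND hour = 0
  AND is_pageview
GROUP BY
  pageview_info['project']
ORDER BY distinct_actors DESC
LIMIT 10;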

This intermediate dataset is meant to be used as a replacement for webrequest when queries filter for pageviews. It is about ten times smaller than webrequest for the same time period, and therefore much faster to query. For instance, the production jobs generating pageview_hourly, unique-devices and clickstream take advantage of this table.
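
As a sketch of that replacement (partition values chosen arbitrarily), a count of user pageviews for a single hour only reads pre-filtered pageview rows when run against pageview_actor, whereas the equivalent wmf.webrequest query would have to scan all request types before filtering on is_pageview:

SELECT
  count(1) AS user_pageviews
FROM
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND hour = 0
  AND is_pageview
  AND agent_type = 'user';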

Note: This table doesn't aggregate rows as pageview_hourly does; it only filters.

Current Schema

$ hive --database wmf

hive (wmf)> describe pageview_actor;
OK
col_name	data_type	comment
hostname            	string              	Source node hostname
sequence            	bigint              	Per host sequence number
dt                  	string              	Timestamp at cache in ISO 8601
time_firstbyte      	double              	Time to first byte
ip                  	string              	IP of packet at cache
http_status         	string              	HTTP status of response
response_size       	bigint              	Response size
uri_host            	string              	Host of request
uri_path            	string              	Path of request
uri_query           	string              	Query of request
content_type        	string              	Content-Type header of response
referer             	string              	Referer header of request
user_agent          	string              	User-Agent header of request
accept_language     	string              	Accept-Language header of request
is_pageview         	boolean             	Indicates if this record was marked as a pageview during refinement
is_redirect_to_pageview	boolean             	Indicates if this record was marked as a redirect to a pageview during extraction (needed for unique-devices-per-project-family)
geocoded_data       	map<string,string>  	Geocoded map with continent, country_code, country, city, subdivision, postal_code, latitude, longitude, timezone keys and associated values.
user_agent_map      	map<string,string>  	User-agent map with browser_family, browser_major, device_family, os_family, os_major, os_minor and wmf_app_version keys and associated values
x_analytics_map     	map<string,string>  	X_analytics map view of the x_analytics field
ts                  	timestamp           	Unix timestamp in milliseconds extracted from dt
access_method       	string              	Method used to access the site (mobile app|mobile web|desktop)
agent_type          	string              	Categorise the agent making the webrequest as either user, spider or automated
referer_class       	string              	Indicates if a referer is internal, external or unknown.
normalized_host     	struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string>	struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m) and tld (org most often)
pageview_info       	map<string,string>  	map containing project, language_variant and page_title values only when is_pageview = TRUE.
page_id             	bigint              	MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
namespace_id        	int                 	MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview.
actor_signature     	string              	The actor signature for the record using domain, computed as a hash
actor_signature_per_project_family	string              	The actor signature for the record using project-family, computed as a hash
year                	int                 	Unpadded year of pageviews
month               	int                 	Unpadded month of pageviews
day                 	int                 	Unpadded day of pageviews
hour                	int                 	Unpadded hour of pageviews

# Partition Information
# col_name            	data_type           	comment

year                	int                 	Unpadded year of pageviews
month               	int                 	Unpadded month of pageviews
day                 	int                 	Unpadded day of pageviews
hour                	int                 	Unpadded hour of pageviews

Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly folders in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before returning results. For example, if you are only interested in data for a particular day, you could add where year = 2014 and month = 1 and day = 12. This instructs Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
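
As a minimal sketch, the query below satisfies the partition-predicate requirement with just year > 0, and therefore still scans all available data; more targeted examples follow in the next section.

SELECT
  count(1) AS pageviews
FROM
  wmf.pageview_actor
WHERE
  year > 0
  AND is_pageview;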

Sample queries

SELECT
  concat(month,'-',day,'-',year), agent_type, count(1)
FROM 
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND is_pageview
GROUP BY
  year, month, day, agent_type;
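
Map and struct columns are addressed with bracket and dot syntax. The following query is a sketch (the partition values, access_method filter and LIMIT are arbitrary) showing access to geocoded_data, user_agent_map and normalized_host:

SELECT
  geocoded_data['country'] AS country,
  user_agent_map['browser_family'] AS browser_family,
  normalized_host.project_family AS project_family,
  count(1) AS pageviews
FROM
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND hour = 12
  AND is_pageview
  AND access_method = 'mobile web'
GROUP BY
  geocoded_data['country'],
  user_agent_map['browser_family'],
  normalized_host.project_family
ORDER BY pageviews DESC
LIMIT 20;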

Changes and known problems since 2020-06-25

See also Analytics/Data_Lake/Traffic/Webrequest#Changes and known problems since 2015-03-04 for issues and updates affecting all webrequests (including non-pageviews).
Date from     	Task        	Details
2020-06-01[1] 	task T225467	Create the table and start filtering data.

See also

Notes

  1. Action taken on the 25th of June, but data has been backfilled from the beginning of the month