Data Platform/Data Lake/Traffic/Pageview actor

From Wikitech

The wmf.pageview_actor table (available in Hive) contains webrequest data filtered to keep only pageviews and redirects to pageviews. It keeps most dimensions from webrequest, has an updated agent_type value that flags traffic estimated to be automated, and offers the actor_signature field, which facilitates in-project session fingerprinting. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). As with webrequest, the data is deleted after 90 days.

This intermediate dataset is meant to replace webrequest in queries that filter for pageviews. It is about ten times smaller than webrequest for the same time period and is therefore much faster to query. For instance, the production jobs that generate pageview_hourly, unique-devices, and clickstream use this table.

Note: Unlike pageview_hourly, this table does not aggregate rows; it only filters them.
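Because rows are not pre-aggregated, any counts have to be computed by the query itself. The following is a minimal sketch of an hourly rollup for one day. It uses only the partition columns and the agent_type and is_pageview fields mentioned on this page; the date is an arbitrary example, and the output column name view_count is illustrative.

```sql
-- Sketch: roll up unaggregated rows into hourly counts per agent_type.
-- The date below is an arbitrary example; adjust it to a day within the
-- 90-day retention window.
SELECT
  hour,
  agent_type,
  COUNT(1) AS view_count   -- one input row per request, so count rows
FROM
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND is_pageview          -- drop redirects to pageviews
GROUP BY
  hour, agent_type
ORDER BY
  hour, agent_type;
```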

Current schema

To view the schema and field-level documentation, see the DataHub entry.

Sample queries

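SELECT
  concat(month, '-', day, '-', year) AS view_date,  -- formatted as month-day-year
  agent_type,
  count(1) AS view_count
FROM
  wmf.pageview_actor
WHERE
  -- Restrict the scan to a single day of partitions
  year = 2020
  AND month = 6
  AND day = 25
  -- Keep pageviews only (excludes redirects to pageviews)
  AND is_pageview
GROUP BY
  year, month, day, agent_type;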
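As a further illustration, the actor_signature field can be used to compare the number of distinct actors with the number of pageviews. This is a sketch under assumptions: the hour value is arbitrary, the output column names are illustrative, and actor_signature is treated here as an approximate fingerprint, not as a unique user identifier.

```sql
-- Sketch: distinct actors versus pageviews per agent_type for one hour.
-- Filtering on all four partition columns keeps the scan small.
SELECT
  agent_type,
  COUNT(DISTINCT actor_signature) AS distinct_actors,
  COUNT(1) AS pageviews
FROM
  wmf.pageview_actor
WHERE
  year = 2020
  AND month = 6
  AND day = 25
  AND hour = 12            -- arbitrary example hour
  AND is_pageview
GROUP BY
  agent_type;
```

A high ratio of pageviews to distinct actors for a given agent_type suggests that a small number of actors generate much of that traffic.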

Changes and known problems since 2020-06-25

See also Analytics/Data_Lake/Traffic/Webrequest#Changes and known problems since 2015-03-04 for issues and updates affecting all webrequests (including non-pageviews).

Date from | Task | Details
2020-06-01[1] | task T225467 | Create the table and start filtering data.

Notes

  1. The action was taken on 25 June 2020, but the data was backfilled from the beginning of the month.