Data Platform/Data Lake/Traffic/mediawiki api request
event.mediawiki_api_request
The event.mediawiki_api_request table contains api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki, which logs to the "api-request" debug logging channel. The EventBus extension encodes these log entries as JSON events and sends them to EventGate, which in turn produces them to Kafka. Gobblin imports the JSON data from Kafka, and a 'Refine' job uses the JSONSchema to create a Hive table and import the data hourly.
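As a quick sanity check of the hourly import, a query like the following (a sketch only; the date is a placeholder) counts rows per hourly partition of the refined table:

-- Sketch: confirm data is landing hourly by counting rows per partition
-- for one (placeholder) day. Partition columns appear in the Hive schema below.
SELECT year, month, day, hour, COUNT(*) AS requests
FROM event.mediawiki_api_request
WHERE year = 2023 AND month = 5 AND day = 1
GROUP BY year, month, day, hour
ORDER BY year, month, day, hour;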
Schema
The MediaWiki event JSONSchemas are stored in the schemas/event/primary git repository. The mediawiki/api/request schema can be found here: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/api/request.
As of 2018-05, the Hive schema is:
_schema              string
meta                 struct<uri:string,request_id:string,id:string,dt:string,domain:string,stream:string>
http                 struct<method:string,client_ip:string,request_headers:map<string,string>>
database             string
backend_time_ms      bigint
api_error_codes      array<string>
params               map<string,string>
datacenter           string
year                 bigint
month                bigint
day                  bigint
hour                 bigint

# Partition Information
# col_name            data_type            comment

datacenter           string
year                 bigint
month                bigint
day                  bigint
hour                 bigint
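To illustrate how the nested types are used (struct fields via dot notation, map fields via ['key'] access), here is a sketch query; the partition values and wiki are placeholders:

-- Sketch: Action API request counts on one wiki for one (placeholder) hour,
-- broken down by HTTP method and the 'action' request parameter.
SELECT
    http.method      AS http_method,
    params['action'] AS api_action,
    COUNT(*)         AS requests
FROM event.mediawiki_api_request
WHERE year = 2023 AND month = 5 AND day = 1 AND hour = 0
  AND database = 'enwiki'
GROUP BY http.method, params['action']
ORDER BY requests DESC
LIMIT 25;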
And the JSONSchema is:
title: mediawiki/api/request
description: A single request to the MediaWiki Action API (api.php)
$id: /mediawiki/api/request/0.0.1
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
required:
  - $schema
  - meta
properties:
  $schema:
    type: string
    description: >
      A URI identifying the JSONSchema for this event. This should match an
      schema's $id in a schema repository. E.g. /schema/title/1.0.0
  meta:
    type: object
    required:
      - dt
      - stream
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: Unique URI identifying the event or entity
      request_id:
        type: string
        description: Unique ID of the request that caused the event
      id:
        type: string
        description: Unique ID of this event
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: 'UTC event datetime, in ISO-8601 format'
      domain:
        type: string
        description: Domain the event or entity pertains to
        minLength: 1
      stream:
        type: string
        description: Name of the stream/queue/dataset that this event belongs in
        minLength: 1
  http:
    type: object
    description: Information about the HTTP request that generated this event.
    properties:
      method:
        type: string
        description: 'The HTTP request method (GET, POST, etc.)'
      client_ip:
        type: string
        description: The http client's IP address
      request_headers:
        type: object
        description: Request headers sent by the client.
        additionalProperties:
          type: string
  database:
    type: string
    description: The name of the wiki database this event belongs to.
    minLength: 1
  backend_time_ms:
    type: integer
    description: Elapsed milliseconds to process request as measured by MediaWiki
  api_error_codes:
    type: array
    description: >
      List of error codes for any errors encountered, e.g.
      'apierror-missingparam' If this field is missing or empty, no errors were
      encountered.
    items:
      type: string
  params:
    type: object
    description: >
      Request parameters. These might come from http query string, or from the
      POST body. Sensitive params are redacted.
    additionalProperties:
      type: string
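For concreteness, a fully made-up event conforming to this schema might look like the following. Every value here, including the stream name, is a placeholder rather than real production data, and api_error_codes is omitted because this hypothetical request had no errors:

{
  "$schema": "/mediawiki/api/request/0.0.1",
  "meta": {
    "uri": "https://en.wikipedia.org/w/api.php",
    "request_id": "00000000-0000-0000-0000-000000000000",
    "id": "11111111-1111-1111-1111-111111111111",
    "dt": "2023-05-01T00:00:00Z",
    "domain": "en.wikipedia.org",
    "stream": "mediawiki.api-request"
  },
  "http": {
    "method": "GET",
    "client_ip": "198.51.100.1",
    "request_headers": {
      "user-agent": "ExampleBot/1.0 (https://example.org)"
    }
  },
  "database": "enwiki",
  "backend_time_ms": 42,
  "params": {
    "action": "query",
    "prop": "info",
    "titles": "Main Page"
  }
}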
Rollup tables
The raw data from wmf_raw.ApiAction is used to compute various dimensional rollup tables for easier analysis:
wmf.action_*
bd808.action_*
These tables are no longer maintained pending the resolution of T137321. They were created with the following HiveQL:
-- Create tables for Action API stats
--
-- Usage:
--   hive -f create-action-tables.sql --database wmf

CREATE TABLE IF NOT EXISTS action_ua_hourly (
    userAgent STRING COMMENT 'Raw user-agent',
    wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
    ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
    viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
    year  INT COMMENT 'Unpadded year of request',
    month INT COMMENT 'Unpadded month of request',
    day   INT COMMENT 'Unpadded day of request',
    hour  INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
    action    STRING COMMENT 'Action parameter value',
    wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
    ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
    viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
    year  INT COMMENT 'Unpadded year of request',
    month INT COMMENT 'Unpadded month of request',
    day   INT COMMENT 'Unpadded day of request',
    hour  INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
    action    STRING COMMENT 'Action parameter value',
    param     STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
    value     STRING COMMENT 'Parameter value',
    wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
    ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
    viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
    year  INT COMMENT 'Unpadded year of request',
    month INT COMMENT 'Unpadded month of request',
    day   INT COMMENT 'Unpadded day of request',
    hour  INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params we do not want to count distinct values of
-- at all (eg maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). Rather than trying to make an extensive blacklist and
-- potentially allow new parameters to slip through which have high
-- cardinality or sensitive information, the ETL process will use a whitelist
-- approach to count params that have been deemed to be useful.
--
-- The initial whitelist is (query, prop), (query, list), (query, meta),
-- (flow, module), (*, generator). The prop, list and meta parameters will
-- additionally be split on '|' with each component counted separately.
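The comments above describe the whitelist-based ETL approach. As a hedged sketch (not the actual ETL job), a pipe-separated whitelisted parameter such as prop could be split and counted per component like this, here reading from the refined event table with placeholder partition values:

-- Sketch only: split the 'prop' parameter of action=query requests on '|'
-- and count each component separately. split() takes a Java regex, so the
-- pipe character must be escaped.
SELECT prop_component, COUNT(*) AS requests
FROM event.mediawiki_api_request
LATERAL VIEW explode(split(params['prop'], '\\|')) p AS prop_component
WHERE year = 2023 AND month = 5 AND day = 1 AND hour = 0
  AND params['action'] = 'query'
  AND params['prop'] IS NOT NULL
GROUP BY prop_component
ORDER BY requests DESC;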