Analytics/Data Lake/Traffic/mediawiki api request

From Wikitech

event.mediawiki_api_request

The event.mediawiki_api_request table contains api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki and written to the "api-request" debug logging channel. The EventBus extension encodes messages on this channel as JSON events and sends them to EventGate, which in turn produces them to Kafka. Camus imports the JSON data from Kafka, and a 'Refine' job uses the JSONSchema to create a Hive table and import the data hourly.
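
For orientation, an event in this stream might look like the following. This is a hypothetical example: all field values are made up for illustration, but the field names follow the JSONSchema documented below.

```python
import json

# Hypothetical mediawiki/api/request event; every value here is illustrative only.
event = {
    "$schema": "/mediawiki/api/request/0.0.1",
    "meta": {
        "uri": "https://en.wikipedia.org/w/api.php",
        "request_id": "example-request-id",
        "id": "b0caf18d-6c7f-4403-947d-2712bbe28610",
        "dt": "2018-05-01T00:00:00Z",
        "domain": "en.wikipedia.org",
        "stream": "example-stream-name",
    },
    "http": {
        "method": "GET",
        "client_ip": "203.0.113.42",
        "request_headers": {"user-agent": "ExampleBot/1.0"},
    },
    "database": "enwiki",
    "backend_time_ms": 73,
    "params": {"action": "query", "prop": "info", "titles": "Main Page"},
}

print(json.dumps(event, indent=2))
```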

Schema

The MediaWiki event JSONSchemas are stored in the mediawiki/event-schemas git repository. The mediawiki/api/request schema can be found at https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/.


As of 2018-05, the Hive schema is:

_schema             	string
meta                	struct<uri:string,request_id:string,id:string,dt:string,domain:string,stream:string>
http                	struct<method:string,client_ip:string,request_headers:map<string,string>>
database            	string
backend_time_ms     	bigint
api_error_codes     	array<string>
params              	map<string,string>
datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

# Partition Information
# col_name            	data_type           	comment

datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint
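
Refine writes the data hourly into partitions keyed by datacenter, year, month, day, and hour. As a sketch (not the actual Refine code), the partition values for an event can be derived from its meta.dt timestamp like this:

```python
from datetime import datetime, timezone

def partition_for(dt_iso: str) -> dict:
    """Map an event's ISO8601 meta.dt timestamp to the unpadded
    year/month/day/hour Hive partition values."""
    ts = datetime.strptime(dt_iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}

print(partition_for("2018-05-03T14:07:55Z"))
# → {'year': 2018, 'month': 5, 'day': 3, 'hour': 14}
```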


And the JSONSchema is:

title: mediawiki/api/request
description: A single request to the MediaWiki Action API (api.php)
$id: /mediawiki/api/request/0.0.1
$schema: http://json-schema.org/draft-07/schema#
type: object
properties:
  # global event fields
  $schema:
    type: string
    description: >
      The URI identifying the jsonschema for this event. This may be just
      a short uri containing only the name and revision at the end of the
      URI path.  e.g. /schema_name/12345 is acceptable. This often will
      (and should) match the schema's $id field.
  meta:
    type: object
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: the unique URI identifying the event / resource
      request_id:
        type: string
        description: The unique ID of the request that caused the event.
      id:
        type: string
        pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
        maxLength: 36
        description: the unique ID of this event; should match the dt field
      # -- ApiAction .ts
      dt:
        type: string
        format: date-time
        maxLength: 26
        description: the time stamp of the event, in ISO8601 format
      domain:
        type: string
        description: the domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: The name of the stream/queue that this event belongs in.
        minLength: 1
    required:
      - id
      - dt
      - domain
      - stream

  # http request/response info
  http:
    type: object
    description: Information about the HTTP request that generated this event.
    properties:
      method:
        type: string
        description: The HTTP request method (GET, POST, etc.)
      client_ip:
        type: string
        description: The http client's IP address
      request_headers:
        type: object
        description: Request headers sent by the client.
        additionalProperties:
          type: string

  # event-specific fields

  # -- ApiAction .wiki
  database:
    type: string
    description: The name of the wiki database this event belongs to.
    minLength: 1

  backend_time_ms:
    type: integer
    description: Elapsed milliseconds to process request as measured by MediaWiki

  api_error_codes:
    type: array
    description: >
      List of error codes for any errors encountered, e.g. 'apierror-missingparam'
      If this field is missing or empty, no errors were encountered.
    items:
      type: string

  params:
    type: object
    description: >
      Request parameters. These might come from http query string,
      or from the POST body. Sensitive params are redacted.
    additionalProperties:
      type: string
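
A few of these constraints can be spot-checked without a full JSONSchema validator. The sketch below is illustrative only (not production validation code); it applies the meta.id UUID pattern and the required-field list from the schema above:

```python
import re

# UUID pattern for meta.id, copied from the schema above.
UUID_RE = re.compile(r'^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$')
REQUIRED_META = ("id", "dt", "domain", "stream")

def check_meta(meta: dict) -> list:
    """Return a list of validation problems (empty if meta looks valid)."""
    problems = [f"missing required field: {f}" for f in REQUIRED_META if f not in meta]
    if "id" in meta and not UUID_RE.match(meta["id"]):
        problems.append("meta.id is not a UUID")
    return problems

good = {"id": "b0caf18d-6c7f-4403-947d-2712bbe28610",
        "dt": "2018-05-01T00:00:00Z", "domain": "en.wikipedia.org",
        "stream": "example-stream-name"}
print(check_meta(good))            # → []
print(check_meta({"id": "nope"}))  # missing dt/domain/stream, malformed id
```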

wmf_raw.ApiAction

2018-05: This table is being deprecated in favor of event.mediawiki_api_request

The wmf_raw.ApiAction table contains 'raw' api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki and written to the "ApiAction" debug logging channel. Messages on this channel are encoded as binary Avro and delivered to Kafka. An Oozie job consumes the Avro data from the "mediawiki_ApiAction" topic and stores it in HDFS.

Avro schema

{
  "type": "record",
  "name": "ApiAction",
  "namespace": "org.wikimedia.analytics.schemas",
  "doc": "A single request to the MediaWiki Action API (api.php)",
  "fields": [
    {
      "name": "ts",
      "doc": "The timestamp, in unix time, that the request was made",
      "type": "int",
      "default": 0
    },
    {
      "name": "ip",
      "doc": "The requesting IP address (either ipv4 or ipv6) in string notation",
      "type": "string",
      "default": ""
    },
    {
      "name": "userAgent",
      "doc": "The HTTP User-Agent header prefixed with the Api-user-agent header if present",
      "type": "string",
      "default": ""
    },
    {
      "name": "wiki",
      "doc": "The wiki making this request, such as dewiki or enwiktionary",
      "type": "string",
      "default": ""
    },
    {
      "name": "timeSpentBackend",
      "doc": "Elapsed milliseconds to process request as measured by MediaWiki",
      "type": "int",
      "default": -1
    },
    {
      "name": "hadError",
      "doc": "Were errors encountered when processing this request?",
      "type": "boolean",
      "default": false
    },
    {
      "name": "errorCodes",
      "doc": "List of error codes for any errors encountered",
      "type": {"type": "array", "items": "string"},
      "default": []
    },
    {
      "name": "params",
      "doc": "Request parameters",
      "type": {"type": "map", "values": "string"},
      "default": {}
    }
  ]
}
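
Note that, unlike the ISO8601 meta.dt field in the JSON events, ts here is an integer unix timestamp. For example (a sketch, not part of the pipeline):

```python
from datetime import datetime, timezone

# ts is seconds since the unix epoch, per the Avro "ts" doc string above.
ts = 1525132800
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# → 2018-05-01T00:00:00+00:00
```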

wmf.action_*

The raw data from wmf_raw.ApiAction is used to compute various dimensional rollup tables for easier analysis.

-- Create tables for Action API stats
--
-- Usage:
--     hive -f create-action-tables.sql --database wmf

CREATE TABLE IF NOT EXISTS action_ua_hourly (
  userAgent STRING COMMENT 'Raw user-agent',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
  action    STRING COMMENT 'Action parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
  action    STRING COMMENT 'Action parameter value',
  param     STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value     STRING COMMENT 'Parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params we do not want to count distinct values of
-- at all (eg maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). Rather than trying to make an extensive blacklist and
-- potentially allow new parameters to slip through which have high
-- cardinality or sensitive information, the ETL process will use a whitelist
-- approach to count params that have been deemed to be useful.
--
-- The initial whitelist is (query, prop), (query, list), (query, meta),
-- (flow, module), (*, generator). The prop, list and meta parameters will
-- additionally be split on '|' with each component counted separately.
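
As a sketch of the whitelist approach described above (illustrative only, not the actual ETL code), pipe-delimited values of whitelisted params are split so that each component is counted separately:

```python
from collections import Counter

# (action, param) pairs deemed useful, per the note above; '*' matches any action.
WHITELIST = {("query", "prop"), ("query", "list"), ("query", "meta"),
             ("flow", "module"), ("*", "generator")}
SPLIT_PARAMS = {"prop", "list", "meta"}  # multi-valued, '|'-delimited

def count_params(requests):
    """Count whitelisted (action, param, value) triples across requests."""
    counts = Counter()
    for params in requests:
        action = params.get("action", "")
        for name, value in params.items():
            if (action, name) not in WHITELIST and ("*", name) not in WHITELIST:
                continue  # not whitelisted: don't count its values at all
            values = value.split("|") if name in SPLIT_PARAMS else [value]
            for v in values:
                counts[(action, name, v)] += 1
    return counts

reqs = [{"action": "query", "prop": "info|revisions", "titles": "Main Page"},
        {"action": "query", "prop": "info"}]
print(count_params(reqs))
```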