Analytics/Data Lake/Traffic/mediawiki api request

From Wikitech

event.mediawiki_api_request

The event.mediawiki_api_request table contains api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki and written to the "api-request" debug logging channel. The EventBus extension encodes messages on this channel as JSON events and sends them to EventGate, which in turn produces them to Kafka. Camus imports the JSON data from Kafka, and a 'Refine' job uses the JSONSchema to create a Hive table and import the data hourly.
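
For orientation, an event in this stream might look like the following. This is a hypothetical example: all field values are made up for illustration, but the field names follow the JSONSchema documented below.

```python
import json

# Hypothetical mediawiki/api/request event; every value here is illustrative only.
event = {
    "$schema": "/mediawiki/api/request/0.0.1",
    "meta": {
        "uri": "https://en.wikipedia.org/w/api.php",
        "request_id": "example-request-id",
        "id": "b0caf18d-6c7f-4403-947d-2712bbe28610",
        "dt": "2018-05-01T00:00:00Z",
        "domain": "en.wikipedia.org",
        "stream": "example-stream-name",
    },
    "http": {
        "method": "GET",
        "client_ip": "203.0.113.42",
        "request_headers": {"user-agent": "ExampleBot/1.0"},
    },
    "database": "enwiki",
    "backend_time_ms": 73,
    "params": {"action": "query", "prop": "info", "titles": "Main Page"},
}

print(json.dumps(event, indent=2))
```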

Schema

The MediaWiki event JSONSchemas are stored in the mediawiki/event-schemas git repository. The mediawiki/api/request schema can be found at https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/.


As of 2018-05, the Hive schema is:

_schema             	string
meta                	struct<uri:string,request_id:string,id:string,dt:string,domain:string,stream:string>
http                	struct<method:string,client_ip:string,request_headers:map<string,string>>
database            	string
backend_time_ms     	bigint
api_error_codes     	array<string>
params              	map<string,string>
datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

# Partition Information
# col_name            	data_type           	comment

datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint
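
Refine writes the data hourly into partitions keyed by datacenter, year, month, day, and hour. As a sketch (not the actual Refine code), the partition values for an event can be derived from its meta.dt timestamp like this:

```python
from datetime import datetime, timezone

def partition_for(dt_iso: str) -> dict:
    """Map an event's ISO8601 meta.dt timestamp to the unpadded
    year/month/day/hour Hive partition values."""
    ts = datetime.strptime(dt_iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}

print(partition_for("2018-05-03T14:07:55Z"))
# → {'year': 2018, 'month': 5, 'day': 3, 'hour': 14}
```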


And the JSONSchema is:

title: mediawiki/api/request
description: A single request to the MediaWiki Action API (api.php)
$id: /mediawiki/api/request/0.0.1
$schema: http://json-schema.org/draft-07/schema#
type: object
properties:
  # global event fields
  $schema:
    type: string
    description: >
      The URI identifying the jsonschema for this event. This may be just
      a short uri containing only the name and revision at the end of the
      URI path.  e.g. /schema_name/12345 is acceptable. This often will
      (and should) match the schema's $id field.
  meta:
    type: object
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: the unique URI identifying the event / resource
      request_id:
        type: string
        description: The unique ID of the request that caused the event.
      id:
        type: string
        pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
        maxLength: 36
        description: the unique ID of this event; should match the dt field
      # -- ApiAction .ts
      dt:
        type: string
        format: date-time
        maxLength: 26
        description: the time stamp of the event, in ISO8601 format
      domain:
        type: string
        description: the domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: The name of the stream/queue that this event belongs in.
        minLength: 1
    required:
      - id
      - dt
      - domain
      - stream

  # http request/response info
  http:
    type: object
    description: Information about the HTTP request that generated this event.
    properties:
      method:
        type: string
        description: The HTTP request method (GET, POST, etc.)
      client_ip:
        type: string
        description: The http client's IP address
      request_headers:
        type: object
        description: Request headers sent by the client.
        additionalProperties:
          type: string

  # event-specific fields

  # -- ApiAction .wiki
  database:
    type: string
    description: The name of the wiki database this event belongs to.
    minLength: 1

  backend_time_ms:
    type: integer
    description: Elapsed milliseconds to process request as measured by MediaWiki

  api_error_codes:
    type: array
    description: >
      List of error codes for any errors encountered, e.g. 'apierror-missingparam'
      If this field is missing or empty, no errors were encountered.
    items:
      type: string

  params:
    type: object
    description: >
      Request parameters. These might come from http query string,
      or from the POST body. Sensitive params are redacted.
    additionalProperties:
      type: string
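
A few of these constraints can be spot-checked without a full JSONSchema validator. The sketch below is illustrative only (not production validation code); it applies the meta.id UUID pattern and the required-field list from the schema above:

```python
import re

# UUID pattern for meta.id, copied from the schema above.
UUID_RE = re.compile(r'^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$')
REQUIRED_META = ("id", "dt", "domain", "stream")

def check_meta(meta: dict) -> list:
    """Return a list of validation problems (empty if meta looks valid)."""
    problems = [f"missing required field: {f}" for f in REQUIRED_META if f not in meta]
    if "id" in meta and not UUID_RE.match(meta["id"]):
        problems.append("meta.id is not a UUID")
    return problems

good = {"id": "b0caf18d-6c7f-4403-947d-2712bbe28610",
        "dt": "2018-05-01T00:00:00Z", "domain": "en.wikipedia.org",
        "stream": "example-stream-name"}
print(check_meta(good))            # → []
print(check_meta({"id": "nope"}))  # missing dt/domain/stream, malformed id
```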

wmf_raw.ApiAction

2018-05: This table is being deprecated in favor of event.mediawiki_api_request

The wmf_raw.ApiAction table contains 'raw' api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki and written to the "ApiAction" debug logging channel. Messages on this channel are encoded as binary Avro and delivered to Kafka. An Oozie job consumes the Avro data from the "mediawiki_ApiAction" topic and stores it in HDFS.

Avro schema

{
  "type": "record",
  "name": "ApiAction",
  "namespace": "org.wikimedia.analytics.schemas",
  "doc": "A single request to the MediaWiki Action API (api.php)",
  "fields": [
    {
      "name": "ts",
      "doc": "The timestamp, in unix time, that the request was made",
      "type": "int",
      "default": 0
    },
    {
      "name": "ip",
      "doc": "The requesting IP address (either ipv4 or ipv6) in string notation",
      "type": "string",
      "default": ""
    },
    {
      "name": "userAgent",
      "doc": "The HTTP User-Agent header prefixed with the Api-user-agent header if present",
      "type": "string",
      "default": ""
    },
    {
      "name": "wiki",
      "doc": "The wiki making this request, such as dewiki or enwiktionary",
      "type": "string",
      "default": ""
    },
    {
      "name": "timeSpentBackend",
      "doc": "Elapsed milliseconds to process request as measured by MediaWiki",
      "type": "int",
      "default": -1
    },
    {
      "name": "hadError",
      "doc": "Were errors encountered when processing this request?",
      "type": "boolean",
      "default": false
    },
    {
      "name": "errorCodes",
      "doc": "List of error codes for any errors encountered",
      "type": {"type": "array", "items": "string"},
      "default": []
    },
    {
      "name": "params",
      "doc": "Request parameters",
      "type": {"type": "map", "values": "string"},
      "default": {}
    }
  ]
}
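
Note that, unlike the ISO8601 meta.dt field in the JSON events, ts here is an integer unix timestamp. For example (a sketch, not part of the pipeline):

```python
from datetime import datetime, timezone

# ts is seconds since the unix epoch, per the Avro "ts" doc string above.
ts = 1525132800
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# → 2018-05-01T00:00:00+00:00
```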

wmf.action_*

The raw data from wmf_raw.ApiAction is used to compute various dimensional rollup tables for easier analysis.

-- Create tables for Action API stats
--
-- Usage:
--     hive -f create-action-tables.sql --database wmf

CREATE TABLE IF NOT EXISTS action_ua_hourly (
  userAgent STRING COMMENT 'Raw user-agent',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
  action    STRING COMMENT 'Action parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
  action    STRING COMMENT 'Action parameter value',
  param     STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value     STRING COMMENT 'Parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params we do not want to count distinct values of
-- at all (eg maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). Rather than trying to make an extensive blacklist and
-- potentially allow new parameters to slip through which have high
-- cardinality or sensitive information, the ETL process will use a whitelist
-- approach to count params that have been deemed to be useful.
--
-- The initial whitelist is (query, prop), (query, list), (query, meta),
-- (flow, module), (*, generator). The prop, list and meta parameters will
-- additionally be split on '|' with each component counted separately.
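
As a sketch of the whitelist approach described above (illustrative only, not the actual ETL code), pipe-delimited values of whitelisted params are split so that each component is counted separately:

```python
from collections import Counter

# (action, param) pairs deemed useful, per the note above; '*' matches any action.
WHITELIST = {("query", "prop"), ("query", "list"), ("query", "meta"),
             ("flow", "module"), ("*", "generator")}
SPLIT_PARAMS = {"prop", "list", "meta"}  # multi-valued, '|'-delimited

def count_params(requests):
    """Count whitelisted (action, param, value) triples across requests."""
    counts = Counter()
    for params in requests:
        action = params.get("action", "")
        for name, value in params.items():
            if (action, name) not in WHITELIST and ("*", name) not in WHITELIST:
                continue  # not whitelisted: don't count its values at all
            values = value.split("|") if name in SPLIT_PARAMS else [value]
            for v in values:
                counts[(action, name, v)] += 1
    return counts

reqs = [{"action": "query", "prop": "info|revisions", "titles": "Main Page"},
        {"action": "query", "prop": "info"}]
print(count_params(reqs))
```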