Analytics/Systems/EventLogging


EventLogging (EL for short) is a platform for modelling, logging, and processing arbitrary analytic data. It consists of an event-processing back-end, implemented in Python, which aggregates events, validates them for compliance with pre-declared data models, and streams them to clients. There is also a MediaWiki extension that provides JavaScript and PHP APIs for logging events (Extension:EventLogging). This documentation is about the Wikimedia Foundation installation of EventLogging. To learn more about the MediaWiki extension, refer to https://www.mediawiki.org/wiki/Extension:EventLogging.

EventLogging architecture

For users

Schemas

Here's the list of existing schemas. Note that not all of them are active: some are still in development (not active yet), and others may be obsolete and listed only for historical reference.

https://meta.wikimedia.org/wiki/Research:Schemas

The schema's discussion page is the place to comment on the schema design and related topics. It contains a template that specifies: the schema maintainer, the team and project the schema belongs to, its status (active, inactive, in development), and its purging strategy.

Creating a schema

There's thorough documentation on designing and creating a new schema here:

https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Creating_a_schema

There are also some special guidelines for creating a schema that Druid can ingest easily: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines

Please don't forget to fill in the schema's talk page with the template: https://meta.wikimedia.org/wiki/Template:SchemaDoc. Note that for new schemas the default purging strategy is to automatically purge all events older than 90 days (see the Data retention and purging section).
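For illustration, here is a rough sketch of what a schema body looks like. EventLogging schemas are JSON Schema documents in which each property declares its type and whether it is required; the field names below ("action", "pageId") are invented for this example, not taken from a real schema.

```python
import json

# Hypothetical EventLogging schema body, for illustration only. Real
# schemas live on Meta-Wiki under the Schema: namespace and are versioned
# by revision id; the fields here are invented.
example_schema = {
    "description": "Logs a hypothetical user action.",
    "properties": {
        "action": {
            "type": "string",
            "required": True,
            "enum": ["open", "close"],
            "description": "Which action the user took.",
        },
        "pageId": {
            "type": "integer",
            "required": False,
            "description": "Page the action happened on, if any.",
        },
    },
}

print(json.dumps(example_schema, indent=2))
```

Keeping field names short matters here, since the serialized event counts against the size limit discussed further down.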

Send events

See Extension:EventLogging/Programming for how to instrument your MediaWiki code.

Verify received events

Validation errors are visible in the application logs located at

/var/log/upstart

In production, they also end up in the Kafka topic

eventlogging_EventError

Validation is handled by the processor, so, for example,

eventlogging_processor-client-side-<some>.log 

will contain an error like the following when an event is invalid:

 Unable to validate: ?{"event":  {"pagename":"Recentchanges","namespace":null,"invert":false,"associated":false,"hideminor":false,"hidebots":true,"hideanons":false,"hideliu":false,"hidepatrolled":false,"hidemyself":false,"hidecategorization":true,"tagfilter":null},"schema":"ChangesListFilters","revision":15876023,"clientValidated":false,"wiki":"nowikimedia","webHost":"no.wikimedia.org","userAgent":"Apple-PubSub/65.28"}; cp1066.eqiad.wmnet 42402900 2016-09-26T07:01:42 -

 

This happens when client code has a bug and is sending events that are not valid according to the schema. We normally try to identify the schema at fault and pass that information back to the developers so they can fix it. See this ticket for how we deal with these errors: https://phabricator.wikimedia.org/T146674
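As a rough illustration of what the validation step does, here is a simplified sketch (not the actual eventlogging codebase; the schema and helper are invented stand-ins): the processor decodes the urlencoded JSON payload from the /beacon request and checks the declared fields.

```python
import json
from urllib.parse import unquote

# Simplified stand-in for a schema: one required string field.
SCHEMA = {"properties": {"action": {"type": "string", "required": True}}}

def validate(raw_query_string):
    """Decode an urlencoded JSON capsule and check required fields.

    Returns (capsule, error); truncated payloads fail at the parse step,
    which is why over-long entries show up as validation errors.
    """
    try:
        capsule = json.loads(unquote(raw_query_string))
    except ValueError as e:
        return None, "Unable to parse: %s" % e
    event = capsule.get("event", {})
    for name, spec in SCHEMA["properties"].items():
        if spec.get("required") and name not in event:
            return None, "Missing required field: %s" % name
    return capsule, None

# A well-formed, urlencoded capsule passes:
ok, err = validate('%7B%22event%22%3A%20%7B%22action%22%3A%20%22open%22%7D%7D')
assert err is None
```

A payload cut off mid-string (as happens with truncated entries) fails at `json.loads`, which matches the "Unable to validate" errors shown above.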

Validation error logs are also visible in the eventlogging-errors Logstash dashboard for up to 30 days. Access to Logstash requires an LDAP account with membership in a user group indicating that the user has signed an NDA.

Data retention and purging

Starting in August 2016, all EventLogging data is by default purged after 90 days to comply with WMF's data retention guidelines. Individual columns within tables can be whitelisted so that their data is retained indefinitely in the MariaDB databases; generally, all columns can be whitelisted except the clientIp and userAgent fields. To have your data added to the whitelist, contact Analytics. Note that new schemas are purged by default.

Read more on this topic at Analytics/EventLogging/Data retention and auto-purging. For implementation details, see T108850.

Accessing data

MariaDB

Data stored by EventLogging for the various schemas has varying degrees of privacy, including personally identifiable information and sensitive information; access to it therefore requires an NDA.

See Analytics/EventLogging/Data representations for an explanation on where the data lives and how to access it.

See also: Analytics/Data access#Analytics slaves.

Sample query

Note that you need SSH access to stat1006 and must also be authorized to access the database.

On stat1006.eqiad.wmnet, type this command:

mysql --defaults-extra-file="/etc/mysql/conf.d/research-client.cnf" -h analytics-store.eqiad.wmnet -e "select left(timestamp,8) ts , COUNT(*) from log.NavigationTiming_10785754 where timestamp >= '20150402062613' group by ts order by ts";
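The left(timestamp, 8) expression buckets MediaWiki-style YYYYMMDDHHMMSS timestamps by day. For illustration, the same grouping in plain Python (sample values invented):

```python
from collections import Counter

# MediaWiki timestamps are YYYYMMDDHHMMSS strings, so the first 8
# characters identify the day -- the same bucketing the SQL query's
# left(timestamp, 8) performs. Sample values are invented.
timestamps = ["20150402062613", "20150402091500", "20150403120000"]
per_day = Counter(ts[:8] for ts in timestamps)
print(sorted(per_day.items()))
# [('20150402', 2), ('20150403', 1)]
```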

Hadoop

Hadoop. Archived Data

Some big tables were archived from MySQL to Hadoop. They were exported with Sqoop into Avro-format files, and Hive tables were created according to each schema. So far the following tables are archived in Hadoop, in the archive database, which means they are no longer available on MySQL:

mobilewebuiclicktracking_10742159_15423246 
MobileWikiAppToCInteraction_10375484_15423246
pagecontentsavecomplete_5588433_15423246
MediaViewer_10867062_15423246
PageCreation_7481635
PageCreation_7481635_15423246
PageDeletion_7481655
PageDeletion_7481655_15423246


You can query these tables just like any other table in Hive. A tip for dealing with binary types:

select *  from Some_tbl where (cast(uuid as string) )='ed663031e61452018531f45b4b5502cb';

Caveat: This process does not preserve the data types of e.g. bigint or boolean fields. The archived Hive table stores them as strings instead, which need to be converted back (e.g. CAST(field AS BIGINT)).

Hadoop. Live Data

EventLogging data is imported hourly into Hadoop by Camus. It is written to directories named after each schema, in hourly partitions in HDFS: /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_<schema>/hourly/<year>/<month>/<day>/<hour>. There are many ways to access this data, including Hive and Spark. Below are a few examples; there may be other (better!) ways to do this.
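For example, the partition directory for a given schema and hour can be built like this (a small convenience sketch; the helper name is ours, and the schema name is just an example):

```python
from datetime import datetime

# Build the HDFS directory for one hourly partition of a schema,
# following the layout described above.
def hourly_partition(schema, when):
    return (
        "/wmf/data/raw/eventlogging/eventlogging_%s/hourly/%04d/%02d/%02d/%02d"
        % (schema, when.year, when.month, when.day, when.hour)
    )

print(hourly_partition("Edit", datetime(2015, 10, 21, 16)))
# /wmf/data/raw/eventlogging/eventlogging_Edit/hourly/2015/10/21/16
```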

Advantages of processing EL data in Hadoop (lightning talk slide)

Note that all EventLogging data in Hadoop is automatically purged after 90 days; the whitelist of fields to retain is not used, but this feature could be added in the future if there is sufficient demand.

Hive

Hive has a couple of built-in functions for parsing JSON. Since EventLogging records are stored as JSON strings, you can access this data by creating a Hive table with a single string column and then parsing that string in your queries:

ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- Make sure you don't create tables in the default Hive database.
USE otto;

-- Create a table with a single string field
CREATE EXTERNAL TABLE `CentralNoticeBannerHistory` (
  `json_string` string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 
  '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory';

-- Add a partition
ALTER TABLE CentralNoticeBannerHistory
ADD PARTITION (year=2015, month=9, day=17, hour=16)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/16';

-- Parse the single string field as JSON and select a nested key out of it
SELECT get_json_object(json_string, '$.event.l.b') as banner_name
FROM CentralNoticeBannerHistory
WHERE year=2015;

Use UAParserUDF to work with userAgents:

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION ua as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';

SELECT a.user_agent_map["browser_family"] as browser_family, count(*) FROM (
  SELECT ua(get_json_object(json_string, '$.userAgent')) AS user_agent_map
  FROM CentralNoticeBannerHistory
  WHERE year=2015 and month=9 and day=17 and hour=16
) AS a
GROUP BY a.user_agent_map["browser_family"];
Spark

Spark Python (pyspark):

import json
data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/07")
records = data.map(lambda x: json.loads(x[1]))
records.map(lambda x: (x['event']['l'][0]['b'], 1)).countByKey()
Out[33]: defaultdict(<class 'int'>, {'WMES_General_Assembly': 5})

MobileWikiAppFindInPage events with SparkSQL in Spark Python (pyspark):

# Load the JSON string values out of the compressed sequence file.
# Note that this uses * globs to expand to all data in 2016.
data = sc.sequenceFile(
    "/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*"
).map(lambda x: x[1])

# parse the JSON strings into a DataFrame
json_data = sqlCtx.jsonRDD(data)
# Register this DataFrame as a temp table so we can use SparkSQL.
json_data.registerTempTable("MobileWikiAppFindInPage")

top_k_page_ids = sqlCtx.sql(
"""SELECT event.pageID, count(*) AS cnt
    FROM MobileWikiAppFindInPage
    GROUP BY event.pageID
    ORDER BY cnt DESC
    LIMIT 10"""
)
for r in top_k_page_ids.collect():
    print "%s: %s" % (r.pageID, r.cnt)

Edit events with SparkSQL in Spark scala (spark-shell):

// Load the JSON string values out of the compressed sequence file
val edit_data = sc.sequenceFile[Long, String](
    "/wmf/data/raw/eventlogging/eventlogging_Edit/hourly/2015/10/21/16"
).map(_._2)

// parse the JSON strings into a DataFrame
val edits = sqlContext.jsonRDD(edit_data)
// Register this DataFrame as a temp table so we can use SparkSQL.
edits.registerTempTable("edits")

// SELECT top 10 edited wikis
val top_k_edits = sqlContext.sql(
    """SELECT wiki, count(*) AS cnt
    FROM edits
    GROUP BY wiki
    ORDER BY cnt DESC
    LIMIT 10"""
)
// Print them out
top_k_edits.foreach(println)

Kafka

There are many Kafka tools with which you can read the EventLogging data streams; kafkacat is one of them, and it is installed on stat1005.

# Uses kafkacat CLI to print window ($1)
# seconds of data from $topic ($2)
function kafka_timed_subscribe {
    timeout $1 kafkacat -C -b kafka1012 -t $2
}

# Prints the top K most frequently
# occurring values from stdin.
function top_k {
    sort        |
    uniq -c     |
    sort -nr    |
    head -n $1
}

while true; do
    date; echo '------------------------------' 
    # Subscribe to eventlogging_Edit topic for 5 seconds
    kafka_timed_subscribe 5 eventlogging_Edit |
    # Filter for the "wiki" field 
    jq .wiki |
    # Count the top 10 wikis that had the most edits
    top_k 10
    echo ''
done
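For comparison, the top_k shell helper above is just a frequency count; the same thing in Python looks like this (sample values invented):

```python
from collections import Counter

# Equivalent of the shell top_k helper (sort | uniq -c | sort -nr | head):
# count occurrences of each value and keep the k most frequent.
def top_k(values, k):
    return Counter(values).most_common(k)

sample = ["enwiki", "dewiki", "enwiki", "wikidatawiki", "enwiki", "dewiki"]
print(top_k(sample, 2))
# [('enwiki', 3), ('dewiki', 2)]
```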

Generating reports and dashboards

In addition to ad-hoc queries, there are a couple of tools that make it easy to generate periodic reports on EventLogging data and display them in the form of dashboards. You can find more info on them here:

Publishing data

See Analytics/EventLogging/Publishing for how to proceed if you want to publish reports based on EventLogging data, or datasets that contain EventLogging data.

Operational support

Tier 2 support

Analytics/Tier2

Outages

Any outages that affect EventLogging will be tracked on Incident documentation (also listed below) and announced to the lists eventlogging-alerts@lists.wikimedia.org and ops@lists.wikimedia.org.

Alarms

Alarms at this time come to the Analytics team. We are working on being able to claim alarms in icinga.

Contact

You can contact the analytics team at: analytics@lists.wikimedia.org

For developers

Codebase

The EventLogging Python codebase can be found at https://gerrit.wikimedia.org/r/#/admin/projects/eventlogging

Architecture

See Analytics/EventLogging/Architecture for EventLogging architecture.

Performance

On this page you'll find information about EventLogging performance, such as load tests and benchmarks:

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Performance

Size limitation

There is a limitation on the size of individual EventLogging events due to the underlying infrastructure (the limited size of URLs in Varnish's varnishncsa/varnishlog, as well as Wikimedia UDP packets). For the purposes of the size limitation, an "entry" is a /beacon request URL containing urlencoded, JSON-stringified event data. Entries longer than 1014 bytes are truncated. When an entry is truncated, it fails validation because it can no longer be parsed (the result is invalid JSON).

This should be taken into account when creating a schema: avoid large schemas, as well as schema fields with long keys and/or values. Consider splitting up a very large schema, or replacing long fields with shorter ones.

To aid with testing the length of events, EventLogging's dev-server logs a warning to the console for each event that exceeds the size limit.
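When in doubt, you can also estimate an event's encoded size before deploying the instrumentation. A hedged sketch (the capsule fields are invented; 1014 bytes is the truncation limit described above):

```python
import json
from urllib.parse import quote

# Estimate the /beacon payload size for an event capsule: the capsule is
# JSON-stringified and urlencoded before being sent. The fields here are
# invented for illustration; 1014 bytes is the truncation limit.
LIMIT = 1014

def encoded_size(capsule):
    return len(quote(json.dumps(capsule, separators=(",", ":"))))

capsule = {
    "event": {"action": "open", "pageId": 42},
    "schema": "ExampleSchema",
    "revision": 1,
}
size = encoded_size(capsule)
print(size, size <= LIMIT)
```

Long keys and values inflate this number quickly once urlencoding expands quotes and braces, which is why the guideline above recommends short field names.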

Monitoring

You can use various tools to monitor operational metrics, read more in this dedicated page:

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Monitoring

Testing

The EventLogging extension can easily be tested on Vagrant, as described on mediawiki.org at Extension:EventLogging. The server side of EventLogging (the consumer of events) does not have a Vagrant setup for testing, but it can be tested in the Beta Cluster:

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster

How do I ...?

Visit the EventLogging how-to page. It contains dev-ops tips and tricks for EventLogging, such as deploying, troubleshooting, and restarting. Please add any step-by-step guides for EventLogging dev-ops tasks there.

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/How_to

Administration. On call

Here's a list of routine tasks to perform when on call for EventLogging.

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall


Data Quality Issues

Changes and Known Problems with Dataset

2017-07-10 to 2017-07-12 (Task T170486): Some data was not inserted into MySQL, but was backfilled for all schemas except page-create. During the backfill, bot events were also accidentally backfilled, resulting in extra data during this time.
2017-05-24 onwards (Task T67508): Data from bots is no longer accepted by EventLogging unless the bot user agent matches "MediaWiki".
2017-03-29 onwards (Task T153207): The userAgent field in the event capsule was changed.

Incidents

Here's a list of all related incidents and their post-mortems. To add a new page to this generated list, use the "EventLogging/Incident_documentation" category.

For all the incidents (including ones not related to EventLogging) see: Incident documentation.

Limits of the eventlogging replication script

The log database is replicated to the eventlogging slave databases via a custom script called eventlogging_sync.sh (stored in operations/puppet, for the curious). While working on https://phabricator.wikimedia.org/T174815, we realized that the script was not able to replicate high-volume events in real time, showing a lot of replication lag (even days, in the worst-case scenario). Please review the task for more info, or contact the Analytics team if you have more questions.

See also