Analytics/Systems/EventLogging/Backfilling

From Wikitech

Backfilling a kafka eventlogging_<Schema> topic

It is possible that an eventlogging_<Schema> topic is missing data (e.g. after a host issue) while the event data is in fact present in the eventlogging-client-side topic. Since eventlogging-client-side is also imported into Hadoop, you can run a local backfill eventlogging-processor on a stat machine (the eventlogging role was recently deployed there) and pump the data back into the topics where it belongs.

  • Get the files from Hadoop for the time intervals that are missing (the files are stored compressed, but you want them in plain text, which is what hdfs dfs -text outputs):
hdfs dfs -text /wmf/data/raw/eventlogging_client_side/eventlogging-client-side/hourly/2019/04/02/21/* > 2019-04-21.txt
  • Clean up the files a bit; each line has an extra sequence-number field:
more 2019-04-01-19.txt | awk '{$1=""; print substr($0,2)}' | awk 'NF' > 2019-04-01-19-plain.txt
  • Check out eventlogging to your home directory, then do:
export PYTHONPATH=/home/nuria/eventlogging
export PATH="$PATH:/home/nuria/eventlogging/bin"
eventlogging-processor --help
  • Once you have a file you want to pump into Kafka, run the following (be careful with the quotes; the format is picky):

You need to whitelist the proxies:

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

ionice cat 2019-04-01-19-plain.txt | eventlogging-processor '%q %{recvFrom}s %{seqId}d %D %{ip}i %u' 'stdin://' 'kafka-confluent:///kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092?topic=eventlogging_{schema}&message.send.max.retries=6&retry.backoff.ms=200' --output-invalid


  • After Camus runs, there should be new files in the corresponding partition, e.g.:
ls -la /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_VirtualPageView/hourly/2019/04/01/20
  • If the backfill is putting events more than 1 month in the past, you need to run refine manually for the data to be refined (on an-coord1001):
sudo -u hdfs /usr/local/bin/refine_eventlogging_analytics --since 2019-04-01T19:00:00 --until 2019-04-01T21:00:00
  • events_sanitized might also need to be backfilled.
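The awk cleanup step above can be sanity-checked on a dummy line (the sequence number and payload here are made up; real lines carry url-encoded event data):

```shell
# Strip the leading sequence-number field, as the awk one-liner above does.
# "12345" stands in for the sequence number; the payload is a made-up sample.
echo '12345 {"event":"sample"}' | awk '{$1=""; print substr($0,2)}'
```

Blanking $1 rebuilds the line without the first field, and substr drops the leftover separator; the trailing awk 'NF' in the original pipeline then removes any blank lines.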


Get all events from kafka topics for a given timeframe

We need to get all events for a given timeframe from Kafka; that can be done by consuming N messages starting from an offset. The following command will print out one message and its offset:

kafkacat -C -b kafka1012.eqiad.wmnet:9092  -c 1 -t eventlogging-client-side  -f "%o %s\n"

The following command will return 12 messages starting at offset 3597866000:

kafkacat -C   -b kafka-jumbo1002.eqiad.wmnet:9092 -t eventlogging-client-side  -o 3597866000 -c12
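Offsets are sequential integers per partition, so the message count between two offsets is a simple difference; a quick sanity check (the offsets below are just examples):

```shell
# Messages between a start offset (inclusive) and an end offset (exclusive);
# this is the value to pass to kafkacat's -c flag.
START=3597866000
END=3597866012
echo $((END - START))
```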

Backfilling refine

If an interval of data is missing or corrupted in Hive's event database (eventlogging tables), we can re-run the refine process.

  1. SSH into an-coord1001.eqiad.wmnet
  2. Execute systemctl list-timers | grep refine_eventlogging
  3. Choose the timer that corresponds to the data that you want to correct and run systemctl cat refine_eventlogging_<foo>
  4. The results will show you the script that is used to run the EventLogging refine in the ExecStart field, i.e.: ExecStart=/usr/local/bin/refine_eventlogging_analytics.
  5. Cat it and copy the command. You don't need to copy the is-yarn-app-running part of the command, just the spark2-submit part. You can also leave the "${@}" out.
  6. You might want to prepend sudo -u hdfs (or sudo -u analytics). It's also a good idea to rename the command, e.g. by prefixing it with backfill_. Add --since and --until parameters to override the properties file. If you're not sure whether you need to override other properties (like emails_to or table_blacklist_regex, etc.), get the properties file path from the spark2-submit command and cat it.
  7. Run your tailored command in an-coord1001.eqiad.wmnet.
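The --since and --until flags take ISO-8601 timestamps, as in the refine example earlier. As a sketch (the two-hour window here is hypothetical), GNU date can build hour-aligned values for you:

```shell
# Build hour-aligned --since/--until flags for a hypothetical backfill window.
SINCE=$(date -u -d '2019-04-01 19:00' +%Y-%m-%dT%H:00:00)
UNTIL=$(date -u -d '2019-04-01 21:00' +%Y-%m-%dT%H:00:00)
echo "--since $SINCE --until $UNTIL"
```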

Backfilling sanitization

If an interval of data is missing, corrupted or out of date in Hive's event_sanitized database, we can re-run the sanitization process. Note that EL sanitization already does a second pass 45 days after data collection. So if the data you want to backfill is not older than 45 days, you don't need to backfill it (it will be done automatically after 45 days), unless it's urgent!

  1. SSH into an-coord1001.eqiad.wmnet
  2. Execute systemctl cat refine_sanitize_eventlogging_analytics_delayed
  3. The results will show you the script that is used to run the EventLogging sanitization in the ExecStart field, i.e.: ExecStart=/usr/local/bin/refine_sanitize_eventlogging_analytics_delayed.
  4. Cat it and copy the command. You don't need to copy the is-yarn-app-running part of the command, just the spark2-submit part. You can also leave the "${@}" out.
  5. You might want to prepend sudo -u hdfs (or sudo -u analytics). It's also a good idea to rename the command, e.g. by prefixing it with backfill_. Add --since and --until parameters to override the properties file. If you're not sure whether you need to override other properties (like emails_to or table_blacklist_regex, etc.), get the properties file path from the spark2-submit command and cat it.
  6. Run your tailored command in an-coord1001.eqiad.wmnet.

Backfilling mysql

From eventlog1002.eqiad.wmnet, find a log file that you want to backfill from, and optionally zcat parts out of it (for example, by piping through head / tail: | tail -n +2000000 | head -n 10000). Let's say we picked /srv/log/eventlogging/archive/all-events.log-20151015.gz and wanted to backfill only the first 100 events:

ionice zcat /srv/log/eventlogging/archive/all-events.log-20151015.gz | head -n 100 | python eventlogging-consumer stdin:// 'mysql://eventlog:<<password found in /etc/eventlogging.d/consumers/mysql-m4-master>>@m4-master.eqiad.wmnet/log?charset=utf8&statsd_host=statsd.eqiad.wmnet&replace=True'
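The tail -n +K | head -n N line-range trick mentioned above can be checked on dummy data before pointing it at a real log:

```shell
# Select lines 3-5 of a stream: skip the first 2 lines, then take 3.
printf 'e1\ne2\ne3\ne4\ne5\ne6\n' | tail -n +3 | head -n 3
```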

Loading a full file:

eventlogging@eventlog1001:/home/otto$ cat Translation_Beta_Events.json | python /srv/deployment/eventlogging/analytics/bin/eventlogging-consumer stdin:// 'mysql://xxx:xxx@m4-master.eqiad.wmnet/log?charset=utf8&statsd_host=statsd.eqiad.wmnet&replace=True'

This can be run from the eventlogging machine but also from 1002; if you are running this from 1002 you might need to set:

export PYTHONPATH=/srv/deployment/eventlogging/eventlogging
export DJANGO_SETTINGS_MODULE='THIS IS A BUG'

And qualify all paths, like:

/usr/bin/python /srv/deployment/eventlogging/eventlogging/bin/eventlogging-consumer @/home/some/consumer-config

Old way (kept as it may be useful)

This document explains how to backfill EL data as of 2015-02-16. Please note that any redesigns of the service will affect how backfilling is done.

Prerequisites

You need sudo on the Beta Cluster to test the backfilling scripts, and also sudo on eventlog1002 to do the backfilling for real: EventLogging/Testing/BetaCluster

This document describes how to do backfilling from "processed" events. If you need to backfill from raw events, like the ones stored in the client-side log, additional steps are needed. The idea is the same, only that a "process" step needs to be included so raw events can be processed before being inserted into the db.

Note that from this change onwards [1], eventlog1002 only keeps logs for the last 30 days, so backfilling of an outage should be done as soon as possible.

First step (data preparation)

In the first step, split the logs for the relevant day into files of 64K lines. This size ensures you don't run into memory issues.

(Having such small files gives good control over what timespan you want to backfill, and it allows for easy parallelization, speed-up, and fine-control during data injection).

Events can be split with a command like this:

 mkdir split && cd split && ionice nice zcat /srv/log/eventlogging/archive/all-events.log-20141114.gz >all-events.log && ionice nice split --lines=64000 all-events.log && rm all-events.log
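You can see what split --lines produces on dummy data (10 lines in chunks of 4, versus the 64000 used above); split names the output files xaa, xab, xac, ...:

```shell
# Split 10 dummy lines into 4-line chunks in a scratch directory.
d=$(mktemp -d)
seq 10 > "$d/all-events.log"
(cd "$d" && split --lines=4 all-events.log)
wc -l "$d"/x*
```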

Raw Events

If you need to backfill raw events you might find this snippet useful: https://gist.github.com/nuria/e837d16b94c09a4df8a4. Raw event logs (client-side and server-side) include a bunch of characters that need to be removed before they can be processed by the processors.
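The gist above has the actual cleanup; purely as an illustrative assumption (not the gist's exact commands), stripping non-printable bytes from a line could look like:

```shell
# Hypothetical cleanup: drop non-printable bytes (keeping newlines).
# The \x01 here stands in for whatever control characters the raw logs carry.
printf 'ev\x01ent\n' | tr -cd '[:print:]\n'
```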

Second step (data injection)

You should test your scripts and code in Beta Cluster before trying this on vanadium.

Checkout a separate clone of EventLogging

The injection is better done using a separate clone of EventLogging. That way the backfilling is not subject to interruptions from others' EventLogging deployments, and you can use an EventLogging version of your choice.

See for example the changes done previously to be able to backfill events one by one (not batched): [2]

To run EventLogging from your local checkout, you need to change the Python library search path. So, if you checked out the EL code in your home directory, you would need to tell Python where to build it:

cd ~/EventLogging/server
export  PYTHONPATH='/home/nuria/backfilling/python'
python ./setup.py develop --install-dir=/home/nuria/backfilling/python

These commands build EL into `/home/nuria/backfilling/python`.

Start a Backfilling Consumer

In a simple for loop over those split files (or parts of them in parallel), start a separate EventLogging consumer (that consumes from stdin and writes to m2-master) and pipe each file in. The config for this EventLogging consumer is just a copy of the m2 consumer's config with its input swapped to stdin. Rename this config so the process is easy to find when running htop.

Config looks as follows:

nuria@vanadium:~/backfilling$ more mysql-m2-master-BACKFILLING
stdin://
mysql://some-connection-string?charset=utf8&replace=True

Note that the regular consumer batches events. Using that code as-is to backfill is fine if you are dealing with a total outage. If you have a problem with events dropped within the event stream, you cannot batch insertion. Thus, you might need to make code changes to the consumer to be able to backfill:

I had to do these changes on 201502: https://gerrit.wikimedia.org/r/#/c/190139/

To check whether your changes are working (again, in the Beta Cluster):

/usr/bin/python -OO ./python/eventlogging-consumer @/home/nuria/backfilling/mysql-m2-master-BACKFILLING  > log-backfill.txt 2>&1

For each of the started consumers (I could only start two without the db falling too far behind), capture stdout, stderr, and the exit code to separate (per-input) files. This makes it easy to verify that backfilling did not bail out, and to correlate log files with input files.


A simple shell script to loop over files and consume each:

 #!/bin/bash

 # Consume each split file with a separate backfilling consumer,
 # capturing stdout/stderr per input file for later verification
 # (so don't delete the log files until you have checked them).
 fileList=`ls 20150208/x*`

 for f in $fileList
 do
    l="${f##*/}"
    ionice nice cat $f | ionice nice  /usr/bin/python -OO ./python/eventlogging-consumer @/home/nuria/backfilling/mysql-m2-master-BACKFILLING  > log-backfill-${l}.txt 2>&1
 done

Monitoring

There are two things to monitor: the database and the eventlogging hosts. You can monitor the eventlogging hosts with htop; the database stats appear here: https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306