Data Engineering/Systems/Reportupdater

reportupdater is a Python program that periodically executes given SQL queries (or scripts with SQL-like output) and appends their results to timeline TSV report files. If you have cron jobs harvesting data on any of the analytics data machines, you can likely benefit from using reportupdater. Think of it as a more "robust" way to execute cron jobs.

Some datasets generated by reportupdater: https://analytics.wikimedia.org/datasets/periodic/reports/metrics/

Some of its features are:

  • Detects which data points are missing in the reports and executes the corresponding queries to update those missing values.
  • Supports different report granularities: daily, weekly and monthly.
  • Allows parametrizing the query (or script) to generate several reports, one for each value of the parameter.
  • Generates consistent and comprehensive logs with different levels.
  • Is robust against interrupted or parallel execution, so report files don't get corrupted.
  • Collects data from the MediaWiki databases, the EventLogging database and the Hive cluster.

Download

Get the code from Gerrit. The code is also mirrored to Wikimedia's GitHub.

Dependencies

You need Python 3 to run reportupdater.

To install dependencies, cd into reportupdater's root directory and execute pip install -r requirements.txt.

Queries and scripts

You should write a set of SQL queries or executable scripts (or a combination of both) to be run by reportupdater. Regardless of which you choose, your code must return a single data point per metric, per day. For example, if splitting by wiki, the metric path should include {wiki} and the output should include exactly one row for each value of the wiki column. Additional dimensions can be included as columns in the output, and should be included in the metric path, for example Foo.byEditCount.{edit_count_bucket}.byWiki.{wiki}.

Reportupdater will execute the query for each specified date interval to build the whole timeline report.

There are a couple of conventions that your code should follow:

SQL Query conventions

  1. Use the placeholders {from_timestamp} and {to_timestamp} to timebound the results, for example: WHERE timestamp >= '{from_timestamp}' AND timestamp < '{to_timestamp}' (note that from_timestamp should be inclusive, while to_timestamp should be exclusive). Their format is YYYYMMDD000000.
  2. The first column of the results must be DATE('{from_timestamp}') AS date. This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  3. There is no restriction on the number, format or contents of the subsequent columns.
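
For example, a minimal daily query following these conventions could look like the sketch below (the table and column names are made up):

-- Hypothetical example: count events per day from a made-up table.
SELECT
    DATE('{from_timestamp}') AS date,
    COUNT(*) AS events
FROM some_event_table
-- {from_timestamp} is inclusive, {to_timestamp} is exclusive
WHERE timestamp >= '{from_timestamp}'
  AND timestamp < '{to_timestamp}';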

Script conventions

  1. The script file MUST be marked as EXECUTABLE; otherwise you will get a cryptic "permission denied" error.
  2. The first two parameters passed to the script are start_date and end_date, in YYYY-MM-DD format. Use them to timebound the results (note that start_date should be inclusive, while end_date should be exclusive).
  3. The output of the script must be in TSV format and must start with a header row.
  4. The first column must be equal to the start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  5. There is no restriction on the number, format or contents of the subsequent columns.
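
For example, a minimal script following these conventions could be sketched in Python like this (the metric computation is a placeholder; only the argument handling and the output format matter here):

#!/usr/bin/env python3
# Hypothetical example script for reportupdater.
# The first two arguments are start_date and end_date (YYYY-MM-DD);
# start_date is inclusive, end_date is exclusive.
import sys

def main():
    start_date, end_date = sys.argv[1], sys.argv[2]
    # Any explode_by values arrive as additional arguments,
    # in alphabetical order of their keys.
    extra_args = sys.argv[3:]

    # Compute your metric for [start_date, end_date) here;
    # this sketch just emits a constant value.
    value = 42

    print("date\tsome_metric")                  # header row
    print("{}\t{}".format(start_date, value))   # first column equals start_date

if __name__ == "__main__":
    main()

Remember to mark the file as executable (chmod +x) so reportupdater can run it.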

Where to put these files?

You should put all queries and scripts inside the same dedicated directory. When executing reportupdater, you'll pass that directory as an argument, so reportupdater knows where to find your code. If you want to host your datasets at https://analytics.wikimedia.org/datasets/periodic/reports/metrics/, put your queries and scripts in a repository and ask the Analytics team to clone it to the appropriate location.

Sample queries executed by other teams

Against the EventLogging database:

https://github.com/wikimedia/analytics-discovery-stats

Config file

You also need to write a YAML config file that will tell reportupdater how to run the reports. There are four main sections in the config file:

The databases section

If you're not using SQL queries, you can skip this section. The databases section specifies which databases to connect to. Usually it should include the host, the port, the credentials file and the database name, like this:

databases:
    any_db_id:
        host: analytics-slave.eqiad.wmnet
        port: 3306
        creds_file: /etc/mysql/conf.d/research-client.cnf
        db: log

But if your queries hit the analytics replicas (formerly dbstore1002 or analytics-store), the configuration is different. The analytics replicas have been restructured for performance and maintainability, and they are now divided into shards. Reportupdater provides a specific configuration option so you don't need to deal with shards yourself: set `auto_find_db_shard` to true and drop `host` and `port`, like this:

databases:
    analytics_replica:
        auto_find_db_shard: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki

Gotcha: if the table that you're querying belongs to an extension (e.g. the Echo tables), it lives in a separate shard called x1. Indicate this by setting `use_x1` to true:

databases:
    analytics_replica:
        auto_find_db_shard: true
        use_x1: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki

Now, if you are using the `explode_by` feature, e.g. with `explode_by: wiki`, then a single query connects to several databases. To specify that, set `wiki_db_placeholder` to the name of the placeholder you're using to explode by database name (in this case `wiki`), and remove the `db` parameter:

databases:
    analytics_replica:
        auto_find_db_shard: true
        wiki_db_placeholder: wiki
        creds_file: /some/path/to/.my.cnf

The graphite section

It specifies which graphite server to send stats to, and should include the host and the port. It can optionally include a dictionary of lookups, where each key identifies a value used in a metric execution. For example, if your metric explodes by wiki, you can provide a lookup dictionary for the identifier "wiki"; this could be used to translate the wiki database name, like enwiki, to the wiki hostname, like en.wikipedia. The value in the lookups dictionary is the path of a file to load as that lookup. Alternatively, if you'd like to translate values from a column in your report, you can provide a dictionary for "column1", for example.

graphite:
    host: some.graphite.host
    port: 1234
    lookups:
        wiki: /some/path/to/wiki-lookup.yaml
        column1: /some/path/to/column1-lookup.yaml
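
The format of the lookup file itself isn't specified here; as a rough sketch, assuming it is a plain YAML mapping from report values to the names to use in graphite, the content of /some/path/to/wiki-lookup.yaml could look like this (hypothetical):

enwiki: en.wikipedia
dewiki: de.wikipedia
eswiki: es.wikipedia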

If you're not sending stats to graphite, you can skip this section.

The reports section

It describes the reports to be executed by reportupdater. For each report, there are three mandatory fields you must provide: the report id, the granularity and the start date; and there are some optional fields you can also specify, like this:

reports:
    anyReportId:
        granularity: days
        starts: 2016-01-01
    anyOtherReportId:
        granularity: months
        starts: 2014-06-15
        type: script
        lag: 10800
    yetAnotherReportId:
        granularity: weeks
        starts: 2000-01-01
        max_data_points: 10
        explode_by:
            dimension1: valueA, valueB, valueC
            dimension2: valueK, valueL
    graphiteReportId:
        granularity: days
        explode_by:
            wiki: enwiki, dewiki, eswiki
        graphite:
            path: "{_metric}.{column0}.{wiki}"
            metrics:
                "graphite.metric.name.prefix1": column1ToSendAsValue
                "graphite.metric.name.prefix2": column2ToSendAsValue
The report id
It must match the name of the query or script file. For example, if you're using a SQL query named random_report.sql, then the report id must be random_report (without the .sql extension). If you're using a script, name it the same as the report id. The report id also dictates the name of the generated report: if you use the id random_report, the resulting report will be named random_report.tsv.
granularity
days, weeks or months. Depending on this parameter, reportupdater will execute your queries/scripts every day, week or month, passing the corresponding date range to them. Mandatory.
starts
Indicates the first date to be computed (inclusive). Its format should be YYYY-MM-DD. Mandatory.
type
sql or script. If sql, reportupdater will interpret your custom executable as SQL code. If script, it will execute your custom code via the command line. Optional. Default: sql.
lag
Positive integer. The time to wait (in seconds) to start the execution of the query/script after the desired time interval has finished. It can be useful to wait until the source data is loaded into the database. Optional. Default: 0.
db
String. Should be one of the database identifiers specified in the databases section. Indicates which database reportupdater should connect to in order to execute the query. It only works with sql reports; if your report is of type script, this option is ignored. Optional. Default: see defaults section.
max_data_points
Positive integer. The maximum number of data points the report will contain. Note that it depends on the granularity: if you have weekly granularity and set max_data_points to 10, your report will hold the data for the last 10 weeks. Optional. If not set, the report will hold all data points since the specified starts date.
execute
String. You may want to have two or more reports using the same query or script. You can do so by assigning a query or script name to the execute field of a report, as shown in the sketch after this list of fields. Optional. If not set, reportupdater will execute the query/script that matches the report identifier.
explode_by
Dictionary<string: list> (in YAML format). If you want to have N identical reports, one for each section of your product, you can configure that using explode_by. For example: wiki: enwiki, dewiki, jawiki, frwiki would generate 4 reports, one for each wiki. Note that you would have to include the {wiki} placeholder in your SQL query to slice its results. If you're using a script, all explode_by values will be passed as arguments to the script (right after the date arguments) in alphabetical order. If you want, you can specify a file path instead of the value list; if you do so, reportupdater will read that file to get the value list (one value per line, separated by newlines).
graphite
Dictionary. If you want to send your report results to graphite, you must configure the metric names and values that will be sent. In this section, you define two things. The first is a path template string: a Python format string that will be filled in with values from a single report result row, keyed by column name, with values from the explode_by section (keyed by the explode_by key), and with the special {_metric} placeholder described next. The second is a dictionary of metrics to send: each key is substituted into the path string where you use {_metric}, and each value is the column to use for the actual value sent to graphite. For a complete example using graphite, see https://gerrit.wikimedia.org/r/#/c/322007/6/reports/config.yaml.
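
For instance, the following sketch (report and query names are made up) reuses one shared query for a daily and a monthly report via the execute field, assuming the query file is named edits.sql and is referenced by its base name:

reports:
    edits_daily:
        granularity: days
        starts: 2018-01-01
        execute: edits
    edits_monthly:
        granularity: months
        starts: 2018-01-01
        execute: edits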

The defaults section

Any value that you assign here will apply to all reports. If a report also has a value for that same key, it'll override the default value.

The examples below show a config with defaults compared to an equivalent config without defaults:

With defaults
 defaults:
     type: script
     starts: 2018-01-01
     explode_by:
         wiki: wikis.txt
 
 daily_report:
     granularity: days
 
 weekly_report:
     granularity: weeks
 
 monthly_report:
     granularity: months
Without defaults
daily_report:
    type: script
    starts: 2018-01-01
    granularity: days
    explode_by:
        wiki: wikis.txt

weekly_report:
    type: script
    starts: 2018-01-01
    granularity: weeks
    explode_by:
        wiki: wikis.txt

monthly_report:
    type: script
    starts: 2018-01-01
    granularity: months
    explode_by:
        wiki: wikis.txt

How to execute?

Use the update_reports.py executable in reportupdater's root directory.

See source.

usage: update_reports.py [-h] [--config-path CONFIG_PATH]
                         [--wikis_path WIKIS_PATH] [-l LOG_LEVEL]
                         query_folder output_folder

Periodically execute SQL queries or scripts and write/update the results into
TSV files.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  output_folder         Folder to write the TSV files to.

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
                        <query_folder>/config.yaml.
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        (debug|info|warning|error|critical)

You'll also find it useful to create a cron job that executes reportupdater periodically, for example every hour.
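
A hypothetical crontab entry (all paths are illustrative) that runs reportupdater every hour could look like this:

0 * * * * /usr/bin/python3 /path/to/reportupdater/update_reports.py /path/to/queries /path/to/output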

Re-runs

Sometimes the data sources are incomplete or corrupt at the time the queries run. In those cases, reportupdater might generate wrong values for a given date. If this happens, you can use the re-run functionality of reportupdater: the rerun_reports.py executable in the root directory.

usage: rerun_reports.py [-h] [--config-path CONFIG_PATH] [-r REPORT]
                        query_folder start_date end_date

Mark reports to be re-run for a given date range.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  start_date            Start of the date range to be rerun (YYYY-MM-DD,
                        inclusive).
  end_date              End of the date range to be rerun (YYYY-MM-DD,
                        exclusive).

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
                        <query_folder>/config.yaml.
  -r REPORT, --report REPORT
                        Report to be re-run. Several reports can be specified
                        like this. If none is specified, all reports listed in
                        the config file are marked for re-run.

Note that the execution of rerun_reports.py does not re-run the reports per se. It only marks the given reports and dates so that update_reports.py re-runs them in its next execution. Please do not alter report files manually! Doing so can result in permission problems.

Example: rerunning all history for the active-boards report from the limn-flow-data repository, in the flow folder:

python /srv/reportupdater/reportupdater/rerun_reports.py -r active-boards /srv/reportupdater/jobs/limn-flow-data/flow 2009-11-08 2017-12-11

WMF-specific

How to test?

To test your queries against the real database, just run the queries there before submitting a code patch.

To test on any analytics client machine:

  • you need both repositories cloned: reportupdater-queries and reportupdater
  • get the reportupdater-queries changeset you are interested in testing
  • if you're testing Hive queries, make sure you've authenticated through kinit
  • if you're testing queries against MariaDB databases, make sure the credentials file is configured correctly
  • /usr/bin/python3 /path/to/reportupdater/update_reports.py -l info /path/to/reportupdater-queries/<your directory>/ ~/reportupdater-output
  • It might be necessary to execute reportupdater as the analytics user; if you do so, grant write permissions on the reportupdater-queries repo so that the pid file can be created.

How to productionize?

  1. Add your queries/scripts and config file to a new folder within the reportupdater-queries repo, and ask someone in the Analytics team to review.
  2. Ask the Analytics team to puppetize the execution of your reportupdater instance, or create a puppet patch yourself adding your reportupdater job in [1], and ask someone in the Analytics team to review. There are two puppet modules that run jobs: one for MySQL and one for Hive.

Troubleshooting

Where are the logs?

For reports running on cron, logs are written to /srv/reportupdater/log. For reports running on systemd timers, see how logs are accessed at Analytics/Systems/Managing_systemd_timers.

Where is the puppet code?

Support

reportupdater is maintained by WMF's Analytics team.

Next features

  • Today the queries and scripts have to output a first column with the date. However, this is not necessary and the requirement could be removed.
  • Make all report config fields defaultable in the defaults section, like db.