Analytics/Systems/Reportupdater



Intro and Value Proposition

reportupdater is a Python program that periodically executes given SQL queries (or scripts with SQL-like output) and appends their results to timeline TSV report files. If you have crons on any of the analytics data machines harvesting data, you can likely benefit from using reportupdater. Think of it as a more "robust" way to execute cron jobs.

Some datasets generated by reportupdater: https://datasets.wikimedia.org/limn-public-data/metrics/

Some of its features are:

  • Detects which data points are missing in the reports and executes the corresponding queries to update those missing values.
  • Supports different report granularities: daily, weekly and monthly.
  • Allows parametrizing the query (or script) to generate several reports, one for each value of the parameter.
  • Generates consistent and comprehensive logs with different levels.
  • Is robust against interrupted or parallel execution, so report files don't get corrupted.
  • Collects data from the MediaWiki databases, the EventLogging database and the Hive cluster.

Download

Get the code from Gerrit; it is also mirrored to Wikimedia's GitHub.

Dependencies

You need Python 2.7 to run reportupdater. To install dependencies, cd into reportupdater's root directory and execute pip install -r requirements.txt.
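For example, assuming you have already cloned the repository into a folder named reportupdater:

cd reportupdater
pip install -r requirements.txt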

Queries and scripts

You should write a set of either SQL queries or executable scripts (or both combined) to be run by reportupdater. Regardless of which you choose, you must write code that returns a single data point; reportupdater will then execute it for each specified date interval to build the whole timeline report. There are a couple of conventions that your code should follow:

SQL Query conventions

  1. Use the placeholders {from_timestamp} and {to_timestamp} to timebound the results, for example: WHERE timestamp >= '{from_timestamp}' AND timestamp < '{to_timestamp}' (note that from_timestamp should be inclusive, while to_timestamp should be exclusive). Their format is YYYYMMDD000000.
  2. The first column of the results must be DATE('{from_timestamp}') AS date. This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  3. There is no restriction on the number, format or contents of the subsequent columns.
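For example, a minimal query following these conventions could look like this (the table and column names are made up):

SELECT
    DATE('{from_timestamp}') AS date,
    COUNT(*) AS edits
FROM edit_events
WHERE timestamp >= '{from_timestamp}'
    AND timestamp < '{to_timestamp}';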

Script conventions

  1. The first 2 parameters passed to the script are start_date and end_date; their format is YYYY-MM-DD. Use them to timebound the results (note that start_date should be inclusive, while end_date should be exclusive).
  2. The output of the report must be in TSV format, and must contain a first row with a header.
  3. The first column must be equal to the start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  4. There is no restriction on the number, format or contents of the subsequent columns.
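For example, a minimal script following these conventions could look like this (any executable language works; Python is used here just as an illustration, and the metric computation is a placeholder):

#!/usr/bin/env python
# Hypothetical reportupdater script: prints a single data point as TSV.
import sys

# The first two arguments are the date bounds, format YYYY-MM-DD.
start_date = sys.argv[1]  # inclusive
end_date = sys.argv[2]    # exclusive

# Compute your metric for [start_date, end_date) here.
value = 42  # placeholder

# TSV output with a header row; the first column equals start_date.
print('date\tvalue')
print('{0}\t{1}'.format(start_date, value))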

Where to put these files?

You should put all queries and scripts inside the same dedicated directory. When executing reportupdater you'll pass that directory as an argument and reportupdater will know where to find your code.
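For example, a hypothetical query folder could look like this (file names are made up):

metrics/
    config.yaml
    daily_edits.sql
    active_editors.sql
    browser_stats.sh

config.yaml can also live elsewhere; see the --config-path option in the How to execute? section below.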


Sample queries executed by other teams

Against the EventLogging database: https://github.com/wikimedia/analytics-discovery-stats

Against the wiki databases: https://github.com/wikimedia/analytics-limn-ee-data

Config file

You also need to write a YAML config file that will tell reportupdater how to run the reports. There are four main sections in the config file:

The databases section

It specifies which databases to connect to. For each database, it should include the host, the port, the credentials file and the database name, like this:

databases:
    anyDBId:
        host: some.sql.host
        port: 3306
        creds_file: /some/path/to/.my.cnf
        db: somedb
    anyOtherDBId:
        host: someother.sql.host
        port: 3306
        creds_file: /someother/path/to/.my.cnf
        db: someotherdb
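The creds_file is typically a standard MySQL option file (like .my.cnf); assuming that format, a minimal sketch of its contents (user and password are placeholders) would be:

[client]
user = some_user
password = some_password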

If you're not using SQL queries, you can skip this section.

The graphite section

It specifies which graphite server to send stats to; it should include the host and the port. It can optionally include a dictionary of lookups, where each key is an identifier for a value in a metric execution and each value is the path of a file to load as that lookup. For example, if your metric explodes by wiki, you can provide a lookup for the identifier "wiki" that translates the wiki database name, like enwiki, to the wiki hostname, like en.wikipedia. Similarly, if you'd like to translate values from a column in your report, you can provide a lookup for "column1", for example.

graphite:
    host: some.graphite.host
    port: 1234
    lookups:
        wiki: /some/path/to/wiki-lookup.yaml
        column1: /some/path/to/column1-lookup.yaml
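Each lookup file maps original values to the values to use in the graphite path. Assuming a plain YAML mapping (the exact format is not documented here), a wiki-lookup.yaml like the one above could look like:

enwiki: en.wikipedia
dewiki: de.wikipedia
eswiki: es.wikipedia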

If you're not sending stats to graphite, you can skip this section.

The reports section

It describes the reports to be executed by reportupdater. For each report, there are three mandatory fields: the report id, the granularity and the start date. There are also some optional fields you can specify, like this:

reports:
    anyReportId:
        granularity: days
        starts: 2016-01-01
    anyOtherReportId:
        granularity: months
        starts: 2014-06-15
        type: script
        lag: 10800
    yetAnotherReportId:
        granularity: weeks
        starts: 2000-01-01
        max_data_points: 10
        explode_by:
            dimension1: valueA, valueB, valueC
            dimension2: valueK, valueL
    graphiteReportId:
        granularity: days
        funnel: true
        explode_by:
            wiki: enwiki, dewiki, eswiki
        graphite:
            path: "{_metric}.{column0}.{wiki}"
            metrics:
                "graphite.metric.name.prefix1": column1ToSendAsValue
                "graphite.metric.name.prefix2": column2ToSendAsValue

The report id 
It must match the name of the custom executable file. For example, if you're using a SQL query named random_report.sql, then the report id must be random_report (without .sql). If you're using a script, then name it the same as the report id. The report id also dictates the name of the generated report: if you use the id random_report, the resulting report will be named random_report.tsv.
granularity 
days, weeks or months. Depending on this parameter, reportupdater will execute your queries/scripts every day, week or month, passing the corresponding date range to them. Mandatory.
starts 
Indicates the first date to be computed (inclusive). Its format should be YYYY-MM-DD. Mandatory.
type 
sql or script. If sql, reportupdater will interpret your custom executable as SQL code. If script, it will execute your custom code via the command line. Optional. Default: sql.
lag 
Positive integer. The time to wait (in seconds) to start the execution of the query/script after the desired time interval has finished. It can be useful to wait until the source data is loaded into the database. Optional. Default: 0.
db 
String. Should be one of the database identifiers specified in the databases section. Indicates which database reportupdater should connect to in order to execute the query. This only applies to sql reports; if your report is of type script, this option is ignored. Optional. Default: see the defaults section.
funnel 
true or false. If true, the program will assume that the results of the query/script contain more than one line. If false, it will assume one-line results. This option should be named multiline instead of funnel, which is a terrible name. Optional. Default: false.
max_data_points 
Positive integer. The maximum number of data points the report will contain. Note that it depends on the granularity: if you have weekly granularity and set max_data_points to 10, your report will hold the data for the last 10 weeks. Optional. If not set, the report will hold all data points since the specified starts date.
execute 
String. You may want to have two or more reports using the same query or script. You can do so by assigning a query or script name to the execute field of a report. Optional. If it is not set, reportupdater will execute the query/script that matches the report identifier.
explode_by 
Dictionary<string: list> (in YAML format). If you want to have N identical reports, one for each section of your product, you can configure that using explode_by. For example: wiki: enwiki, dewiki, jawiki, frwiki would generate 4 reports, one for each wiki. Note that you would have to include the {wiki} placeholder in your SQL query to slice its results. If you're using a script, all explode_by values will be passed as arguments to the script (right after the date arguments) in alphabetical order. You can also specify a file path instead of the value list; if you do so, reportupdater will read that file to get the value list, one value per line (see the example after this list).
graphite 
Dictionary. If you want to send your report results to graphite, you must configure the metric names and values that will be sent. In this section, you define two things. First, a path template string: a Python string template that will be formatted with values from the report row itself (keyed by column name), values from the explode_by section (keyed by the explode_by key), and the special {_metric} described next. Second, a dictionary of metrics to send: each key will be substituted into the path string where you use {_metric}, and each value is the column to use for the actual value to send to graphite. For a complete example using graphite, see https://gerrit.wikimedia.org/r/#/c/322007/6/reports/config.yaml.
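For example, the following hypothetical report reuses an existing query named edits and explodes by wiki, reading the wiki list from a file (all names and paths here are made up):

reports:
    mobile_edits:
        granularity: days
        starts: 2016-01-01
        execute: edits
        explode_by:
            wiki: /some/path/to/wiki-list.txt

And wiki-list.txt would contain one value per line:

enwiki
dewiki
eswiki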

The defaults section

The values that you assign here will apply to all reports that don't override them. For now, only the field db is supported.

defaults:
    db: anyDBId

How to execute?

Use the update_reports.py executable in reportupdater's root directory.

usage: update_reports.py [-h] [--config-path CONFIG_PATH]
                         [--wikis_path WIKIS_PATH] [-l LOG_LEVEL]
                         query_folder output_folder

Periodically execute SQL queries or scripts and write/update the results into
TSV files.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  output_folder         Folder to write the TSV files to.

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
                        <query_folder>/config.yaml.
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        (debug|info|warning|error|critical)

You'll also find it useful to create a cron job that executes reportupdater periodically, for example every hour.
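For example, a crontab entry like this (all paths are hypothetical) would run reportupdater at the start of every hour:

0 * * * * python /srv/reportupdater/update_reports.py /srv/reportupdater/queries /srv/reportupdater/output -l info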


WMF-specific

How to test?

To test your queries against the real database, follow the instructions in test/locally/README.md.

How to productionize?

  1. Add your queries/scripts and config file to a new folder within this repo, and ask someone in the Analytics team to review.
  2. Ask the Analytics team to puppetize the execution of your reportupdater instance, or create a puppet patch yourself adding your reportupdater job next to the existing ones, and ask someone in the Analytics team to review.

Troubleshooting

Where are the logs?

For browser scripts: on stat1002, at /a/reportupdater

For multimedia data, on stat1003 (not stat1002):

/srv/reportupdater/log/limn-multimedia-data-multimedia.log

Where is puppet code?

https://github.com/wikimedia/puppet/tree/production/modules/reportupdater

https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/statistics/cruncher.pp

Support

reportupdater is maintained by WMF's Analytics team.

Next features

  • It would be cool to have a rerun_reports.py script to easily rerun reports for a given time interval.
  • Today, queries and scripts have to output the date as their first column. However, this is not strictly necessary and the requirement could be removed.
  • Also, the is_funnel configuration option could be removed; reportupdater could automatically detect whether a query returns one or multiple rows and handle each case properly.
  • Make all report config fields defaultable in the defaults section, like db.