Data Engineering/Tutorials/Dashboards

NOTE: Dashiki hasn't been maintained in a while, and therefore only builds with Node version 6.17.1. We are evaluating whether to deprecate or refresh it going forward.

This page describes how to easily build dashboards that visualize statistics on WMF's data sources using Analytics' pipeline and tools.

Overview

Supported data sources

Analytics' dashboarding pipeline supports all data that can be accessed through stat1007.eqiad.wmnet and stat1006.eqiad.wmnet. Examples include:

  • MediaWiki database replicas
  • EventLogging database replica
  • Hive tables on top of pageview, search, edit, and other data

Supported layouts and visualizations

Different dashboard layouts allow you to present different types of data. Available layouts are:

  • The tabs layout lets you organize your visualizations in a tab navigation menu. See: User agent breakdowns
  • The metrics-by-project layout addresses the problem of showing stats on multiple projects and languages on the same visualization. See: Vital signs
  • The compare layout can help you if your metrics are split into two sets that you want to compare, for example: Wikitext editor vs. VisualEditor. See: Edit analysis

These layouts can be customized to use the available visualizations:

  • Line chart
  • Tabular text data
  • Sunburst chart
  • Stacked bar chart

New layouts and visualizations can be added fairly quickly. The Analytics team will do this for you if necessary.

Necessary knowledge

You don't need to be a software developer to create dashboards using this pipeline. However, you will need basic knowledge of the following:

  • SQL to adapt your queries to the expected conventions
  • Git and Gerrit for downloading tools and submitting changes for review

Tools used

  • Reportupdater: a Python program that periodically executes custom SQL queries against given data sources and progressively builds TSV reports.
  • Dashiki: a serverless dashboarding application written in JavaScript that visualizes TSV reports (among other data sources) and is configured on wiki.

Work time and calendar time expectation

The time it takes to have a dashboard up and running depends on the number of visualizations that you want to include in your dashboard. Also, if this is the first time you use this pipeline, you'll have to spend some time reading this documentation, familiarizing yourself with the tools, creating a repository and deploying a dashboard instance. Finally, the process includes a code review by the Analytics team. As a rough approximation consider: 2-3h getting started + (1-2h work * #visualizations) + 1-2d waiting for Analytics + 1-2h testing and deploying.

Step by step

The following section explains in detail what you need to do to spin up a dashboard using the Analytics dashboarding pipeline. It assumes that you already have one or more SQL queries that work on one of the supported data sources. The steps can be divided into two blocks: generating reports and generating visualizations. If you already have TSV reports in analytics.wikimedia.org/datasets/, skip the Generating reports section and go directly to Generating visualizations.

Generating reports

Adapt your SQL queries to reportupdater's conventions

Reportupdater expects each query to return a single data point per time interval, so modify your queries to look like this:

SELECT
    DATE('{timestamp_from}') AS date,
    COUNT(*) AS metric_1,
    SUM(some_value) AS metric_2
FROM some_table
WHERE
    timestamp >= '{timestamp_from}' AND
    timestamp < '{timestamp_to}'
GROUP BY date;

Note that this query uses two reportupdater placeholders: {timestamp_from} and {timestamp_to}. Reportupdater will replace them with the timestamps of a given time interval. The first column the query selects should be named date and formatted as YYYY-MM-DD. So the final output of the query, for timestamp_from = 20160101000000 and timestamp_to = 20160102000000, will look like this:

date        metric_1    metric_2
2016-01-01  34          2532

This is likely to be the only adaptation you need. But you might also want to use reportupdater's explode by wiki feature. With this option enabled, reportupdater will execute your query for each wiki in a list and generate a separate report for each one of them. To enable explode by wiki, put a new placeholder named {wiki_db} in the FROM statement of your query, like this:

FROM {wiki_db}.some_table

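For example, a complete query combining the timestamp placeholders with {wiki_db} (a sketch based on the query above) could look like:

SELECT
    DATE('{timestamp_from}') AS date,
    COUNT(*) AS metric_1
FROM {wiki_db}.some_table
WHERE
    timestamp >= '{timestamp_from}' AND
    timestamp < '{timestamp_to}'
GROUP BY date;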
If you want to dig deeper into all reportupdater features and options, please take a look at the full reportupdater documentation.

Write the configuration for your reports

Create a folder and put all your queries in it. Also, create a new file in it named config.yaml. The folder structure should look like:

query_folder
    |__ query_1.sql
    |__ query_2.sql
    ...
    |__ query_N.sql
    \__ config.yaml

The config.yaml file will contain the pieces of information reportupdater needs to execute the queries and create the reports. It has 3 sections: databases, defaults and reports. This is an example of a config.yaml file:

databases:
    el:
        host: "analytics-store.eqiad.wmnet"
        port: 3306
        creds_file: /a/.my.cnf.research
        db: log

defaults:
    db: el

reports:
    query_1:
        granularity: days
        starts: 2015-04-01
    query_2:
        granularity: weeks
        starts: 2016-02-01
        explode_by:
            wiki_db: enwiki, dewiki, frwiki

In the databases section, specify the host and port you want to connect to, point to the credentials file, and give a default database. In the defaults section, specify that the default db is the one you just configured. And in the reports section, write a config block for each query you want to run. For each report, you have to specify the granularity and start date. If you are using the explode by wiki feature, indicate which wiki databases you want reportupdater to execute your queries on. Please look at the full reportupdater documentation for more detail.
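With the example configuration above, reportupdater generates one TSV report per query, and one file per wiki for exploded queries. The output would be structured roughly like this (a sketch; the exact output location depends on how the job is deployed):

output_folder
    |__ query_1.tsv
    \__ query_2
        |__ enwiki.tsv
        |__ dewiki.tsv
        \__ frwiki.tsv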

Test locally

A great way to know if your queries and configuration will work in production is testing the generation of your TSV report files on your local machine. To do that, you have to:

  1. Clone reportupdater repository.
  2. In a separate terminal, run ssh -L 3307:analytics-store.eqiad.wmnet:3306 stat1006.eqiad.wmnet to create an ssh tunnel to the database host. Replace analytics-store.eqiad.wmnet with your host if necessary.
  3. Create a credentials file as explained in the README and put it into the reportupdater/test/locally/ folder. Warning: never push the credentials file to Gerrit!
  4. Modify the database section in your config.yaml file to point to localhost:3307, and to point to the credentials file you just created. You can copy the example shown in reportupdater/test/locally/config_example.yaml. Don't forget to revert that when you're finished testing.
  5. Run reportupdater with python update_reports.py /path/to/your/query/folder/ /tmp/output -l info.

You should see reportupdater running and outputting info logs. If everything goes well, the report files will be written to /tmp/output. Note that this test does not work with queries to the Hive cluster! If this feels too complicated, please let the Analytics team know and we'll help you :-).
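Putting the steps together, a typical test session could look like this (a sketch; the clone URL and paths are assumptions, adapt them to your setup):

# Terminal 1: open an SSH tunnel to the database host.
ssh -L 3307:analytics-store.eqiad.wmnet:3306 stat1006.eqiad.wmnet

# Terminal 2: clone reportupdater and run it against your query folder.
git clone https://gerrit.wikimedia.org/r/analytics/reportupdater
cd reportupdater
python update_reports.py /path/to/your/query/folder/ /tmp/output -l info

# Inspect the generated TSV reports.
ls /tmp/output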

Deploy reportupdater job

Once reportupdater is able to execute your queries, you can submit a change to Gerrit. Add your folder containing the queries and config to the root level of the reportupdater-queries repository. If you already have a separate query repository or want to use your own new repository, that's OK. In any case, add someone from Analytics as a reviewer of your patch and ping us. We will review the code and merge it, and we'll also create a trigger in Puppet to launch reportupdater for you. If everything goes well, after a couple of days you should see your report files made public in analytics.wikimedia.org/datasets/periodic/reports.
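If you are new to Gerrit, submitting the patch looks roughly like this (a sketch assuming the reportupdater-queries clone URL and the standard git-review workflow):

git clone https://gerrit.wikimedia.org/r/analytics/reportupdater-queries
cd reportupdater-queries
cp -r /path/to/your/query_folder .
git add query_folder
git commit -m "Add queries and config for my new reports"
git review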

Generating visualizations

Write the configuration for your dashboard

Dashiki dashboards are configured on wiki, so you'll need to create a wiki page on meta.wikimedia.org and fill in some lines of JSON configuration. You should name the page Config:Dashiki:NameOfYourDashboard. Depending on which Dashiki layout you choose, your configuration will be different.

Config for the tabs layout

The most generic Dashiki layout is the tabs layout. With it you can organize your visualizations under a tabs navigation menu. It looks like this: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os

To use it, write a JSON config following this example:

{
    "title": "Dashboard Title",
    "subtitle": "Dashboard Subtitle",
    "desc": "A description of your dashboard",
    "tabs": [
        {
            "title": "Tab 1 Title",
            "dataRange": {
                "startDate": "2015-06-01" // graphs will show data from this day on
            },
            "graphs": [
                {
                    "title": "Graph 1 title",
                    "type": "dygraphs-timeseries", // visualizer name
                    "path": "path/to/report/file1.tsv", // relative to analytics.wikimedia.org/datasets/periodic
                    "format": "percent" // number format
                }, {
                    "title": "Graph 2 title",
                    "type": "hierarchy",
                    "path": "path/to/report/file2.tsv",
                    "pivot": {
                        "dimension": "column to pivot by",
                        "metric": "column to aggregate (if needed) when pivoting"
                    }
                },
                ...
            ]
        }, {
            "title": "Tab 2 title",
            "graphs": [
                ...
            ]
        },
        ...
    ]
}

You can find a full working example in the Sample/tabs dashboard config. Note that the // comments and ... in the JSON code above are just explanatory; remove them before saving, since JSON does not support comments.

In the graph type field you can specify one of dygraphs-timeseries, table-timeseries or hierarchy:

  • dygraphs-timeseries graphs your data as a line chart. Tip: each column in your report (except the first, which holds the date) will become a line in the chart.
  • table-timeseries graphs your data as a table.
  • hierarchy graphs your data as a sunburst hierarchical chart. For this one, your report file needs to have a cube-like structure with a header like: date dimension1 dimension2 ... dimensionN metric, where dimension values are categorical strings and metric is a number.

In the graph path field you have to write the path to the report file relative to https://analytics.wikimedia.org/datasets/periodic/reports/metrics/.
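For example, a report suitable for the hierarchy visualizer could look like this (hypothetical dimensions and values):

date        dimension1  dimension2  metric
2016-01-01  desktop     Chrome      1200
2016-01-01  desktop     Firefox     800
2016-01-01  mobile      Safari      950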

Config for the metrics-by-project layout

If you have a metric that you want to visualize across all Wikimedia projects (all 200+), this is the layout to use. It looks like this: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=ruwiki,frwiki/metrics=Pageviews

This Dashiki layout is very useful for reports that use the explode by wiki feature (see: Generating reports). It lets users select the wikis they want to add to the chart in a friendly way. To use it, you first have to add a bit of JSON config to an existing page on meta.wikimedia.org: Dashiki:CategorizedMetrics. Open it for editing and add a config block for your queries:

{
    "name": "Category your metrics belong to",
    "metrics": [
        {
            "definition": "https://meta.wikimedia.org/wiki/Some_page_that_explains_your_metric",
            "name": "A Readable Metric Name",
            "metric": "name_of_your_query_folder",
            "submetric": "name_of_your_query",
            "api": "datasets"
        },
        ...
    ]
}

Be careful not to write any invalid JSON, otherwise other dashboards may break. You can check that everything is OK by visiting this dashboard: if you see data, everything's fine; otherwise, please revert your change to Dashiki:CategorizedMetrics and try again. After that, you have to write the actual configuration for your dashboard in the Config:Dashiki:NameOfYourDashboard page, but this one is really short:

{
    "defaultProjects": [
        "enwiki", "dewiki", "frwiki"
    ],
    "defaultMetrics": [
        "Metric Name 1"
    ],
    "metrics": [
        "Metric Name 1", "Metric Name 2", "Metric Name 3"
    ]
}

Note that you can only list as many defaultProjects as there are wiki files in your reports folder. And the metric names should match those specified in the Dashiki:CategorizedMetrics page.

Config for the compare layout

The compare layout works well when you have two families of reports that share the same metrics, and you want to compare them. An example could be an A/B test, where you want to compare the metrics of both groups A and B. Dashiki expects the report files to be structured like this:

https://analytics.wikimedia.org/datasets/periodic/reports/metrics
    |__query_name_1
    |   |__group_name_1
    |   |   |__enwiki.tsv
    |   |   |__dewiki.tsv
    |   |   \__frwiki.tsv
    |   \__group_name_2
    |       |__enwiki.tsv
    |       |__dewiki.tsv
    |       \__frwiki.tsv
    |__query_name_2
    |   |__group_name_1
    |   |   |__enwiki.tsv
    ... ... ...
    

Note that this layout is also designed specifically for the explode by wiki feature! So write a configuration that follows this example:

{
    "a": "group_name_1",
    "b": "group_name_2",
    "startDate": "2015-04-01",
    "comparisons": [
        {
            "title": "Readable Metric Name 1",
            "type": "timeseries",
            "metric": "query_name_1",
            "desc": "Description of the chart 1"
        },
        {
            "title": "Readable Metric Name 2",
            "type": "timeseries",
            "metric": "query_name_2",
            "desc": "Description of the chart 2"
        },
        ...
    ]
}

Test locally

Once you have the configuration in place on meta.wikimedia.org, you can easily test the dashboard on your local machine (your desktop, no VM needed) following these steps:

  1. You need npm installed on your machine.
    1. In addition to npm, the two packages bower and gulp need to be installed and available from the command line.
  2. Clone Dashiki repository and cd into it.
  3. Execute npm install to setup the dependencies.
  4. Execute cd semantic && gulp build.
  5. Build the project with gulp --layout <LAYOUT> --config <CONFIG>, where <LAYOUT> is either tabs, metrics-by-project or compare, and <CONFIG> is the title of your config page on meta.wikimedia.org without the Config: prefix.
  6. Spin up a local file server, for example by executing python -m SimpleHTTPServer 5000 (or python3 -m http.server 5000) from the dist subdirectory that contains the output generated by gulp.
  7. Visit localhost:5000 in your browser to see your dashboard, hopefully full of colorful charts :-) (The whole sequence is consolidated in the sketch after this list.)
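Putting the steps together, a full local build could look like this (a sketch; the clone URL and the layout/config values are assumptions, replace them with your own):

git clone https://gerrit.wikimedia.org/r/analytics/dashiki
cd dashiki
npm install
cd semantic && gulp build && cd ..
gulp --layout tabs --config MyDashboardConfig
cd dist && python -m SimpleHTTPServer 5000
# Then visit http://localhost:5000 in your browser.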

Deploy your dashboard

Once you've tested that the dashboard works on your local machine, you can deploy it for other people to see. Add a block like this one to the end of the config.yaml file living in Dashiki's root directory:

name-of-your-dashboard:
    layout: tabs
    config: MyDashboardConfig
    hostname: name-of-your-dashboard-test.wmflabs.org

Then submit a patch to Dashiki's repository in Gerrit. The Analytics team will review, merge and deploy it for you!

Support

Even though this documentation is far from perfect, the Analytics team will be glad to help you configure your dashboards with this pipeline. For any question, feedback or comment on this process, please send an email to the Analytics mailing list or open a ticket on Phabricator, tagging it with the Analytics project. Thanks!