Analytics/Systems/Superset

From Wikitech
Jump to navigation Jump to search

Apache Superset enables visualizations and dashboards built from various analytics data sources.

Our primary Superset instance is v1.5.3 and can be found at superset.wikimedia.org.

We also run a staging instance which is located at superset-next.wikimedia.org - This is used to test new versions of and features of Superset before they are promoted to the primary instance.

Like Turnilo, it provides access to various Druid tables, as well as data in Hive (and elsewhere) via Presto.

Access

To access Superset, you need wmf or nda LDAP access. For more details, see Analytics/Data access#LDAP access. Once done, you can log in using your Wikitech username and password.

Usage notes

  • The "Druid Datasources" list shows ingested tables that are available for querying. As of October 2018, this includes e.g. daily and hourly pageviews data (the daily version is only updates once a month, but goes further back), a sampled excerpt of webrequest data, unique devices, and a few select EventLogging schemas. If a recently created Druid datasource is not yet visible in the list, try clicking "Scan New Datasouces".
  • NULL values don't show up properly in the values selection dropdown list for filters (i.e. one can't use that dropdown to exclude NULL values from a chart or limit it to NULL values). But one can use the regex option instead: Type in ".+" (without the quotes), and accept the offer to create that as an option.
  • By default, always use predefined SUM metrics when available. When choosing a metric then picking the SUM aggregation function, the aggregation is managed by superset and uses the floatSum operator. This operator uses 32 bits floats instead of 64 bits longs or double, leading to inaccuracies. Usually predefined SUM(...) metrics are available and should be used, as they are manually defined using doubleSum or longSum 64 bits operators.
  • If you build a chart based on a table with structs, you won't be able to access the fields of the struct because Superset recognizes the struct as a single string column. The workaround is to add a computed column with struct.field as the SQL expression.
  • Superset expects time columns to be in SQL timestamp string format (2021-01-01 00:00:00) and has trouble with columns in ISO 8601 string format (2021-01-01T00:00:00Z). To fix this, it's best to create a computed column that casts the time to the right format using an SQL expression like (CAST(TO_ISO8601_TIMESTAMP(dt) AS VARCHAR).
  • When creating filters for the charts on your dashboard, please do not create Filter Box charts. Instead, use Native filters available on the Dashboard Filter sidebar. For detailed information see here https://docs.preset.io/docs/dashboard-filtering

Advanced time ranges

Superset accepts a lot of different things in advanced time ranges boxes, including SQL time functions and any human-language times accepted by the Parsedatetime library.Unfortunately, neither Superset nor Parsedatetime provides any documentation of this, so there's no actual list of all the possibilities. But when you put things in, Superset does at least give you immediate feedback about whether they're valid and what they're interpreted as. You can also use a variety of helpful date functions such as DATETRUNC and DATEADD.

Here are some useful examples:

desired range start time value end time value
the last 26 full Monday-Sunday weeks

(useful for a weekly graph covering the last half year)

this Monday 26 weeks ago this Monday
up to yesterday

(useful when working with hourly data at daily granularity and today is incomplete)

DATETRUNC(DATETIME("yesterday"), day)orDATETRUNC(DATEADD(DATETIME("now"), -1, day), day)

Druid datasources vs Druid tables

We should stop adding datasources to the "Druid Datasources" list if possible in favor of Druid tables (backed by Druid SQLAlchemy mappings). In order to add a Druid table, do the following:

  • Go to Sources -> Tables
  • Hit the + sign to add one
  • Use Druid Analytics SQL as database, druid as schema and the Druid datasource name as table name (like edits_hourly, wmf_netflow, etc..)

Migrate a chart to Druid tables

It is easy to spot if a chart is not using a Druid table (but an old Druid datasource) simply checking the chart list:

Example of old Druid datasource: the name pageviews_daily is not prepended with the prefix 'druid.'

In the above example the "pageviews_daily" datasource is not prepended with "druid." (like druid.pageviews_daily) so it is an old Druid datasource. If you want to move the chart to the Druid table, hit the edit button for it and switch the datasource name (it will likely change some default settings so make sure that the chart is correctly visualized) and then save it. You dashboards will be updated automatically!

If the druid.your-datasource-name is not present yet, please check the above section, since it may need to be defined via Druid -> Tables menu'.

SQL Lab

Superset allows user to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab. Multiple databases are available to query:

  • presto_analytics_hive to explore Data Lake data using Presto
  • Druid Analytics SQL to explore Druid data cubes (note: the full power of SQL is not available for Druid data)
  • mysql wikishared : contains data from MariaDB database. It has data from the x1 cluster.
  • mysql_staging: a staging database created for analysts to dump data from various wiki projects. This isn't used anymore.
  • Note: SQLLab doesn't have all the MariaDB replicas integrated yet.
  • etc..

Administration

Please see Superset/Administration for more information.

See also