Data Platform/Data Lake/Data Issues

From Wikitech

We recommend the following approaches for excluding or annotating data that contains known data quality issues:

  • Use date filters to exclude the affected time period from analysis.
  • For time series visualizations:
    • Visually block out the period of the data loss and add an annotation with a summary of the problem and its start and end dates. For example:
      Wikistats pageviews time series graph with data loss period visually blocked
      Between June 2021 and January 2022, pageview data was underreported because caching nodes in the US data centers had stopped collecting traffic data. For more details, see the /2021-06-04 Traffic Data Loss report on Wikitech. Time series graph from Wikistats.
    • Use overlays to annotate the data. Superset users can create an annotation layer and reuse it. For example, the /2021-06-04 Traffic Data Loss has an annotation layer named “Pageview Data Loss June 2021-January 2022”:
      Time series graph of pageviews from Superset, with data loss period annotated using Superset annotation layers.
      Between June 2021 and January 2022, pageview data was underreported because caching nodes in the US data centers had stopped collecting traffic data. For more details, see the /2021-06-04 Traffic Data Loss report on Wikitech. Time series graph from Superset, showing the annotation layer on mouseover.
    • For point-in-time issues, use a data point annotation.
  • When it is not feasible to remove data from an existing report or dashboard, add an annotation or footnote describing the impact of the data issue.
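
The first and last recommendations above (excluding the affected period, or keeping the data and attaching a note) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the sample rows, and the exact start and end dates are assumptions made for the example, not values taken from the incident report, so look up the real bounds there before reusing it.

```python
from datetime import date

# Assumed bounds for illustration only; take the real dates from the incident report.
AFFECTED_START = date(2021, 6, 1)
AFFECTED_END = date(2022, 1, 31)


def in_affected_period(day, start=AFFECTED_START, end=AFFECTED_END):
    """Return True when `day` falls inside the affected period (inclusive)."""
    return start <= day <= end


def exclude_affected(rows, start=AFFECTED_START, end=AFFECTED_END):
    """Drop every (day, value) row that falls inside the affected period."""
    return [
        (day, value)
        for day, value in rows
        if not in_affected_period(day, start, end)
    ]


def annotate_affected(rows, note, start=AFFECTED_START, end=AFFECTED_END):
    """Keep all rows, attaching `note` to affected rows and None to the rest."""
    return [
        (day, value, note if in_affected_period(day, start, end) else None)
        for day, value in rows
    ]


# Hypothetical monthly pageview counts, made up for this example.
rows = [
    (date(2021, 5, 1), 100),
    (date(2021, 7, 1), 80),
    (date(2022, 2, 1), 105),
]

clean = exclude_affected(rows)
flagged = annotate_affected(rows, "Underreported: traffic data loss")
```

Here `clean` keeps only the rows outside the affected period, while `flagged` keeps every row and carries the note on the affected ones, which suits reports where removing data is not feasible.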