Analytics/Data Lake/ORES

From Wikitech

These tables contain ORES scores for MediaWiki revisions and pages.

Datasets

Jobs

Data is transformed and imported into these tables in several steps.

  • Import recent ORES revision scores.
  • Backfill old revisions so that we have a complete set of scores.
  • Join scores with historified context.
  • Produce monthly "current" dumps, which use the most recent available model versions.
  • Produce monthly "historical" dumps, which include all available scores from any model version.
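
The difference between the two monthly dumps in the list above can be sketched as follows. This is only an illustration, not the actual job code: the `Score` record, its field names, the dotted version format, and the reading of "most recent available" as "highest model version on record for each revision and model" are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Score(NamedTuple):
    # Hypothetical record layout; the real table schema may differ.
    rev_id: int
    model: str
    model_version: str
    probability: float


def version_key(version: str) -> Tuple[int, ...]:
    # Assumes dotted numeric versions such as "0.5.0", compared numerically.
    return tuple(int(part) for part in version.split("."))


def historical_dump(scores: Iterable[Score]) -> List[Score]:
    # "Historical": keep every score, whatever model version produced it.
    return sorted(
        scores,
        key=lambda s: (s.rev_id, s.model, version_key(s.model_version)),
    )


def current_dump(scores: Iterable[Score]) -> List[Score]:
    # "Current": for each (revision, model) pair, keep only the score from
    # the highest model version that is available for that pair.
    latest = {}
    for s in scores:
        key = (s.rev_id, s.model)
        best = latest.get(key)
        if best is None or version_key(s.model_version) > version_key(best.model_version):
            latest[key] = s
    return sorted(latest.values(), key=lambda s: (s.rev_id, s.model))


# Made-up sample data for illustration only.
SAMPLE = [
    Score(100, "damaging", "0.4.0", 0.12),
    Score(100, "damaging", "0.5.0", 0.10),
    Score(101, "damaging", "0.4.0", 0.80),
]
```

With this sample, `current_dump(SAMPLE)` keeps one row per revision, and revision 101 still carries a score from the older version because no newer one exists for it.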

Open questions and concerns

Mixed model_versions: We can't calculate scores with an old model version once a newer one has been deployed, which is a problem for backfilling. Our current workaround is to backfill using whichever model version happens to be current at the time. For the same reason, even the "current" dump file will include heterogeneous scores from different model versions, and clients will have to take this into account. In the future, we might be able to run older models using Spark and backfill completely.
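
One way a client might take mixed model versions into account is to check which versions are present and then restrict its analysis to a single one. The sketch below assumes each row is a dict with `model` and `model_version` keys; these names and the sample values are hypothetical and may not match the real dump format.

```python
from collections import Counter
from typing import Dict, Iterable, List


def version_counts(rows: Iterable[Dict], model: str) -> Counter:
    # Count how many scores each model version contributed for one model.
    return Counter(r["model_version"] for r in rows if r["model"] == model)


def filter_to_version(rows: Iterable[Dict], model: str, version: str) -> List[Dict]:
    # Keep only scores produced by one specific version of one model, so
    # that the remaining scores are comparable with each other.
    return [
        r for r in rows
        if r["model"] == model and r["model_version"] == version
    ]


# Made-up rows for illustration only.
ROWS = [
    {"rev_id": 100, "model": "damaging", "model_version": "0.5.0", "probability": 0.10},
    {"rev_id": 101, "model": "damaging", "model_version": "0.4.0", "probability": 0.80},
    {"rev_id": 102, "model": "damaging", "model_version": "0.5.0", "probability": 0.33},
]

counts = version_counts(ROWS, "damaging")
comparable = filter_to_version(ROWS, "damaging", "0.5.0")
```

Filtering discards scores from other versions, so a client that needs full coverage would instead have to treat `model_version` as an explicit dimension in its analysis.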