Analytics/Systems/Data Lake/Edits/Pipeline

From Wikitech

The pipeline providing Edits data in the Data Lake has five main steps:

  1. Loading - Import data from MySQL into Hadoop (one Hadoop job per MySQL table per wiki, so a lot of jobs!)
  2. User and page history reconstruction - Spark jobs rebuilding the user and page histories, which don't exist as-is in MySQL
  3. Revision augmentation and denormalization - Enhance historical revisions with interesting fields and join them with the user and page histories into a fully denormalized history table.
  4. Metrics computation - Compute some standard metrics from the denormalized table and store them in Hive.
  5. Serving layer loading - Currently, two years of data are loaded into Druid, with more to come!
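
To make the denormalization step (step 3) concrete, here is a minimal sketch in plain Python rather than Spark, using made-up field names for illustration - the real job operates on the full MediaWiki revision history and joins it against the user and page histories reconstructed in step 2:

```python
# Toy revision rows, as they might look after loading (step 1).
# All field names here are illustrative, not the actual schema.
revisions = [
    {"rev_id": 1, "user_id": 10, "page_id": 100, "timestamp": "2016-01-01"},
    {"rev_id": 2, "user_id": 10, "page_id": 101, "timestamp": "2016-02-01"},
]

# Simplified stand-ins for the reconstructed histories (step 2 output),
# keyed by user and page id.
user_history = {10: {"user_name": "ExampleUser"}}
page_history = {
    100: {"page_title": "Example_Page"},
    101: {"page_title": "Other_Page"},
}

def denormalize(revs, users, pages):
    """Join each revision with its user and page history fields,
    producing one wide (denormalized) row per revision."""
    out = []
    for rev in revs:
        row = dict(rev)  # copy the revision fields
        row.update(users.get(rev["user_id"], {}))  # attach user fields
        row.update(pages.get(rev["page_id"], {}))  # attach page fields
        out.append(row)
    return out

denormalized = denormalize(revisions, user_history, page_history)
```

Each output row then carries the revision, user, and page fields together, which is what makes computing metrics (step 4) a matter of simple aggregations over a single table.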