The pipeline providing Edits data in the Data Lake has five main steps:
- Loading - Import data from MySQL into Hadoop (one Hadoop job per MySQL table per wiki, so a lot of jobs!). A loading sketch is given after this list.
- User and page history reconstruction - Spark jobs rebuilding the user and page histories that do not exist as-is in MySQL (see the reconstruction sketch below).
- Revision augmentation and denormalization - Enhance historical revisions with interesting fields and join them with user and page history into a fully historified, denormalized table (see the denormalization sketch below).
- Metrics computation - Compute standard metrics from the denormalized table and store them in Hive (see the metrics sketch below).
- Serving layer loading - Currently, load 2 years of data into Druid, but more to come! (see the Druid sketch below).
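
To make the steps above more concrete, here are small, hedged sketches of each one, written as spark-shell snippets (a `spark` SparkSession is assumed to be in scope). First, loading: the pipeline runs one Hadoop job per MySQL table per wiki; the snippet below shows the same idea with Spark's JDBC reader. The host, credentials, table list, and output paths are illustrative assumptions, not the actual configuration.

```scala
// Spark-shell sketch: one import per MySQL table per wiki.
// All names below (host, user, paths) are assumptions for illustration.
val wikis  = Seq("enwiki", "frwiki")
val tables = Seq("revision", "page", "user", "logging")

for (wiki <- wikis; table <- tables) {
  spark.read
    .format("jdbc")
    .option("url", s"jdbc:mysql://db-host/$wiki")          // hypothetical host
    .option("dbtable", table)
    .option("user", "analytics")                           // hypothetical credentials
    .option("password", sys.env.getOrElse("DB_PASS", ""))
    .load()
    .write
    .mode("overwrite")
    .parquet(s"/wmf/data/raw/mediawiki/$wiki/$table")      // hypothetical output path
}
```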
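
User and page history reconstruction is the most involved step. As a rough illustration only, the sketch below rebuilds a per-page history of states from ordered events, turning each event into a [start, end) interval with a window function. The input path, column names, and the idea of deriving states directly from single events are simplifying assumptions; the real Spark jobs are considerably more elaborate.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Simplified page events (path and column names are assumptions).
val pageEvents = spark.read.parquet("/wmf/data/raw/mediawiki/enwiki/logging")
  .where(col("log_type") === "move")
  .select(
    col("log_page").as("page_id"),
    col("log_timestamp").as("start_ts"),
    col("log_title").as("page_title"))

// Each reconstructed state lasts until the next event for the same page;
// a null end_ts marks the current state.
val byPage = Window.partitionBy("page_id").orderBy("start_ts")
val pageHistory = pageEvents.withColumn("end_ts", lead(col("start_ts"), 1).over(byPage))

pageHistory.write.mode("overwrite").parquet("/wmf/data/wmf/page_history_sketch")
```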
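
Revision augmentation and denormalization then joins each revision to the page and user states that were valid at the revision's timestamp. The interval join below is a minimal sketch of that idea; paths and column names are assumptions.

```scala
import org.apache.spark.sql.functions._

val revisions   = spark.read.parquet("/wmf/data/raw/mediawiki/enwiki/revision")
val pageHistory = spark.read.parquet("/wmf/data/wmf/page_history_sketch")
val userHistory = spark.read.parquet("/wmf/data/wmf/user_history_sketch")

// Attach the page state valid at the time of each revision
// (start_ts <= rev_timestamp < end_ts, a null end_ts meaning "still current").
val withPage = revisions.join(
  pageHistory,
  revisions("rev_page") === pageHistory("page_id") &&
    revisions("rev_timestamp") >= pageHistory("start_ts") &&
    (pageHistory("end_ts").isNull || revisions("rev_timestamp") < pageHistory("end_ts")),
  "left_outer")

// Same interval join for the editing user.
val denormalized = withPage.join(
  userHistory,
  withPage("rev_user") === userHistory("user_id") &&
    withPage("rev_timestamp") >= userHistory("user_start_ts") &&
    (userHistory("user_end_ts").isNull || withPage("rev_timestamp") < userHistory("user_end_ts")),
  "left_outer")

denormalized.write.mode("overwrite").parquet("/wmf/data/wmf/mediawiki_history_sketch")
```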
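
Metrics computation reads the denormalized table and aggregates it into standard metrics stored in Hive. As a hedged example, monthly edit counts per wiki (the table and column names, and the assumption that rev_timestamp is a proper timestamp column, are illustrative):

```scala
import org.apache.spark.sql.functions._

val history = spark.read.parquet("/wmf/data/wmf/mediawiki_history_sketch")

// One simple standard metric: edits per wiki per month
// (assumes rev_timestamp is a timestamp and wiki_db identifies the wiki).
val monthlyEdits = history
  .groupBy(col("wiki_db"), date_format(col("rev_timestamp"), "yyyy-MM").as("month"))
  .agg(count(lit(1)).as("edits"))

// Store the result as a Hive table (table name is an assumption).
monthlyEdits.write.mode("overwrite").saveAsTable("wmf.edits_monthly_sketch")
```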
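
Finally, serving layer loading currently pushes 2 years of data into Druid. Druid ingestion itself is driven by an indexing-task spec rather than Spark code, so the sketch below only shows the Spark side of preparing such an extract; the cut-off logic and output path are assumptions.

```scala
import org.apache.spark.sql.functions._

val history = spark.read.parquet("/wmf/data/wmf/mediawiki_history_sketch")

// Keep only the last 2 years of events, matching the current serving-layer scope
// (assumes rev_timestamp is a timestamp column).
val lastTwoYears = history.where(col("rev_timestamp") >= add_months(current_date(), -24))

// Write a flat JSON extract that a separate Druid indexing task could ingest.
lastTwoYears.write.mode("overwrite").json("/wmf/tmp/druid/edits_last_2_years")
```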