...🚂🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃
Analytics deployment train
☑️ Only add changes here that have already been merged.
☑️ Link the task and the Gerrit patch.
☑️ List the systems that need deploying, jar versions that need bumping, and jobs that need restarting, if any.
Extra points if you include what to run and where to run it (e.g. stat1007, an-coord1001...).
☑️ Do you have a way of checking the deployment has been successful?
☑️ Don't move stuff to "ready to deploy" in the kanban unless it's documented here.
☑️ Check Data_Engineering/Ops_week#The_Data_Engineering_deployment_train_🚂 for a pointer about Wikistats, as well as links for various types of deployments.
☑️ To see the old log, go to https://etherpad.wikimedia.org/p/analytics-weekly-train/timeslider#59747.
Now use the log below. Eventually we could have some sub-pages or templates to streamline this.
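As one possible starting point for such a template, here is a minimal Python sketch that renders an entry in the layout used in the log below and flags items that lack a patch or merge request link. The `TrainEntry` class and its methods are hypothetical and not part of any existing tooling; the sample data is taken from the 2026-01-21 entry.

```python
import re
from dataclasses import dataclass, field

# Hypothetical helper: nothing here is existing Wikimedia tooling.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class TrainEntry:
    date: str      # e.g. "2026-01-21"
    deployer: str  # name(s) of whoever runs the train
    # Maps a section name ("Refinery", "Airflow", ...) to its list of items.
    sections: dict[str, list[str]] = field(default_factory=dict)

    def missing_links(self) -> list[str]:
        """Return the items that contain no URL at all."""
        return [
            item
            for items in self.sections.values()
            for item in items
            if not URL_RE.search(item)
        ]

    def render(self) -> str:
        """Render the entry as date, deployer, then one bullet list per section."""
        lines = [self.date, f"Deployer: {self.deployer}"]
        for name, items in self.sections.items():
            lines.append(f"{name}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


entry = TrainEntry(
    date="2026-01-21",
    deployer="Antonio/Joseph",
    sections={
        "Refinery": [
            "Update pingback HQL code for new PHP and MediaWiki versions "
            "https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1222506",
            "Update pageview allowlist",
        ]
    },
)
print(entry.render())
print("Items without a link:", entry.missing_links())
```

The link check only looks for any URL; it does not verify that the patch has been merged, which still has to be confirmed by hand.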
YYYY-MM-DD NEXT TUESDAY TRAIN (REPLACE THIS AFTER DEPLOY)
NEXT TRAIN (Thursday, March 31, 2026)
Refinery:
- move hql script from fundraising to fr_tech | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1260793
Thursday, March 25, 2026
Deployer: Aisha and Sandra
Refinery:
- Add abstract.wikipedia to pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1256413
- Changes to mapper-weight for centralauth_localuser | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1256302
- Move bot detection pipeline into new repo | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1237928
Thursday, March 10, 2026
Deployer: mforns
Refinery:
- Add kai.wikipedia to the pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1249328 DONE (sync'ed by hand)
Airflow:
- Artifact cleaning: remove outdated refinery job artifacts | https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2030#d6baf582888b568d8a7bcb95316bd03cbefa9853 | Note this change might cause some backward compatibility issues and we would need to monitor the DAGs closely after deployment.
DONE
2026-03-11
Deployer: joal
Refinery-source:
- Update ProduceCanaryEvents job https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1249982 + https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1250016
2026-03-05 (special Thursday post cleanups)
Deployer: dr0ptp4kt (with Marcel and Sandra)
Refinery:
- Adapt imagelinks pipeline and consumers for imagelink normalization | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1239200
- No-op: Fix druid banner_activity data prep job | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1240253
Airflow:
- After refinery deployment. Pass mediawiki_private_linktarget_table to commons impact metrics dag | https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2026
2026-02-18
Deployer: joal
Refinery:
- Use names in banner activity GROUP BY - https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1239232
- Add first_campaign_status_code for banner activity - https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1238821
2026-02-10
Deployer: xcollazo
Refinery:
- 1235830: MediawikiDumper: fix filenames to include end revision when covering a single page. | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1235830
- 1236347: Migrate cu_changes table to use cuua_text in new cu_useragent table. | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1236347
2026-02-01
Deployer: Joseph
Refinery:
- 1233834: Remove mediawiki_wikitext_* from refinery-drop-mediawiki-snapshots | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1233834
  - Minor non-urgent patch. No need to release if this is the only patch.
- Update pageview project allowlist - https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1235201
- HQL for druid webrequest_sampled ingestion https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1235740
Airflow:
- Load webrequest_sampled in druid hourly https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1967
2026-01-21
Deployer: Antonio/Joseph
Refinery:
- Update pingback HQL code for new PHP and MediaWiki versions https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1222506
- Update pageview allowlist
- Update event _sanitized allowlist https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1207489
Airflow:
- Update pingback MediaWiki and PHP versions to include new values https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1909
  - The refinery deployment needs to be done first.
  - After deploying this, from Cindy: once the patches are merged, the weekly queries will need to be re-run starting from the beginning of May 2025. Xcollazo is happy to do this part after we deploy. Just ping Xcollazo.
2025-12-03
Deployer: Antoine
Refinery:
- task T409584 Add JA3N User-Agent queries https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1212214 and https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1213488 and https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1213522 (no need to do anything else!)
2025-11-18
Deployer: Marcel and Javier
Refinery:
- task T405039 - Add HQL for edit_per_editor_per_page_daily and pageview_per_editor_per_page_daily https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1196892
DONE
2025-11-12
Deployer: Joal
Refinery-source:
- task T406531 - Add new referral sources to pageview data https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1203389
- task T408178 - Remove mediawiki.wikistories_* sanitization allowlist entries https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1202718
- T407239 - Fix Duplicate Pageview metrics records in data quality tables. | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1203129
- T406000 Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 (1202334)
- 1203124: Fix bug MW Dumper in which vertical bars ( `|` ) were not being honored. | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1203124
- After the refinery-source release, we should:
  - merge https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1795, which will pick up this fix on the File Export DAGs
  - wait until the merge request makes it to the main Airflow instance
  - delete DagProperties at https://airflow.wikimedia.org/variable/edit/372 so that the auto-regenerated one points to the new jar
  - resume the following DAGs, which have been cleared and are ready to go:
Airflow:
- task T406531 - Add new referral sources to pageview data - https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1796
- task T409470 - Fix mediawiki_history_dumps failure - https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1797
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1795 (see above in refinery-source section)
2025-11-05
Deployer: Joseph
Refinery Source:
- 1199485: Add Data quality check for Pageview Human-Bot ratio anomaly | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1199485
- task T406531 - Add new referral sources to pageview data https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1198313
- Mediawiki-History Bug fix: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1202191
Airflow:
- T407239 - Add Dag to run daily Human to Bot page views ratio check https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1776 | This MR should be deployed after refinery source is deployed. It needs refinery-job jar v0.3.7.
- task T406531 - Add new referral sources to pageview data https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1780 | This MR should be deployed after refinery source is deployed. It needs refinery-hive jar v0.3.7.
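The ordering constraint in the two Airflow items above (deploy the MR only once the refinery jar is at v0.3.7) can be sketched as a version comparison. This is a minimal illustration in Python: `parse_version` and `jar_is_new_enough` are hypothetical helpers, and the deployed versions in the loop are made-up sample values. Finding out which jar version is actually deployed is outside the scope of this sketch.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v0.3.7' or '0.3.7' into (0, 3, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def jar_is_new_enough(deployed: str, required: str) -> bool:
    """True when the deployed jar version is at least the required one."""
    return parse_version(deployed) >= parse_version(required)


REQUIRED = "v0.3.7"  # from the log entry above

# Sample deployed versions, for illustration only.
for deployed in ("v0.3.6", "v0.3.7", "v0.3.10"):
    if jar_is_new_enough(deployed, REQUIRED):
        status = "ok to deploy the Airflow MR"
    else:
        status = "deploy refinery source first"
    print(deployed, "->", status)
```

Comparing tuples of integers rather than raw strings matters here: as strings, "v0.3.10" would sort before "v0.3.7".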
2025-10-29
Deployer: Sandra
Refinery Source:
- 1198080: Fix various bugs on MW Dumper code. | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1198080
- 1198152: Add utility to create SHA256 fingerprints of the files of a particular HDFS folder. | https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1198152
2025-10-22
To-be deployer: Aleksander
Refinery Source:
- Add user_central_id to the mediawiki_history dataset(s) https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1194951
2025-10-14
To-be deployer: Marcel
Refinery:
- task T405533 - Unique devices data uses non-standard domains for Wikidata, Wikifunctions, and MediaWiki.org https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1194885 . Note: This task has a pending Airflow patch to be merged/deployed once this one is deployed: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1743 [DONE]
- task T406000 - Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 - https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1196716 Nullify sha1 in Sqoop [DONE]
Refinery Source:
- T365203 - Add check for wikis count to Mediawiki history data quality checks https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1193440 [DONE]
- T365203 - Bug Fix: Add support for Deequ Metric value Distribution data type https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1195268 [DONE]
- task T406000 - Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1196049 and https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1196469. Note: This patch needs a related Airflow patch: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1750. This one also: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1196485 [DONE]
- task T384945 Modify code to dump all slots AND Template:PabT Adapt MW Content pipelines to the removal of upstream revision.rev_sha1 - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1195330 [DONE]