Analytics/Archive/Spark/Migration to Spark 3

This page contains historical information. It may be outdated or unreliable.
The following information is retained for historical purposes only. The migration to Spark 3 was completed some time ago.

The Data Engineering team has now upgraded to Spark 3 and no longer supports Spark 2. Official support ceased on 31 March 2023, and the Spark 3 shuffler service was enabled on 5 July 2023. Please write all new jobs using Spark 3; by now, your existing jobs should already have been migrated. If you still need help doing so, please reach out.

If you are using Spark via Wmfdata-Python, you will just need to switch to a conda-analytics environment and Wmfdata-Python will automatically start using Spark 3.
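As a minimal sketch of what this looks like in practice, assuming a conda-analytics environment is active. The session-helper name used here is an assumption; check the Wmfdata-Python documentation for the exact API of your version.

```python
# Minimal sketch, assuming a conda-analytics environment and a recent
# Wmfdata-Python. The helper name create_session is an assumption;
# consult the wmfdata documentation for your version's exact API.
import wmfdata

spark = wmfdata.spark.create_session(app_name="my-spark3-job")

# The session reports the Spark version it was built against;
# under conda-analytics this should be 3.x.
print(spark.version)
```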

Spark 2 and Spark 3

We have supported running both Spark 2 (2.4.4) and Spark 3 (3.1.2) jobs in our Hadoop cluster for some time now. However, Spark 3 has advantages over Spark 2 in several areas: the usual improvements in performance and security, as well as new features like better Python support (especially via Pandas) and the SQL catalog. We therefore prefer Spark 3, which also keeps us closer to the latest Spark releases.
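To illustrate the improved Pandas integration, here is a small sketch of a Spark 3 style Pandas UDF declared with plain Python type hints (a Spark 3.0+ feature); the function, column names, and data are made up for the example:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Spark 3 lets you declare pandas UDFs with Python type hints,
# instead of the Spark 2 style PandasUDFType enum.
@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```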

There can be only one

We have already migrated many of our Spark 2 jobs to Spark 3, but we face an issue: Spark 2 and Spark 3 cannot fully coexist in the same cluster because of the Spark shuffler. The shuffler is a core component of Spark, and only one version of it can be installed per cluster. Until recently we ran the Spark 2 shuffler, which Spark 3 supports. However, Spark 3 will not work at full capacity until paired with the corresponding Spark 3 shuffler, which enables valuable new features like automatic handling of skewed data. Unfortunately, Spark 2 cannot run with the Spark 3 shuffler. The Spark 3 shuffler has now been enabled and the Spark 2 shuffler has been decommissioned.
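The skew handling mentioned above is exposed on the application side through Spark 3's adaptive query execution settings (the external shuffle service itself is configured at the cluster level). A sketch of the relevant configuration; the values shown are illustrative, not our cluster's defaults:

```python
from pyspark.sql import SparkSession

# Sketch of the Spark 3 settings related to automatic skew handling.
# spark.sql.adaptive.* are standard Spark 3 configs.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Adaptive query execution: re-plans stages using runtime statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Splits heavily skewed partitions into smaller tasks at join time.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```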

Spark 2 deprecation

This is why we decided to set a date on which we would stop supporting the execution of Spark 2 jobs in our Hadoop cluster, so that we could finally switch to the Spark 3 shuffler. The chosen date was the end of Q3 (March 31st, 2023). We put together a list of the jobs using Spark 2 (see below). If your jobs are on the list, please migrate them as soon as possible. If you have Spark 2 jobs that are not on the list, please add them, and then migrate them.

Impact

This change affects all teams that have Spark 2 jobs running in our Hadoop cluster. Teams will need to change their job submissions to point to the Spark 3 executable and libraries. The Spark 2 code will likely need (small) syntax adaptations, so we recommend that all migrated jobs be re-tested and the data they produce re-checked. We haven't had any major issues migrating our own Spark 2 jobs to Spark 3 so far, but we recommend teams reserve some time for potential problems. Likely issues include dependency conflicts (if you use internal Spark APIs extensively) and query-plan changes that can alter the performance characteristics of computationally heavy jobs (usually for the better, but regressions do happen).
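One concrete example of the "small syntax adaptations" category: Spark 3 switched to the Proleptic Gregorian calendar and a stricter datetime parser, so jobs that parse dates with patterns Spark 2 accepted leniently can start failing or returning NULLs. A sketch of the standard escape hatch while you fix the patterns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migration-demo").getOrCreate()

# spark.sql.legacy.timeParserPolicy is a standard Spark 3 config:
# EXCEPTION (default, fail on ambiguous parses), CORRECTED (new behavior),
# or LEGACY (restore the Spark 2 parser while you migrate your patterns).
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```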

Help

We in Data Engineering will help teams resolve their Spark 3 migration issues as much as we can. You can reach out to us during our biweekly office hours (Tuesdays at 16:00 UTC) or via Slack in the #data-engineering channel. See also the official Spark documentation on migrating from Spark 2 to Spark 3.

List of Spark 2 jobs to migrate

| Dataset / Project | Job count | Owners | Stakeholders | Needs Airflow migration | Risk | Complexity | Spark 3 compatible | Spark 3 tested | Spark 3 in production | Comments |
|---|---|---|---|---|---|---|---|---|---|---|
| Puppet Druid loading | 8 (+1 in test cluster) | Data Engineering | MediaWiki Core, Traffic | Yes | Medium | Low | Yes | No | No | All these jobs use the same shared Spark code; we only need to migrate that code, which should also be used in the Airflow Druid loading jobs. |
| Refine | 4 (+2 in test cluster) | Data Engineering | Data Engineering | Yes | High | High | Yes | Yes | No | All these jobs use the same shared Spark code; we only need to migrate that code, although it is large and complex (9 Scala files, ~4,000 lines of code). |
| MediaWiki History | 4 | Data Engineering | Data Engineering | Yes | Medium | High | Yes | Yes | No | The most complex code of all: 33 Scala files, about 9,000 lines of code. Currently 99.99% of Spark 3 output matches Spark 2. |
| HDFSCleaner | 4 | Data Engineering | Data Engineering | Yes | Low | Low | Yes | No | No | All these jobs use the same shared Spark code; we only need to migrate that code (1 Scala file). |
| HistoricalProjectcountsRaw | 1 | Data Engineering | Data Engineering | No | Low | Low | Yes | No | No | May not need migration: a legacy job made to be executed just once, which we likely won't have to run again. |
| ProduceCanaryEvents | 1 | Data Engineering | Data Engineering | Yes | Medium | Low | Yes | No | No | |
| WebrequestSubsetPartitioner | 1 | Data Engineering | Data Engineering | No? | Low? | Low | Yes | No | No | This job may not be running anywhere at the moment; unclear whether we need to migrate it. |
| refinery-spark libraries | 8 libraries | Data Engineering | Data Engineering | No | Medium | Medium | Yes | No | No | Some of these libraries are used by jobs listed here, but we list them explicitly so that we don't forget any of them. |
| Glent | 7 | Search | Search | No | Low | Medium | Unknown | No | No | |
| WDQS Spark tools | 6 libraries | Search?, Wikidata? | Search, Wikidata | No | Medium | Medium | Unknown | No | No | |
| Mjolnir | 8 | Search | Search | No | Low | High | Unknown | No | No | |
| Search satisfaction | 2 | Search | Search | No | Low | | Unknown | No | No | |
| Cirrus namespace map | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| Relforge queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| Head queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| MediaWiki recommendation create | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| MediaWiki revision predictions | 4 | Search | Search | No | Low | | Unknown | No | No | |
| Convert to elasticsearch bulk | 5 | Search | Search, SDoC? | No | Medium | Low | Unknown | No | No | |
| Image suggestions to Cassandra | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Commonswiki search index | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Image suggestion indices | 1 | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Section topics | 1 | Structured Data | Structured Data | No | Low | Low | Yes | Yes | Yes | Performance decrease due to Spark's query-plan change; see https://phabricator.wikimedia.org/T323107 |
| Section Alignment | | Research | | | | | Unknown | No | No | |
| Image Section Recommendation | 2 | Research | | Yes | | | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T328641 |
| Content gaps metrics | | | | | | | | | | |
| Project template | | | | | | | | | | |
| Image features | | | | | | | | | | |
| Link recommendation (add-a-link) | 2 | Research | Growth, ML | No | | | Unknown | No | No | 2 Spark jobs when training the link-recommendation model for a new language: https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh |
| Welcome Survey aggregation | 1 | Growth | Growth, Product Analytics | No | Low | Low | Unknown | No | No | Runs monthly; https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb |
| iCloud Private Relay usage | 1 | Product Analytics | Product Analytics | No | Low | Low | Unknown | No | No | Runs daily; stat1006:~nettrom/src/T289795/T292106-relay-pageviews.ipynb |