Analytics/Archive/Spark/Migration to Spark 3

This page contains historical information. It may be outdated or unreliable.
The following information is retained for historical purposes only. The migration to Spark 3 was completed some time ago.

The Data Engineering team has now upgraded to Spark 3 and no longer supports Spark 2. Official support ceased on 31 March 2023, and the Spark 3 shuffler service was enabled on 5 July 2023. Please write all new jobs using Spark 3; by now, your existing jobs should already have been migrated. If you still need help doing so, please reach out.

If you are using Spark via Wmfdata-Python, you will just need to switch to a conda-analytics environment and Wmfdata-Python will automatically start using Spark 3.
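As a minimal sketch of what this looks like in practice, assuming a conda-analytics environment is active. The session-helper name used here is an assumption; check the Wmfdata-Python documentation for the exact API of your version.

```python
# Minimal sketch, assuming a conda-analytics environment and a recent
# Wmfdata-Python. The helper name create_session is an assumption;
# consult the wmfdata documentation for your version's exact API.
import wmfdata

spark = wmfdata.spark.create_session(app_name="my-spark3-job")

# The session reports the Spark version it was built against;
# under conda-analytics this should be 3.x.
print(spark.version)
```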

Spark 2 and Spark 3

We have supported running both Spark 2 (2.4.4) and Spark 3 (3.1.2) jobs in our Hadoop cluster for some time now. However, Spark 3 has advantages over Spark 2 in several areas: the usual improvements in performance and security, as well as new features like better Python support (especially via Pandas) and the SQL catalog. We therefore prefer Spark 3, which also keeps us closer to the latest Spark releases.
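To illustrate the improved Pandas integration, here is a small sketch of a Spark 3 style Pandas UDF declared with plain Python type hints (a Spark 3.0+ feature); the function, column names, and data are made up for the example:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Spark 3 lets you declare pandas UDFs with Python type hints,
# instead of the Spark 2 style PandasUDFType enum.
@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```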

There can be only one

We have already migrated many of our Spark 2 jobs to Spark 3, but we face an issue: Spark 2 and Spark 3 cannot fully coexist in the same cluster because of the Spark shuffler. The shuffler is a core component of Spark, and only one version of it can be installed per cluster. Until recently we ran the Spark 2 shuffler, which Spark 3 supports. However, Spark 3 will not work at full capacity until paired with the corresponding Spark 3 shuffler, which enables valuable new features like automatic handling of skewed data. Unfortunately, Spark 2 cannot run with the Spark 3 shuffler. The Spark 3 shuffler has now been enabled and the Spark 2 shuffler has been decommissioned.
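The skew handling mentioned above is exposed on the application side through Spark 3's adaptive query execution settings (the external shuffle service itself is configured at the cluster level). A sketch of the relevant configuration; the values shown are illustrative, not our cluster's defaults:

```python
from pyspark.sql import SparkSession

# Sketch of the Spark 3 settings related to automatic skew handling.
# spark.sql.adaptive.* are standard Spark 3 configs.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Adaptive query execution: re-plans stages using runtime statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Splits heavily skewed partitions into smaller tasks at join time.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```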

Spark 2 deprecation

This is why we decided to set a date on which we would stop supporting the execution of Spark 2 jobs in our Hadoop cluster, so that we could finally switch to the Spark 3 shuffler. The chosen date was the end of Q3 (March 31st, 2023). We put together a list of the jobs using Spark 2 (see below). If your jobs are on the list, please migrate them as soon as possible. If you have Spark 2 jobs that are not on the list, please add them, and then migrate them.

Impact

This change affects all teams that have Spark 2 jobs running in our Hadoop cluster. Teams will need to change their job submissions to point to the Spark 3 executable and libraries. The Spark 2 code will likely need (small) syntax adaptations, so we recommend that all migrated jobs be re-tested and the data they produce re-checked. We haven't had any major issues migrating our own Spark 2 jobs to Spark 3 so far, but we recommend teams reserve some time for potential problems. Likely issues include dependency conflicts (if you use internal Spark APIs extensively) and query-plan changes that can alter the performance characteristics of computationally heavy jobs (usually for the better, but regressions do happen).
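One concrete example of the "small syntax adaptations" category: Spark 3 switched to the Proleptic Gregorian calendar and a stricter datetime parser, so jobs that parse dates with patterns Spark 2 accepted leniently can start failing or returning NULLs. A sketch of the standard escape hatch while you fix the patterns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migration-demo").getOrCreate()

# spark.sql.legacy.timeParserPolicy is a standard Spark 3 config:
# EXCEPTION (default, fail on ambiguous parses), CORRECTED (new behavior),
# or LEGACY (restore the Spark 2 parser while you migrate your patterns).
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```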

Help

We in Data Engineering will help teams resolve their Spark 3 migration issues as much as we can. You can reach out to us during our biweekly office hours (Tuesdays at 16:00 UTC) or via Slack in the #data-engineering channel. See also the official Spark documentation on migrating from Spark 2 to Spark 3.

List of Spark 2 jobs to migrate

| Dataset / Project | Job count | Owners | Stakeholders | Needs Airflow migration | Risk | Complexity | Spark 3 compatible | Spark 3 tested | Spark 3 in production | Comments |
|---|---|---|---|---|---|---|---|---|---|---|
| Puppet Druid loading | 8 (+1 in test cluster) | Data Engineering | MediaWiki Core, Traffic | Yes | Medium | Low | Yes | No | No | All these jobs use the same shared Spark code; we only need to migrate that code, which should also be used in the Airflow Druid loading jobs. |
| Refine | 4 (+2 in test cluster) | Data Engineering | Data Engineering | Yes | High | High | Yes | Yes | No | All these jobs use the same shared Spark code; we only need to migrate that code, although it is large and complex (9 Scala files, ~4,000 lines of code). |
| MediaWiki History | 4 | Data Engineering | Data Engineering | Yes | Medium | High | Yes | Yes | No | The most complex code of all: 33 Scala files, about 9,000 lines of code. Currently 99.99% of Spark 3 output matches Spark 2. |
| HDFSCleaner | 4 | Data Engineering | Data Engineering | Yes | Low | Low | Yes | No | No | All these jobs use the same shared Spark code; we only need to migrate that code (1 Scala file). |
| HistoricalProjectcountsRaw | 1 | Data Engineering | Data Engineering | No | Low | Low | Yes | No | No | May not need migration: a legacy job made to be executed just once, which we likely won't have to run again. |
| ProduceCanaryEvents | 1 | Data Engineering | Data Engineering | Yes | Medium | Low | Yes | No | No | |
| WebrequestSubsetPartitioner | 1 | Data Engineering | Data Engineering | No? | Low? | Low | Yes | No | No | This job may not be running anywhere at the moment; unclear whether we need to migrate it. |
| refinery-spark libraries | 8 libraries | Data Engineering | Data Engineering | No | Medium | Medium | Yes | No | No | Some of these libraries are used by jobs listed here, but we list them explicitly so that we don't forget any of them. |
| Glent | 7 | Search | Search | No | Low | Medium | Unknown | No | No | |
| WDQS Spark tools | 6 libraries | Search?, Wikidata? | Search, Wikidata | No | Medium | Medium | Unknown | No | No | |
| Mjolnir | 8 | Search | Search | No | Low | High | Unknown | No | No | |
| Search satisfaction | 2 | Search | Search | No | Low | | Unknown | No | No | |
| Cirrus namespace map | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| Relforge queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| Head queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| MediaWiki recommendation create | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
| MediaWiki revision predictions | 4 | Search | Search | No | Low | | Unknown | No | No | |
| Convert to elasticsearch bulk | 5 | Search | Search, SDoC? | No | Medium | Low | Unknown | No | No | |
| Image suggestions to Cassandra | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Commonswiki search index | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Image suggestion indices | 1 | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
| Section topics | 1 | Structured Data | Structured Data | No | Low | Low | Yes | Yes | Yes | Performance decrease due to Spark's query-plan change; see https://phabricator.wikimedia.org/T323107 |
| Section Alignment | | Research | | | | | Unknown | No | No | |
| Image Section Recommendation | 2 | Research | | Yes | | | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T328641 |
| Content gaps metrics | | | | | | | | | | |
| Project template | | | | | | | | | | |
| Image features | | | | | | | | | | |
| Link recommendation (add-a-link) | 2 | Research | Growth, ML | No | | | Unknown | No | No | 2 Spark jobs when training the link-recommendation model for a new language: https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh |
| Welcome Survey aggregation | 1 | Growth | Growth, Product Analytics | No | Low | Low | Unknown | No | No | Runs monthly; https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb |
| iCloud Private Relay usage | 1 | Product Analytics | Product Analytics | No | Low | Low | Unknown | No | No | Runs daily; stat1006:~nettrom/src/T289795/T292106-relay-pageviews.ipynb |