Analytics/Archive/Spark/Migration to Spark 3
The Data Engineering team has now upgraded to Spark 3 and no longer supports Spark 2. Support officially ended on 31 March 2023, and the spark3 shuffler service was enabled on 5 July 2023. Please write all future jobs using Spark 3; by now you should have finished migrating your existing jobs. If you still need help doing so, please reach out.
If you are using Spark via Wmfdata-Python, you will just need to switch to a conda-analytics environment, and Wmfdata-Python will automatically start using Spark 3.
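For example, here is a quick way to confirm which Spark version your session picks up after switching environments. This is a minimal sketch: the `create_session` call and its `app_name` argument are assumptions about the Wmfdata-Python API, so check the wmfdata-python documentation for the exact signature.

```python
# Minimal sanity check from a conda-analytics environment (sketch only;
# verify the exact wmfdata API against the wmfdata-python docs).
import wmfdata

# Assumed helper: creates (or returns) a SparkSession configured for the cluster.
spark = wmfdata.spark.create_session(app_name="spark3-migration-check")

# With conda-analytics this should print a 3.x version string.
print(spark.version)
```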
Spark 2 and Spark 3
We have been supporting the execution of both Spark 2 (2.4.4) and Spark 3 (3.1.2) jobs in our Hadoop cluster for some time now. That said, Spark 3 has advantages over Spark 2 in several areas: the usual improvements in performance and security, as well as new features like better Python support (especially when using Pandas) and the SQL Catalog. We therefore prefer Spark 3, which also keeps us on top of the latest Spark releases.
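As one illustration of the improved Pandas integration, Spark 3 lets you declare pandas UDFs with ordinary Python type hints instead of the Spark 2 `PandasUDFType` constants. A small PySpark sketch, assuming an existing SparkSession named `spark`:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, col

# Spark 3 style: the UDF's input/output types come from Python type hints.
@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(col("temp_f"), fahrenheit_to_celsius(col("temp_f")).alias("temp_c")).show()
```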
There can be only one
We have successfully migrated many of our Spark 2 jobs to Spark 3 already, but we are facing an issue: Spark 2 and Spark 3 cannot fully co-exist in the same cluster because of the Spark shuffler. The shuffler is a core component of the Spark system, and only one version of it can be installed per cluster. Until recently we were using the Spark 2 shuffler, because Spark 3 supports it. However, Spark 3 will not work at its full capacity until we pair it with the corresponding Spark 3 shuffler, which has valuable new features like automatic handling of skewed data. Unfortunately, Spark 2 cannot run with the Spark 3 shuffler. The spark3 shuffler has now been enabled and the spark2 shuffler has been decommissioned.
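For context, the skew handling mentioned above is part of Spark 3's Adaptive Query Execution, which can split oversized shuffle partitions at runtime. A sketch of the relevant session settings, with illustrative values only (the defaults are reasonable for most jobs):

```python
from pyspark.sql import SparkSession

# Illustrative AQE settings for a Spark 3 session; values shown are examples.
spark = (
    SparkSession.builder
    .appName("aqe-skew-join-example")
    # Enable Adaptive Query Execution (on by default in later 3.x releases).
    .config("spark.sql.adaptive.enabled", "true")
    # Let Spark split shuffle partitions that are much larger than the rest.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition counts as skewed if it is this many times the median size.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .getOrCreate()
)
```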
Spark 2 deprecation
This is why we have decided to set a date at which we'll stop supporting the execution of Spark 2 jobs in our Hadoop cluster, so we can finally switch to the Spark 3 shuffler. The chosen date was the end of Q3 (March 31st, 2023). We put together a list of jobs that are currently using Spark 2 below. If your jobs are on the list, please migrate them as soon as possible. If you have Spark 2 jobs that are not on the list, please add them, and then migrate them.
Impact
This change will affect all teams that have Spark 2 jobs running in our Hadoop cluster. Teams will need to change their job submissions to point to the Spark 3 executable and libraries. It's likely that Spark 2 code will need (small) syntax adaptations, and we recommend that all migrated jobs be re-tested and the data they produce re-checked. We haven't had any major issues migrating our Spark 2 jobs to Spark 3 so far, but we recommend teams reserve some time for potential problems. The most likely issues are dependency problems (if you use internal Spark APIs extensively) or query plan changes that alter the performance characteristics of computationally heavy jobs (usually for the better, but regressions can happen).
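As an example of the kind of small adaptation to expect: Spark 3 switched to the Proleptic Gregorian calendar and stricter datetime parsing, which can break jobs that read legacy date formats. A hedged sketch of one compatibility setting you may need while migrating, assuming an existing SparkSession named `spark`:

```python
# One example of a Spark 2 -> Spark 3 behaviour change and its escape hatch.
# Only set this if your job actually fails or changes output when parsing
# pre-1582 dates or lenient datetime patterns.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Recommended check after migrating: run the job on a small partition and
# compare the output against what the Spark 2 version produced.
```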
Help
We (Data Engineering) will help teams solve their issues with the Spark 3 migration as much as we can. You can reach out to us during our biweekly office hours (Tuesdays at 16:00 UTC) or via Slack in the #data-engineering channel. See also the official Spark documentation on migrating from Spark 2 to Spark 3.
List of Spark 2 jobs to migrate
Dataset / Project | Job count | Owners | Stakeholders | Needs Airflow migration | Risk | Complexity | Spark 3 compatible | Spark 3 tested | Spark 3 in production | Comments |
---|---|---|---|---|---|---|---|---|---|---|
Puppet Druid loading | 8 (+1 in test cluster) | Data Engineering | MediaWiki Core, Traffic | Yes | Medium | Low | Yes | No | No | All these jobs use the same shared Spark code. We only need to migrate this one, which should be used as well in the Airflow Druid loading jobs. |
Refine | 4 (+2 in test cluster) | Data Engineering | Data Engineering | Yes | High | High | Yes | Yes | No | All these jobs use the same shared Spark code. We only need to migrate this one, although it's large and complex (9 scala files, 4000 code lines). |
MediaWiki History | 4 | Data Engineering | Data Engineering | Yes | Medium | High | Yes | Yes | No | This code is the most complex of all. It consists of 33 scala files, about 9000 lines of code. Currently 99.99% of Spark 3 output matches Spark 2. |
HDFSCleaner | 4 | Data Engineering | Data Engineering | Yes | Low | Low | Yes | No | No | All these jobs use the same shared Spark code. We only need to migrate this one (1 scala file). |
HistoricalProjectcountsRaw | 1 | Data Engineering | Data Engineering | No | Low | Low | Yes | No | No | Not sure we need to migrate this one. It is a legacy job that was made to be executed just once. And I think we won't have to run it again. |
ProduceCanaryEvents | 1 | Data Engineering | Data Engineering | Yes | Medium | Low | Yes | No | No | |
WebrequestSubsetPartitioner | 1 | Data Engineering | Data Engineering | No? | Low? | Low | Yes | No | No | I think this job is currently not running anywhere? Not sure if we need to migrate it. |
refinery-spark libraries | 8 libraries | Data Engineering | Data Engineering | No | Medium | Medium | Yes | No | No | Some of these libraries are used by the jobs listed in this sheet. But wanted to make them explicit, so that we don't forget any of them. |
Glent | 7 | Search | Search | No | Low | Medium | Unknown | No | No | |
WDQS Spark tools | 6 libraries | Search?, Wikidata? | Search, Wikidata | No | Medium | Medium | Unknown | No | No | |
Mjolnir | 8 | Search | Search | No | Low | High | Unknown | No | No | |
Search satisfaction | 2 | Search | Search | No | Low | | Unknown | No | No | |
Cirrus namespace map | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
Relforge queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
Head queries | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
MediaWiki recommendation create | 1 | Search | Search | No | Low | Low | Unknown | No | No | |
MediaWiki revision predictions | 4 | Search | Search | No | Low | | Unknown | No | No | |
Convert to elasticsearch bulk | 5 | Search | Search, SDoC? | No | Medium | Low | Unknown | No | No | |
Image suggestions to Cassandra | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
Commonswiki search index | 9 queries | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
Image suggestion indices | 1 | Structured Data | Structured Data | Yes | Low | Medium | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T323108 |
Section topics | 1 | Structured Data | Structured Data | No | Low | Low | Yes | Yes | Yes | Performance decrease due to Spark's query plan change, see https://phabricator.wikimedia.org/T323107 |
Section Alignment | | Research | | | | | Unknown | No | No | |
Image Section Recommendation | 2 | Research | | Yes | | | Unknown | Yes | Yes | Work done via https://phabricator.wikimedia.org/T328641 |
Content gaps metrics | ||||||||||
Project template | ||||||||||
Image features | ||||||||||
Link recommendation (add-a-link) | 2 | Research | Growth, ML | No | | | Unknown | No | No | 2 spark jobs when training the link-recommendation model for a new language https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh |
Welcome Survey aggregation | 1 | Growth | Growth, Product Analytics | No | Low | Low | Unknown | No | No | Runs monthly, https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb |
iCloud Private Relay usage | 1 | Product Analytics | Product Analytics | No | Low | Low | Unknown | No | No | Runs daily, stat1006:~nettrom/src/T289795/T292106-relay-pageviews.ipynb |