Talk:Data Engineering/Systems/Airflow/Developer guide

Documenting Spark config changes for concurrent runs

Thank you @Joal for documenting!

Would it be possible to expand on "The solution is to add a parameter to your spark job (whether in SQL or as a spark configuration parameter): spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2"?

For queries, would one use

SET spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2;

INSERT INTO ...

potentially in a fork/branch? And if using the SparkSqlOperator, would one then modify the HQL URI variable in the Airflow UI to temporarily point the DAG at this concurrent-backfill-friendly version of the query for the duration of the backfill?

As for the Spark configuration parameter, is that something that can be set temporarily via the Airflow UI for the duration of the backfill, or would the DAG need to be modified and deployed? E.g. changing conf in the compute step in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics_product/dags/wikipedia_chatgpt_plugin/wikipedia_chatgpt_plugin_searches_daily_dag.py Bearloga (talk) 16:25, 22 November 2023 (UTC)
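
For illustration, here is a minimal sketch of what hard-coding that setting into the compute step could look like, assuming the task uses the upstream Apache Airflow SparkSqlOperator (the wmf_airflow_common wrapper in airflow-dags may expose this differently); the DAG id, schedule, and HQL URI are illustrative placeholders, not the actual contents of the linked DAG:

# Sketch only: upstream Apache Airflow Spark provider; the WMF wrapper used
# by the linked DAG may take extra Spark properties in a different form.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

with DAG(
    dag_id="example_backfill_friendly_dag",  # placeholder
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    compute = SparkSqlOperator(
        task_id="compute",
        sql="example_query.hql",  # placeholder HQL URI
        # Extra Spark properties go in as comma-separated key=value pairs:
        conf="spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2",
    )

Either way the value is baked into deployed code, so a pure Airflow-UI toggle would presumably only work if the DAG were already written to read the property from a Variable or from dag_run conf.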

Thank you for your comment @Bearloga - I have just updated the section, I hope it's better :) --Joal (talk) 19:41, 4 December 2023 (UTC)
Thank you so much! It's very clear!
I'm confused, though, because your edit says "3 instances of a dag_run per dag at a time. We have chosen to make this rule the default despite an issue with spark failing when having concurrent jobs writing to the same table (even if in different partitions)", but that's not true? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/airflow.pp still has max_active_runs_per_dag as 1, and I thought we reached a consensus to use 1? The only time that we'd need multiple instances of a DAG is when backfilling, and when backfilling we run into the concurrency problem. Bearloga (talk) 23:55, 11 December 2023 (UTC)
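
For reference, a minimal sketch of how a single DAG could pin its own concurrency regardless of the installation-wide default, assuming the puppet value maps to Airflow's [core] max_active_runs_per_dag setting and that the DAG is defined with the plain Airflow DAG constructor (the dag-factory helpers in airflow-dags may expose this differently); all names are placeholders:

# Sketch only: max_active_runs on the DAG overrides the installation-wide
# max_active_runs_per_dag default for this one DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_dag",  # placeholder
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # at most one active dag_run at a time
    catchup=True,
) as dag:
    compute = EmptyOperator(task_id="compute")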