2024-10-10 Webrequest Data Loss - clobbered hadoop _temporary dir
Status | In Progress
Severity | High
Business data steward | Omari Sefu
Technical data steward | Andreas Hoelzl
Incident coordinator | Andrew Otto
Incident response team | Andrew Otto, Joseph Allemandou (lead data engineer), Antoine Quhen
Date detected | 2024-10-10
Date resolved | 2024-10-18
Start of issue | Since 2023-05-11, when the webrequest refine job was migrated to Spark
Phabricator ticket | T376882 2024-10-10 Data Loss Incident - webrequest Hive table
Summary
The Hive wmf.webrequest table is missing a small percentage of data. This affects metrics computed by any dataset downstream of wmf.webrequest.
The data is missing due to an issue with parallel ingestion of data into the Hive table. There is no missing data in the raw input webrequest logs, so we will be able to backfill any period for which we still have raw data. Downstream jobs will need to be rerun.
Webrequest data between 2023-05-11 and 2024-09-09 has a small percentage of loss, which we roughly estimate at 0.2% on average over time. Specific hours of data may have had more or less loss. We cannot calculate the loss precisely before 2024-09-09.
We will be able to backfill data after 2024-09-09, so there should be no loss going forward due to this bug.
Recommendations
Detailed follow-up steps are below; here is a summary:
To mitigate, we will set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 as described here, then backfill the data we still have and rerun dependent downstream jobs.
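For illustration, here is a minimal PySpark sketch of applying this setting when creating a Spark session (the application name is hypothetical, not the actual refine job):

```python
from pyspark.sql import SparkSession

# With committer algorithm version 2, each task moves its output files
# directly to the final destination when the task commits, rather than
# relying on a job-level commit out of the shared _temporary directory.
spark = (
    SparkSession.builder
    .appName("refine_webrequest")  # hypothetical application name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .enableHiveSupport()
    .getOrCreate()
)
```

The same setting can also be passed on the command line via spark-submit with --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2.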
Backfilling would be easier if we had:
- Airflow task lineage
- Direct mappings between datasets and airflow tasks
- Cascading rerun support, either via Airflow or something we implement
The datasets config project hoped to provide some of this.
Description
From: https://phabricator.wikimedia.org/T376882#10219432
The issue that causes this is described well in T347076#9334900.
Fixing this globally was considered in T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues, but it was instead decided to have Airflow default to disallowing parallel DAG runs by setting max_active_dag_runs=1. This default was later reverted back to 3.
Instead, folks are advised to set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 if they encounter this error when backfilling, as documented here.
We parallelize webrequest refinement by generating two DAGs for each hour, one for each webrequest_source partition, which are currently only 'upload' and 'text'. This was done because many downstream jobs only care about the webrequest_source=text Hive partition. Refining these separately allows users to depend on one or the other without having to wait for both.
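A minimal sketch of this DAG-generation pattern, with hypothetical DAG and task IDs (the real refine job is more involved):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# One DAG per webrequest_source partition, so consumers of 'text' don't
# have to wait for 'upload' to finish refining, and vice versa.
for source in ("text", "upload"):
    with DAG(
        dag_id=f"refine_webrequest_{source}",  # hypothetical DAG id
        start_date=datetime(2023, 5, 11),
        schedule="@hourly",
        # Limits concurrent runs of *this* DAG only; the two generated
        # DAGs can still write to the same Hive table in parallel.
        max_active_runs=3,
    ) as dag:
        EmptyOperator(task_id=f"refine_{source}")  # placeholder task
    globals()[dag.dag_id] = dag  # register the generated DAG with Airflow
```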
Note that the use of max_active_dag_runs=1 would not have avoided this webrequest loss: these are separate DAGs, so their tasks are allowed to run in parallel even with max_active_dag_runs=1 (and it is currently 3 anyway).
The use of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is essentially required if parallel writes are to be made to non-Iceberg Hive tables. We were not using it for webrequest.
Root Cause
Due to a bug in Hadoop, parallel writes from Spark to a Hive table can result in data loss in some scenarios.
From T347076#9334900:
- When writing into a Hive table, Spark creates a temporary folder at the table root path, named _temporary.
- If multiple Spark applications insert into the same table at the same time, they share the same _temporary folder (what a mess!).
- Data written by Spark tasks in the _temporary folder is of the form _temporary/ATTEMPT_NUMBER/_temporary/ATTEMPT_ID while a task is still running, and _temporary/ATTEMPT_NUMBER/TASK_ID when the task is done.
- When the overall job is done, files are moved from the various _temporary/ATTEMPT_NUMBER/TASK_ID folders into the Hive table partition output destination on HDFS. This step does not move files from other jobs' tasks, as the Hadoop ApplicationMaster knows which task names to expect.
- When the Spark session is closed and the Hadoop job is finishing, the _temporary folder is deleted. This is the step that leads to the issue: if the folder is still being used by another job, any data already written by that job to _temporary is lost before it is committed and moved to its final output location.
- The other job doesn't fail, even though _temporary was deleted, because the task output writers recreate the temporary directory on demand.
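To make the race concrete, here is a toy local-filesystem simulation of the sequence above (paths and names are illustrative; the real logic lives in Hadoop's FileOutputCommitter):

```python
import os
import shutil
import tempfile

table_root = tempfile.mkdtemp()  # stands in for the Hive table root on HDFS
shared_tmp = os.path.join(table_root, "_temporary")

# Job B, still running, writes task output under the shared _temporary dir.
job_b_task = os.path.join(shared_tmp, "0", "task_b_0001")
os.makedirs(job_b_task)
with open(os.path.join(job_b_task, "part-00000"), "w") as f:
    f.write("job B's uncommitted data")

# Job A finishes first and, as part of its cleanup, deletes the whole
# shared _temporary directory -- including job B's pending output.
shutil.rmtree(shared_tmp)

# Job B does not fail: its writers silently recreate _temporary on
# demand, but the data written before the cleanup is gone.
print(os.path.exists(job_b_task))  # False -> job B's output was lost
```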
Affected Datasets and Services
All datasets downstream of wmf.webrequest are affected.
The Webrequest dataloss 2024-10 spreadsheet has the list of affected airflow jobs that need to be rerun.
Affected datasets include the following.
Hive tables
- Web Requests (webrequest)
- Pageviews (pageview_hourly, pageview_actor)
- Projectview hourly
- Mediacounts
- Unique Devices
- Browser general
- MediaWiki API requests
- Mobile apps sessions and uniques metrics
- Interlanguage navigation
Graphite metrics
These cannot be rerun, as graphite does not support overwriting.
- api_metrics
- wikidata
Followup Steps
Mitigation:
- ✅ Set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 for the webrequest refine job. This will solve the ongoing problem with minimal side effects.
- ✅ Rerun all airflow jobs downstream of webrequest.
Long Term Solutions
- ✅ T351388 Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues. This will make all airflow scheduled spark jobs safe for concurrent writes to different Hive partitions.
- T377006 Fail Spark job or airflow task if unexpected number of output files (see the sketch after this list)
- T377333 Set up Alerting for Data Quality dags in Airflow.
- Might be difficult due to seasonal and daily variations in the metrics
- Upstream a fix to Hadoop for MAPREDUCE-7331 - Make temporary directory used by FileOutputCommitter configurable
- and/or fork our own FileOutputCommitter with configurable _temp dir?
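As a rough illustration of the T377006 idea, a check like the following could run after a write and fail the task when the number of part files at the output path does not match what the job expected (the function name and how `expected` is derived are assumptions, not the actual implementation):

```python
from pyspark.sql import SparkSession

def assert_output_file_count(spark: SparkSession, path: str, expected: int) -> None:
    """Fail fast if an output directory holds a different number of part
    files than the job wrote -- a symptom of a clobbered _temporary dir."""
    # Reach the Hadoop FileSystem API through the Spark JVM gateway.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path(path))
    actual = sum(1 for s in statuses if s.getPath().getName().startswith("part-"))
    if actual != expected:
        raise RuntimeError(f"expected {expected} part files at {path}, found {actual}")
```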