Image-suggestion/Runbook
Sometimes some of the data that the image suggestions, section topics, and/or SEAL pipelines rely on fails to generate.
Usually, the first you'll hear of a failure is an email to sd-alerts@lists.wikimedia.org with "SLA miss on DAG=image_suggestions_weekly" and/or "ImageSuggestionsTooLongSinceLastPush" in the subject line.
What to do when that happens
- Log into https://airflow-platform-eng.wikimedia.org/
- Look for any failed sensors in the DAGs. So far, the most commonly failing Hive tables have been:
  - wmf.mediawiki_wikitext_current (monthly)
  - wmf.wikidata_item_page_link (weekly)
  - wmf.wikidata_entity (weekly)
  - structured_data.commons_entity (weekly)
- Post a message in #talk-to-data-engineering on Slack and see if anyone knows why the partition didn't generate, and if they can kick off generation
- Post a message in #image-suggestions-and-sections on Slack to inform downstream that an image suggestions snapshot was skipped
We should be able to handle a week or two of no data, and things will just pick up from where they left off.
However, you will probably want to turn off the alert; to do that, log into a stat server and run image_suggestions/data_check.py manually.
If you do need to re-run a DAG for any reason, first pause the DAGs with the slider until everything is resolved. Once all the upstream data is ready then go to the grid view of the failed DAG you want to re-run, and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "clear" and it'll run again.
Once the failed DAG has finished you can unpause the other DAGs and they'll run in their own time.
DAG timeouts
All DAGs are set to time out after 6 days, as per the default configuration. Since they are all scheduled to run weekly on Thursdays, the timeout ensures there are no concurrent runs: any hanging DAG is stopped on Wednesday, the day before the next run. In the Airflow web UI, the task that caused a DAG to time out is marked with the skipped state and colored purple, while the DAG run itself is marked failed.
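The no-overlap argument above is just date arithmetic; a minimal sketch (the six-day timeout and weekly Thursday schedule come from the text, the variable names are assumptions, not the real DAG code):

```python
from datetime import timedelta

# Sketch of the timing argument: runs start weekly on Thursdays, and the
# default DAG timeout is 6 days, so a hanging run is stopped on Wednesday,
# one day before the next scheduled run begins.
schedule_interval = timedelta(weeks=1)  # weekly, Thursdays
dagrun_timeout = timedelta(days=6)      # default timeout

assert dagrun_timeout < schedule_interval  # runs can never overlap
```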
Search indices
The ALIS DAG populates the analytics_platform_eng.image_suggestions_search_index_full Hive table with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.
The search team has a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices.
If a DAG fails, we write an empty partition, which the search team treats as a no-op.
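Conceptually, the delta is a set difference between everything that should be in the search indices and what is already there. A minimal illustration in plain Python (the row values are made up; the real data lives in the Hive tables named above):

```python
# Rows that should exist in the search indices
# (stand-in for image_suggestions_search_index_full):
full_index = {("File:A.jpg", "suggestion-x"), ("File:B.jpg", "suggestion-y")}

# Rows already present in the indices
# (stand-in for discovery.cirrus_index_without_content):
already_indexed = {("File:A.jpg", "suggestion-x")}

# The difference is what gets written to image_suggestions_search_index_delta
# and injected by the search team's DAG.
delta = full_index - already_indexed
assert delta == {("File:B.jpg", "suggestion-y")}
```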
Commons' threshold
If the number of rows in the Commons delta/diff is above the threshold specified in the ALIS DAG's variables, it won't be shipped.
Make sure you override the value if needed:
- Log into https://airflow-platform-eng.wikimedia.org/
- In the top tabs, go to Admin > Variables
- Click on the Edit record icon to the left of the platform_eng/dags/alis_dag.py Key
- Set the commons_delta_threshold value
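The guard described above amounts to a simple comparison. A sketch with assumed names (only commons_delta_threshold is taken from the DAG variables; the function is hypothetical):

```python
def should_ship(delta_row_count: int, commons_delta_threshold: int) -> bool:
    # The Commons delta is shipped only while it stays at or below the
    # configured threshold; a suspiciously large delta is held back.
    return delta_row_count <= commons_delta_threshold

assert should_ship(1_000, 10_000) is True
assert should_ship(50_000, 10_000) is False
```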
Cassandra TTL
The TTL for Cassandra data is 3 weeks, so if a pipeline has been failing for a while then the Cassandra data might just disappear.
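A quick way to reason about whether the data has already expired (a sketch; the three-week value is the TTL stated above, everything else is assumed):

```python
from datetime import datetime, timedelta

TTL = timedelta(weeks=3)  # Cassandra rows expire 3 weeks after the last push


def data_still_present(last_successful_push: datetime, now: datetime) -> bool:
    # Once the last successful push is older than the TTL, the Cassandra
    # data is gone and must be regenerated via the reset procedure.
    return now - last_successful_push < TTL


assert data_still_present(datetime(2025, 1, 2), datetime(2025, 1, 16))
assert not data_still_present(datetime(2025, 1, 2), datetime(2025, 2, 2))
```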
You can reset the TTL as follows. Ensure that the cassandra DAG is paused before you start, and unpause it at the end.
- Click on the DAG
- Find the last successful run (green bar)
- Click on the feed_suggestions square
- Scroll down to check that the snapshot is correct
- Add a note describing what you’re doing
- Click on the green bar
- Click Clear or SHIFT-c and confirm
ALIS and SLIS with different snapshots
If ALIS and SLIS have a different snapshot, e.g., 2025-01-20 and 2024-12-23 respectively, then the DAG must run twice.
Instructions for ALIS:
- Verify that the ALIS snapshot has zero SLIS
- In the top tabs, go to Admin > Variables
- Click on the Edit record icon to the left of the platform_eng/dags/cassandra_dag.py Key
- Fill weekly_snapshot with the correct snapshot and save
- Go back to the cassandra DAG
- Click on the green bar
- Click Clear or SHIFT-c and confirm
- Click on the wait_for_SLIS square
- Click Mark state as... > failed or SHIFT-f and confirm
Then, repeat the same procedure for SLIS, making sure the snapshot has zero ALIS and that you fail wait_for_ALIS instead.
Production deployment
If you release a new version of a pipeline and bump relevant target DAGs, the change will be automatically deployed.
Make sure the conda environment DAG variable is wiped out, or Airflow won't pick up the new pipeline release:
- Log into https://airflow-platform-eng.wikimedia.org/
- In the top tabs, go to Admin > Variables
- For each relevant DAG, click on the Edit record icon on its left
- Delete the row starting with "conda_env" and save
Switch release
If you want to use a different pipeline release, override the "conda_env" value with the version you want to run and save.