Image-suggestion/Runbook


Sometimes the data that the image-suggestions pipeline (or a related pipeline) relies on fails to generate.

Usually the first sign of a failure is an email to sd-alerts@lists.wikimedia.org, with "ImageSuggestionsTooLongSinceLastPush" in the subject line.

What to do when that happens

The first thing to do is pause the DAGs. Set up an SSH tunnel to the Airflow server (ssh -t -N an-airflow1004.eqiad.wmnet -L 8600:127.0.0.1:8600), then navigate to http://localhost:8600/. Use the little slider next to each DAG in the DAGs tab to pause it until everything is resolved.

Next, go into Hive and check whether the partitions specified in the DAGs exist. So far, the table whose partition has most often failed to generate is wmf.wikidata_item_page_link.
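The partition check can be scripted along the following lines. This is a minimal sketch: SHOW PARTITIONS is standard HiveQL, but the helper names and the assumed partition layout (a snapshot=YYYY-MM-DD key, possibly followed by other keys) are illustrative and not taken from the DAGs.

```python
def show_partitions_query(table: str) -> str:
    """Build the standard HiveQL statement that lists a table's partitions."""
    return f"SHOW PARTITIONS {table};"


def has_snapshot(show_partitions_output: str, snapshot: str) -> bool:
    """Return True if any listed partition contains snapshot=<snapshot>.

    Expects the plain-text output of SHOW PARTITIONS, one partition per line,
    e.g. 'snapshot=2024-01-01/wiki_db=enwiki' (assumed layout).
    """
    wanted = f"snapshot={snapshot}"
    for line in show_partitions_output.splitlines():
        # Partition specs are slash-separated key=value pairs.
        if wanted in line.strip().split("/"):
            return True
    return False
```

Paste the output of the query into has_snapshot together with the snapshot the DAG expects; a False result means the upstream partition is missing.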

Go to #working-with-data in Slack and ask whether anyone knows why the partition didn't generate, and whether they can kick off generation.

Once all the upstream data is ready, go to the grid view of each failed DAG run and click on the little red box that indicates where something has failed. Add a note to explain what happened, then click "clear" and it'll run again.

Once the failed DAG has finished, you can unpause the DAGs and subsequent DAGs will run.

DAGs timeout

All DAGs are set to time out after 6 days, see e.g., image suggestions. Since they are all scheduled to run weekly on Mondays, the timeout ensures that runs never overlap: any hanging DAG run is stopped on the following Sunday, before the next run starts. The task that caused a DAG to time out is marked with the skipped state and colored purple in the Airflow Web UI, while the DAG run itself is marked with the failed state.
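The arithmetic behind the "no concurrent runs" guarantee can be checked with a few lines of Python. This is only an illustration of the timing described above; the constant and function names are hypothetical and do not come from the DAG code.

```python
from datetime import datetime, timedelta

DAGRUN_TIMEOUT = timedelta(days=6)      # timeout stated in this runbook
SCHEDULE_INTERVAL = timedelta(weeks=1)  # weekly runs, on Mondays


def latest_stop(run_start: datetime) -> datetime:
    """Latest moment a run can still be active before the timeout stops it."""
    return run_start + DAGRUN_TIMEOUT


def can_overlap_next_run(run_start: datetime) -> bool:
    """True if a hanging run could still be active when the next one starts."""
    next_start = run_start + SCHEDULE_INTERVAL
    return latest_stop(run_start) > next_start


monday = datetime(2024, 1, 1)  # 1 January 2024 was a Monday
print(latest_stop(monday).date(), can_overlap_next_run(monday))
```

A run that starts on a Monday is stopped by the following Sunday at the latest, one day before the next scheduled start.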

Search indices

The image-suggestions DAG populates a Hive table analytics_platform_eng.image_suggestions_search_index_full with all the data relevant to image suggestions that should exist in the search indices. It also creates analytics_platform_eng.image_suggestions_search_index_delta, which is the difference between the latest set of image_suggestions_search_index_full data and the equivalent dataset from discovery.cirrus_index_without_content.
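Conceptually, the delta holds only what has to change in the search indices. The sketch below illustrates that idea with plain dictionaries. The record shape (a page ID mapped to a set of tags), the function name, and the convention of expressing removals as an empty set are all assumptions made for illustration; the real tables and diff logic may differ.

```python
from __future__ import annotations


def compute_delta(
    full: dict[int, set[str]], current: dict[int, set[str]]
) -> dict[int, set[str]]:
    """Return the entries that must be written to bring `current` in line with `full`."""
    delta: dict[int, set[str]] = {}
    for page_id, tags in full.items():
        # New pages and pages whose tags changed go into the delta.
        if current.get(page_id) != tags:
            delta[page_id] = tags
    # Pages in the index but absent from `full` get an empty set (assumed convention).
    for page_id in current.keys() - full.keys():
        delta[page_id] = set()
    return delta
```

Pages whose data already matches the index are left out, which keeps the update to the indices small.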

The search team have a DAG that picks up analytics_platform_eng.image_suggestions_search_index_delta and injects the data into the search indices.

Cassandra TTL

The TTL for Cassandra data is 3 weeks, so if a pipeline keeps failing for longer than that, the Cassandra data will expire and disappear. You can reset the TTL like this:

  • Find the most recent successful DAG run
  • Click on the first hive_to_Cassandra node
  • Look at the rendered template to check that the snapshot is correct
  • Add a note describing what you’re doing
  • Click “clear” and confirm

The node will then re-run, resetting the Cassandra TTL in the process.
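To judge how urgent a TTL reset is, it helps to work out when the data from the last successful push expires. The sketch below does that arithmetic, assuming the TTL countdown starts at the time of the push; the names are hypothetical.

```python
from datetime import datetime, timedelta

CASSANDRA_TTL = timedelta(weeks=3)  # TTL stated in this runbook


def expires_at(last_push: datetime) -> datetime:
    """Moment at which data written by `last_push` expires."""
    return last_push + CASSANDRA_TTL


def time_left(last_push: datetime, now: datetime) -> timedelta:
    """Remaining lifetime; zero or negative means the data has already expired."""
    return expires_at(last_push) - now
```

For example, data pushed on 1 January expires on 22 January, so a pipeline that has been failing for two weeks leaves one week to either fix it or reset the TTL.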