Monitoring/EdgeTrafficDrop

From Wikitech

EdgeTrafficDrop is a Prometheus Alertmanager alert defined in traffic.yaml in the operations/alerts repo. The alert fires when the request rate differs by a significant percentage from the recent past, which may indicate a traffic anomaly.
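To make "percentage difference in request rate" concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration: the real alert expression lives in traffic.yaml and is not reproduced here, and the function names and the 40% threshold are hypothetical.

```python
def percent_change(current_rate: float, baseline_rate: float) -> float:
    """Signed percentage change of current_rate relative to baseline_rate.

    A negative result means traffic dropped compared to the baseline.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (current_rate - baseline_rate) / baseline_rate * 100.0


def is_significant_drop(current_rate: float, baseline_rate: float,
                        threshold_pct: float = 40.0) -> bool:
    """True if the rate fell by at least threshold_pct percent.

    threshold_pct is a hypothetical value, not the one used by the alert.
    """
    return percent_change(current_rate, baseline_rate) <= -threshold_pct
```

For instance, a rate of 56,000 requests per second against a recent baseline of 100,000 is a change of about -44%, which this sketch would treat as a significant drop.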

Things to do

Check the varnish-caching-last-week-comparison dashboard for the affected cluster and site. For example, if the alert says "44% GET drop in text@codfw during the past 30 minutes", select text as the cluster and codfw as the site. If the curve shows a clear drop without a preceding increase, as in the image Traffic Drop on the right, we may have served less traffic than normal because of an attack or an anomaly in our infrastructure. If the traffic does not seem to recover on its own, page the Traffic team.
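The cluster and site can also be pulled out of the alert text programmatically, for example to pre-fill dashboard selectors. The Python sketch below assumes the message always follows the shape of the single example quoted above; the actual alert annotation may differ, and the names AlertTarget and parse_alert are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AlertTarget(NamedTuple):
    percent: int   # size of the drop, in percent
    method: str    # HTTP method, e.g. "GET"
    cluster: str   # cache cluster, e.g. "text"
    site: str      # data center, e.g. "codfw"
    minutes: int   # length of the comparison window


# Pattern inferred from one example message; treat it as an assumption.
_PATTERN = re.compile(
    r"(?P<percent>\d+)% (?P<method>[A-Z]+) drop in "
    r"(?P<cluster>[\w-]+)@(?P<site>[\w-]+) "
    r"during the past (?P<minutes>\d+) minutes"
)


def parse_alert(summary: str) -> Optional[AlertTarget]:
    """Extract the dashboard selectors from an alert summary, or None."""
    m = _PATTERN.search(summary)
    if m is None:
        return None
    return AlertTarget(
        percent=int(m["percent"]),
        method=m["method"],
        cluster=m["cluster"],
        site=m["site"],
        minutes=int(m["minutes"]),
    )
```

Calling parse_alert on the example message yields cluster "text" and site "codfw"; any message that does not match the assumed shape returns None rather than a guess.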

Traffic Drop

If instead the curve looks like a spike, as shown in the image Traffic Spike on the right, the alert is likely due to anomalous incoming traffic and in general there is not much to worry about.

Traffic Spike
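The distinction between the two curve shapes can be sketched as a rough heuristic in Python. This is not how the alert or the dashboard works; it only restates the visual check above, and the function name, the baseline input, and the 20% tolerance are all hypothetical.

```python
from typing import Sequence


def classify_shape(rates: Sequence[float], baseline: float,
                   tolerance: float = 0.2) -> str:
    """Classify a window of request rates as 'spike', 'drop', or 'unclear'.

    rates:     samples in chronological order (oldest first)
    baseline:  the normal rate, e.g. the same period last week
    tolerance: hypothetical fraction of deviation treated as noise
    """
    if not rates:
        raise ValueError("rates must not be empty")
    if baseline <= 0:
        raise ValueError("baseline must be positive")

    # An earlier increase well above normal means the "drop" is most
    # likely just traffic returning to normal after a spike.
    had_increase = max(rates) > baseline * (1 + tolerance)
    # The latest sample well below normal, with no earlier increase,
    # matches the "clear drop" case.
    below_normal = rates[-1] < baseline * (1 - tolerance)

    if had_increase:
        return "spike"
    if below_normal:
        return "drop"
    return "unclear"
```

With a baseline of 100, the window [100, 250, 240, 105] is classified as a spike, while [100, 98, 60, 55] is classified as a drop. Real traffic is noisier than this, so the dashboards remain the reference for the decision.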

Regardless of the shape of the curve, you should do the following:

  • Look at the load-balancers-lvs dashboard for the given site. If you see clear spikes there, it's probably a DoS attack. See the (D)DoS Playbook.
  • Take a look at varnish-aggregate-client-status-codes for the relevant site/cluster to learn more about the type of traffic, in particular whether any specific method/status code stands out.
  • Dig into webrequest_sampled_128 on Turnilo for the specific details of the type of requests causing the spike.
  • Let #wikimedia-traffic know about your findings.