Machine Learning/LiftWing/Alerts

Kafka Consumer lag - ORESFetchScoreJobKafkaLag alert

When the ORESFetchScoreJobKafkaLag alert fires it means that messages are landing in Kafka topics faster than Changeprop is consuming them, so consumer lag is building up. What may be happening:

  • ORESFetchScoreJobs are taking too long to complete, which results in a growing job queue.
  • Lift Wing may be returning errors, so Changeprop keeps retrying and jobs take longer to complete.
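
If the dashboards are not conclusive, the lag itself can be inspected directly against Kafka. Below is a minimal sketch using the kafka-python client; the broker address and consumer group name are placeholders (assumptions), not the exact production values:

  # Minimal consumer-group lag check with kafka-python.
  # Broker address and group name are placeholder assumptions.
  from kafka import KafkaAdminClient, KafkaConsumer

  BOOTSTRAP = "kafka-broker.example.wmnet:9092"    # placeholder broker
  GROUP = "changeprop-ORESFetchScoreJob"           # placeholder consumer group

  admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
  committed = admin.list_consumer_group_offsets(GROUP)   # {TopicPartition: OffsetAndMetadata}

  consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
  end_offsets = consumer.end_offsets(list(committed))    # {TopicPartition: log end offset}

  for tp in sorted(committed, key=lambda p: (p.topic, p.partition)):
      lag = end_offsets[tp] - committed[tp].offset
      print(f"{tp.topic}[{tp.partition}] lag={lag}")

Per-partition lag that keeps growing confirms that consumption is falling behind production, which is what the alert is telling us.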

The first thing to do is to explore mediawiki-errors on Logstash and review the Grafana Jobqueue dashboard, using the "Job run duration" chart to spot when the increase started. Check whether there were any deployments at that time affecting either the ORES extension or Lift Wing.
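
The "Job run duration" trend can also be pulled programmatically from the Prometheus HTTP API, which makes it easier to line the increase up against deployment timestamps. This is only a sketch: the Prometheus endpoint and the metric/label names are assumptions, not the exact series behind the Jobqueue dashboard.

  # Sketch: fetch a job-duration series over a time window from Prometheus.
  # The endpoint URL, metric name and labels are assumptions.
  import requests

  PROM = "http://prometheus.example.wmnet/api/v1/query_range"   # placeholder endpoint
  QUERY = (
      'histogram_quantile(0.99, sum by (le) ('
      'rate(jobqueue_job_duration_seconds_bucket{job="ORESFetchScoreJob"}[5m])))'
  )   # placeholder metric and labels

  resp = requests.get(PROM, params={
      "query": QUERY,
      "start": "2024-05-01T00:00:00Z",   # window around the suspected change
      "end": "2024-05-01T06:00:00Z",
      "step": "5m",
  })
  resp.raise_for_status()
  for series in resp.json()["data"]["result"]:
      for timestamp, value in series["values"]:
          print(timestamp, value)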

Changes for Lift Wing can be tracked in the inferenceservices repository. For the ORES extension we can follow the MediaWiki deployment notes. For example, if we notice that the job duration increased when MediaWiki_1.31/wmf.29 was deployed, we visit the changelog page for MediaWiki and check whether there is any deployment related to the ORES extension repository.

We should also explore the logs on Lift Wing to figure out if there is an obvious reason for this lag.

If there have been changes that increase job duration and nothing else can be done, increasing the job concurrency value will help drain the queue faster, but we should keep in mind that these resources are shared among jobs, so increasing the concurrency may affect other jobs. This can be changed in the changeprop-jobqueue values in the deployment-charts repository.
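
Before changing the value it can help to do a quick back-of-the-envelope estimate of how long the backlog will take to drain at a given concurrency. The numbers below are illustrative assumptions, not measurements:

  # Rough drain-time estimate for the job backlog.
  # All numbers are illustrative assumptions, not measured values.
  backlog = 120_000          # messages currently lagging
  arrival_rate = 50.0        # new jobs per second
  avg_job_duration = 2.0     # seconds per job

  for concurrency in (100, 150, 200):
      drain_rate = concurrency / avg_job_duration        # jobs completed per second
      if drain_rate <= arrival_rate:
          print(f"concurrency={concurrency}: backlog does not shrink")
          continue
      drain_minutes = backlog / (drain_rate - arrival_rate) / 60
      print(f"concurrency={concurrency}: ~{drain_minutes:.0f} minutes to drain")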

Inference Services High Memory Usage - InfServiceHighMemoryUsage alert

This alert fires when memory utilization of the kserve-container of an Inference Service is above 90% of the container limit for more than 5 minutes. If a deployment has been made recently, we focus on the recent changes to find what is increasing memory usage. Otherwise, we explore the requests that have been made by looking at the logs, to see if the issue is connected to the inputs. In both cases, examining the memory utilization before the spike on the "Kubernetes Container Details" Grafana dashboard will help us understand whether the specified resources were simply too tight or there is an issue that needs to be resolved.
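
To see how close the container was to its limit, a ratio similar to the one behind the alert (working-set memory over the configured limit) can also be queried from Prometheus directly. A minimal sketch, assuming the standard cAdvisor and kube-state-metrics series; the endpoint and namespace are placeholders and the exact expression behind the alert may differ:

  # Sketch: memory usage of the kserve-container as a fraction of its limit.
  # Endpoint and namespace are placeholders; the alert's exact expression may differ.
  import requests

  PROM = "http://prometheus.example.wmnet/api/v1/query"   # placeholder endpoint
  NAMESPACE = "revscoring-editquality-goodfaith"          # example namespace
  QUERY = (
      f'sum by (pod) (container_memory_working_set_bytes{{'
      f'namespace="{NAMESPACE}", container="kserve-container"}}) '
      f'/ sum by (pod) (kube_pod_container_resource_limits{{'
      f'namespace="{NAMESPACE}", container="kserve-container", resource="memory"}})'
  )

  resp = requests.get(PROM, params={"query": QUERY})
  resp.raise_for_status()
  for series in resp.json()["data"]["result"]:
      pod = series["metric"]["pod"]
      _, ratio = series["value"]
      print(f"{pod}: {float(ratio):.0%} of memory limit")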

LiftWingServiceErrorRate

Istio Dashboard

This alert fires if the percentage of error status codes (5XX and 0) from an Istio service is larger than a given threshold. Note that this means 3XXs and 4XXs are not considered errors, since they are usually client-caused. The error code 0 indicates that the connection was closed (usually by the client) before a normal HTTP response code could be sent.
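
In other words, only 5XX and the special code 0 count toward the error rate; redirects and client errors do not. A small illustrative sketch of that classification (the threshold value is an assumption):

  # Which response codes count toward the error rate: 5XX and 0 do, 3XX/4XX do not.
  # The 5% threshold below is a placeholder assumption.
  from collections import Counter

  def is_error(code: int) -> bool:
      return code == 0 or 500 <= code <= 599

  def error_rate(codes: list[int]) -> float:
      counts = Counter(codes)
      errors = sum(n for code, n in counts.items() if is_error(code))
      return errors / max(len(codes), 1)

  sample = [200] * 950 + [404] * 20 + [429] * 10 + [503] * 15 + [0] * 5
  rate = error_rate(sample)
  print(f"error rate: {rate:.1%}")    # 404s and 429s excluded, 503s and 0s included
  print("would alert:", rate > 0.05)  # placeholder threshold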

Possible causes for this alert are breakage in the LW service itself, or an upstream timing out or failing in a way that is not handled well by the LW service. It may also be due to a misbehaving client. Note that going over the rate limit would cause 429 responses either from LW or the API GW, so this alert would not fire in that case.