Mw-mcrouter

mw-mcrouter is the daemonset that proxies all MediaWiki memcached requests to our memcached cluster. It runs the almighty mcrouter.

mcrouter image and exporter

The image is defined in the production images repo, where its defaults are set.

The image version used in production is defined in the puppet repo under profile::kubernetes::deployment_server::general:common_images

Daemonset

mw-mcrouter runs as a daemonset, i.e. every k8s node runs an instance of it. This includes the dedicated Kubernetes nodes running kask, so there are mw-mcrouter pods that receive no traffic at all.

Service

mw-mcrouter is using the mcrouter chart. Notable keys in values.yaml:

cache:mcrouter:public_service: true enables mcrouter as a standalone service
service:use_node_local_endpoints: true routes requests to the node-local endpoint of a pod
cache:mcrouter:service:clusterIP: the static IP (per DC) where the service listens, as defined in Kubernetes/Service_ips
- eqiad ClusterIP: 10.64.72.12
- codfw ClusterIP: 10.192.72.12

values-eqiad.yaml

cache:
  mcrouter:
    service:
      clusterIP: 10.64.72.12
      enabled: true
    route_prefix: eqiad/mw
    zone: eqiad
    routes:
      - route: /eqiad/mw
        pool: eqiad-servers
        failover_time: 600
      - route: /codfw/mw
        pool: codfw-servers
        failover_time: 600
      - route: /eqiad/mw-wan
        failover_time: 600
        pool: eqiad-servers
        replica:
          route: /codfw/mw-wan
          pool: codfw-servers

Deployment

In the mcrouter chart's daemonset.yaml, the daemonset is configured to update one pod at a time.

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

During an mw-mcrouter deployment:

mcrouter's configuration and image rarely need an update
If a new image has to be pulled, the deployment may take as long as 15 minutes
- This is due to maxUnavailable: 1 above
Alerts for elevated mw-memcached errors may fire
Alerts for the mw-mcrouter helmfile being in a bad state may fire
All the alerts above will clear once the deployment completes
If the deployment is stuck, e.g. because a node has insufficient resources to host the mcrouter pod, see "Deployment is stalled" under Troubleshooting
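
To get a feel for why a one-pod-at-a-time rollout can take around 15 minutes, here is a back-of-the-envelope sketch. The helper name estimate_rollout_seconds and the 4 seconds per pod replacement are assumptions for illustration only, not measured values; 210 is the DESIRED pod count shown in the kubectl get ds output under Troubleshooting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough duration of a DaemonSet rolling update.
# Arguments: number of pods, maxUnavailable, assumed seconds per pod replacement.
estimate_rollout_seconds() {
  local pods=$1 max_unavailable=$2 seconds_per_pod=$3
  # Pods are replaced in sequential batches of maxUnavailable (rounded up).
  local batches=$(( (pods + max_unavailable - 1) / max_unavailable ))
  echo $(( batches * seconds_per_pod ))
}

# 210 pods, maxUnavailable: 1, assumed 4 s per pod (illustrative figure)
estimate_rollout_seconds 210 1 4   # prints 840, i.e. 14 minutes
```

With maxUnavailable: 1 the duration grows linearly with the number of nodes, so any extra per-pod delay (such as pulling a new image) is multiplied by the full pod count.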

Testing changes

The safest way to test changes in mcrouter is to switch the mw-debug MediaWiki deployment to use the in-pod mcrouter container. This is described in the next section.

Switching Mediawiki to in-pod mcrouter container

While it sounds complicated, it is not. To switch mw-debug in eqiad to use the in-pod container, the following stanza must be added:

mw-debug/values-eqiad.yaml

cache:
  mcrouter:
    enabled: true
    route_prefix: eqiad/mw
    zone: eqiad
    routes:
      - route: /eqiad/mw
        pool: eqiad-servers
        failover_time: 600
      - route: /codfw/mw
        pool: codfw-servers
        failover_time: 600
      - route: /eqiad/mw-wan
        pool: eqiad-servers
        failover_time: 600
        replica:
          route: /codfw/mw-wan
          pool: codfw-servers

# Wikifunctions routes, omitted in production
#      - route: /local/wf
#        pool: wf-eqiad
#        # No failover for wikifunction
#        failover_time: 0
#
# use only if testing new images
# common_images:
#   mcrouter:
#     mcrouter: mcrouter:2023.07.17.00-1-20240714
#     exporter: prometheus-mcrouter-exporter:0.0.1-3-20240714

php:
  envvars:
    MCROUTER_SERVER: "127.0.0.1:11213"
    STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST: false
#   MCROUTER_SERVER: "10.64.72.12:4442" # mcrouter-main.mw-mcrouter.svc.cluster.local

Troubleshooting

When dealing with Kubernetes, your answer may be found in the Kubernetes kubectl Cheat Sheet.

Memcached server is down

That is ok, the gutter pool will pick up its traffic.

Memcached Gutter Pool server is down, and we need the Gutter Pool

In this case:

The gutter pool server in question MUST be removed from the configuration
Merge the change in puppet
Run puppet on the active deployment server
Deploy mw-mcrouter

Deployment is stalled

If you are deploying a new version of the daemonset, but you see pods stuck on the previous version and elevated mw-memcached errors:

Check the daemonset's status

jiji@deploy1002:~$ kubectl get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
mcrouter-main   210       210       210     210          210         <none>          30d
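
In the output above, a healthy daemonset has READY and UP-TO-DATE equal to DESIRED. As a minimal sketch, the filter below flags any daemonset where that is not the case. It assumes the default table layout of kubectl get ds; lagging_daemonsets is a hypothetical helper name, and the sample numbers are made up to imitate a rollout in progress.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the default `kubectl get ds` table on stdin and
# prints each DaemonSet whose READY or UP-TO-DATE count is below DESIRED.
lagging_daemonsets() {
  # Columns: $1 NAME, $2 DESIRED, $3 CURRENT, $4 READY, $5 UP-TO-DATE
  awk 'NR > 1 && ($4 + 0 < $2 + 0 || $5 + 0 < $2 + 0) {
    printf "%s ready=%s/%s up-to-date=%s/%s\n", $1, $4, $2, $5, $2
  }'
}

# Made-up sample of a rollout in progress. On a deployment server you would
# pipe the real output instead: kubectl get ds | lagging_daemonsets
printf '%s\n' \
  'NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE' \
  'mcrouter-main   210       210       209     180          209         <none>          30d' \
  | lagging_daemonsets
```

If UP-TO-DATE stops increasing between two runs, the rollout is most likely blocked on a single node. kubectl rollout status ds/mcrouter-main reports similar progress information interactively.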

Check events and pods to find which node may be stalling the rollout
- If it is a resource problem (e.g. insufficient CPU), you may delete an arbitrary pod on the node in question to free up resources (as root)

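# Use the admin credentials for the eqiad cluster
kube_env admin eqiad
# List mw-mcrouter events, oldest first, to spot scheduling failures
kubectl -n mw-mcrouter get events --sort-by=.metadata.creationTimestamp
# List every pod running on the suspect node (adjust the node name)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=wikikube-worker1001.eqiad.wmnet
# Delete one pod on that node to free resources (the pod name is an example)
kubectl -n mw-api-ext delete po mw-api-ext.eqiad.main-koko-lala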
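
The event list can be long, so it may help to keep only the scheduling failures. The following is a sketch that assumes the default column layout of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); scheduling_failures is a hypothetical helper name, and the pod names and messages in the sample are invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the default `kubectl get events` table on stdin
# and prints the object and message of every FailedScheduling event.
scheduling_failures() {
  # Data rows: $1 LAST SEEN, $2 TYPE, $3 REASON, $4 OBJECT, the rest is MESSAGE
  awk '$3 == "FailedScheduling" {
    msg = $5
    for (i = 6; i <= NF; i++) msg = msg " " $i
    print $4 ": " msg
  }'
}

# Invented sample. With cluster access you would pipe the real output:
#   kubectl -n mw-mcrouter get events | scheduling_failures
printf '%s\n' \
  'LAST SEEN   TYPE      REASON             OBJECT                    MESSAGE' \
  '2m          Normal    Pulled             pod/mcrouter-main-aaaaa   Container image already present on machine' \
  '1m          Warning   FailedScheduling   pod/mcrouter-main-bbbbb   0/210 nodes are available: 1 Insufficient cpu.' \
  | scheduling_failures
```

kubectl can also do this filtering server-side with --field-selector reason=FailedScheduling on get events; the text filter above is only convenient when you already have the output in hand.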

Dashboards

Links