Application servers/Runbook

Load-balancing pooling and depooling

Always run scap pull on an appserver before pooling

From cumin

# Pooling a server
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=yes
# Depooling a server
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=no
# Depool a server and remove it from software distribution
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=inactive
# Set a server's weight
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/weight=$weight
# All-in-one
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=yes:weight=$weight
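
For example, to depool a single eqiad appserver for maintenance and pool it again afterwards (hostname and cluster values are illustrative):

# Depool one host from the eqiad appserver cluster
sudo confctl select dc=eqiad,cluster=appserver,name=mw1363.eqiad.wmnet set/pooled=no
# Pool it again once the work is done
sudo confctl select dc=eqiad,cluster=appserver,name=mw1363.eqiad.wmnet set/pooled=yes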

From the host itself

# Pooling a server
scap pull; sudo -i pool
# Depooling a server
sudo -i depool

Useful dashboards

Apache

Testing config

  • Choose one of the mediawiki debug servers. Then, on that server:
    • Disable puppet: sudo disable-puppet 'insert reason'
    • Apply change locally under /etc/apache2/sites-enabled/
    • sudo apache2ctl restart
  • Test your change by making the relevant HTTP requests; see Debugging in production for how. (A minimal curl sketch follows this list.)
  • When you're done, sudo enable-puppet 'insert reason'
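
As a minimal smoke-test sketch (hostname, Host header and URL are illustrative; see Debugging in production for the full tooling), you can query Apache directly on the debug server:

# From the mwdebug server itself: request the rewritten URL from Apache (port 80)
# and check the status code and Location header
curl -sI -H 'Host: en.wikipedia.org' 'http://localhost/wiki/Special:Random' | head -n 5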

Deploying config

It is a good idea to list any configuration update on the Deployments page: a bad configuration going live can easily result in a site outage.

  • Test your change in deployment-prep and make sure that it works as expected.
  • In the operations/puppet repository, make your change in the modules/mediawiki/files/apache/sites directory.
  • In the same commit, add one or more httpbb tests in the modules/profile/files/httpbb directory, asserting that your change works as you intend. (Consider automating the same checks you just performed by hand.)
    • For example, if you are adding or modifying a RewriteRule, please add tests covering some URLs that are expected to change.
  • On deploy1001, run all httpbb tests on an affected host. Neither your changes nor your new tests are in effect yet, so any test failures are unrelated. All tests are expected to pass -- if they don't, you should track down and fix the problem before continuing.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug1001.eqiad.wmnet
    Sending to mwdebug1001.eqiad.wmnet...
    PASS: 99 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
    
  • Submit your change and tests to gerrit as a single commit.
  • Disable puppet across the affected mediawiki application servers.
    • Cumin can help in finding the precise set of hosts. For example, this is a recent query:
      cumin 'R:File = "/etc/apache2/sites-available/04-remnant.conf"' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/#/c/380774/"' -b 10
      
      In this case the change was related to a RewriteRule change in 04-remnant.conf; adjust the file path in the query each time to match the file(s) modified by your Gerrit change.
  • Merge via Gerrit and run the usual puppet-merge on puppetmaster1001.
  • Go to one of the mwdebug servers and enable/run puppet. Apache will reload its configuration automatically; check that no error messages are emitted. Running apachectl -t after the puppet run helps verify that the new configuration is syntactically correct (although that alone doesn't guarantee it will work as intended).
    • Some Apache directive changes need a full restart to be applied, not a simple reload. These changes are very rare and are clearly indicated in Apache's documentation, so please verify beforehand. Simple RewriteRule changes require only an Apache reload.
  • On deploy1001, re-run all httpbb tests on an affected host. Your new tests verify that your intended change is functioning correctly, and re-running the old tests verifies that existing behavior wasn't inadvertently changed in the process. All tests are expected to pass -- if they don't, you should revert your change.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
    Sending to mwdebug1001.eqiad.wmnet...
    PASS: 101 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
    
  • Enable/run puppet on another mediawiki application server that is taking traffic, depooling it beforehand via confctl. Verify again from deploy1001 that everything is working as expected by running httpbb. (A command sketch of this canary step follows this list.)
  • Repool the host mentioned above and verify in the Apache access logs that everything looks fine. If you want to be extra paranoid, you can check the host-level metrics via https://grafana.wikimedia.org/d/000000327/apache-fcgi?orgId=1 and make sure that nothing is out of the ordinary.
  • Re-enable puppet across the appservers previously disabled via cumin.
  • Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.
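
Put together, the canary steps above might look roughly like this (hostname is illustrative; the commands mirror the pool/depool, puppet and httpbb examples elsewhere on this page):

# On a traffic-taking appserver: depool, re-enable and run puppet, syntax-check Apache
sudo -i depool
sudo enable-puppet 'same reason used when disabling'
sudo run-puppet-agent
sudo apachectl -t

# From deploy1001: re-run the httpbb suites against that host
httpbb /srv/deployment/httpbb-tests/appserver/* --host mw1363.eqiad.wmnet

# Back on the appserver: repool once everything passes
sudo -i pool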

Apache logs

You can find apache's request log at /var/log/apache2/other_vhosts_access.log

Mcrouter

Mcrouter never breaks (TM)

MySQL

See Debugging in production#Debugging databases.

Envoy

Envoy is used for:

  • TLS termination: envoy listens on port 443 and proxies the request to apache listening on port 80
  • Services proxy: for proxying calls from MediaWiki to external services

It's a resilient service and it usually should not fail. Some quick pointers:

  • Logs are under /var/log/envoy.
  • /var/log/envoy/syslog.log (or sudo journalctl -u envoyproxy.service) to see the daemon logs
  • Verify that configuration is valid: sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate.
  • Envoy uses a hot restarter that allows seamless restarts without losing a request. Use systemctl reload envoyproxy.service unless you really know why that wouldn't work.
  • You can check the status of envoy and much other information at http://localhost:9631. Of specific utility is /stats, which returns current stats. Refer to the admin interface docs for details.

If you see an error about runtime variables being set, you can check the runtime config via curl http://localhost:9631/runtime. Reloading envoy (which resets the runtime config) should resolve the alert in a few minutes.
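
A couple of quick queries against the admin interface from the host itself (the grep patterns are illustrative; exact stat names depend on the local envoy configuration):

# Upstream cluster health: membership and health flags per backend
curl -s http://localhost:9631/clusters | grep -E 'health_flags|membership_healthy' | head

# Downstream connection/request counters from the stats endpoint
curl -s http://localhost:9631/stats | grep -E 'downstream_cx_active|downstream_rq_total' | head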

PHP 7

PHP 7 is the interpreter we use for serving MediaWiki. This page collects resources about how to troubleshoot and fix some potential issues with it. For more general information about how we serve MediaWiki in production, refer to Application servers.

Logging from PHP

The php-fpm daemon sends its own logs via Rsyslog to Kafka/Logstash under type:syslog program:php7.4-fpm. These are also stored locally on each app server under /var/log/php7.4-fpm/.

The php-fpm daemon also maintains a "slow request" log that can be found at /var/log/php7.4-fpm-www-7.4-slowlog.log

The MediaWiki application sends its logs directly to Rsyslog at localhost:10514 (per wmf-config/logging.php), and they are forwarded from there to Kafka/Logstash under type:mediawiki. To inspect these in transit on the network, you can tail the packets on any given appserver via sudo tcpdump -i any -l -A -s0 port 10514.

Any other bare syslog() calls in PHP, such as from php-wmerrors, also end up in Logstash under type:mediawiki. These don't go to the local Rsyslog port but rather to the kernel directly, and are then forwarded to Kafka/Logstash. You can inspect those on their way out via sudo tcpdump -i any -l -A -s0 port 8420. Note that this will include both MediaWiki's structured logs and PHP syslog calls.

Dashboards

Debugging procedures and tools

php7adm

php7adm is a tool that lets you interact with the local php-fpm daemon to gather information on the running status of the application. Its usage is pretty simple:

$ php7adm [OPTION]

To see a list of available actions, just run the command without arguments:

$ php7adm 
Supported urls:

  /metrics         Metrics about APCu and OPcache usage
  /apcu-info       Show basic APCu stats
  /apcu-meta       Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
  /apcu-free       Clear all data from APCu
  /opcache-info    Show basic opcache stats
  /opcache-meta    Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
  /opcache-free    Clear all data from opcache

All data, apart from the /metrics endpoint, is reported in JSON format.
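
For example, to eyeball opcache pressure on a host (a minimal sketch; jq availability and the exact JSON field names depend on the host):

# Pretty-print the opcache stats (JSON), if jq is available
php7adm /opcache-info | jq .
# /metrics is plain text, so grep works directly
php7adm /metrics | grep -i opcache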

Low-level debugging

php-fpm is a prefork-style appserver, which means that every child process serves just one request at a time, so attaching strace to an individual process should give you a lot of information about what is going on there. We still don't have an automated dumper of stacktraces from php-fpm, but you can use quickstack for a quick peek at the stacktraces, or gdb for more details.
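
A minimal sketch for inspecting a single busy worker (the process name follows the php7.4-fpm packaging referenced above; the strace filter is just an example):

# Find the php-fpm worker currently using the most CPU
ps -C php-fpm7.4 -o pid,pcpu,etimes,args --sort=-pcpu | head -n 5

# Attach to one worker and watch its syscalls (replace <pid>); network I/O is
# usually the interesting part when a request is stuck on a backend
sudo strace -p <pid> -tt -e trace=network -s 200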

Response to common alerts

Average latency exceeded

This alert means something is currently very wrong, and MediaWiki is responding to clients at an unusually slow pace. This can be due to a number of reasons, but typically a slow response from all servers means some backend system is responding slowly. A typical troubleshooting session should go as follows:

  • Check the application server RED dashboard in the panels "mcrouter" and "databases" to quickly see if anything stands out
  • Check SAL for any deployments corresponding to the time of the alert or a few minutes earlier. If there is any, request a rollback while you keep debugging. Worst case scenario, the changes will have to be deployed again, but in many cases you'll have the resolution of the outage.
  • ssh to one server in the cluster that is experiencing the issue. Check the last entries in the php-fpm slowlog (located at /var/log/php7.4*-slowlog.log); if all the requests you see popping up are blocked in a specific function, that should give you a pointer to what isn't working: caches, databases, backend services. (A quick aggregation sketch follows this list.)
  • For databases, check the slow query dashboard on Logstash
  • For caches, check the memcached dashboards on Grafana.
  • For curl requests, you can check the envoy telemetry dashboard - set the origin cluster to the cluster where you're seeing latency (excluding local_port_XX which is pointing to the local appserver)
  • If none of the above works, escalate the problem to the wider team
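
A quick way to see which functions dominate the slowlog (a sketch that assumes the standard php-fpm slow-log format, where each stack frame starts with a [0x...] address followed by the function name):

# Count the most common stack-frame functions in recent slowlog entries
tail -n 2000 /var/log/php7.4-fpm-www-7.4-slowlog.log \
  | grep -oP '^\[0x[0-9a-f]+\]\s+\K\S+' \
  | sort | uniq -c | sort -rn | head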

PHP7 rendering

This alert comes from trying to render a page on enwiki using php7 (not HHVM). Since the request goes through apache httpd, first check if apache is alerting as well, then look at opcache alerts. If there is a critical alert on opcache too, look at the corresponding section below. If only the php7 rendering is alerting, check the following:

  • What does the php-fpm log say? Any specific errors repeating right now?
 
$ tail -f /var/log/php7.4-fpm/error.log | fgrep -v '[NOTICE]'
Jun  0 00:00:00 server php7.4-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging
...

For example, if you see a lot of slow requests from a specific PID, it might be interesting to strace it. If some strange and unique error message is present, the opcache is probably corrupted. In that case, confirm by resetting the opcache (e.g. with php7adm /opcache-free) and verifying that the problem goes away. Check whether we have an open ticket about opcache corruptions and record the occurrence there.

  • What can you see looking at non-200 responses that come from php-fpm in the apache log? Any trend? Anything that stands out?
# This will just show 5xx errors, nothing else.
 $ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
  • If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure
$ sudo -i depool

IMPORTANT: Remember to repool the server afterwards.

If this happens once, and on just one server, I suggest just restarting php-fpm:

$ sudo -i /usr/local/sbin/restart-php7.4-fpm

and watch the logs/icinga to see if the issue is resolved. If the issue affects more than one server, escalate to the SRE team responsible for the service.

PHP7 Too Busy, not enough idle workers

This means that we don't have enough idle workers in the mentioned cluster (api/appservers), which causes request queuing and therefore user-visible latency.

If the idle worker pool is exhausted, an incident affecting all wikis will ensue, with possible domino effects. It can happen due to a variety of factors, including bad MediaWiki releases, bad configuration, or problems with the services and backends we reach out to (memcache, databases, other APIs, etc.). It is also highly dependent on traffic load.

  • Check the application server RED dashboard in the panels "mcrouter" and "databases" to quickly see if anything stands out
  • Check SAL for any deployments corresponding to the time of the alert or a few minutes earlier. If there is any, request a rollback while you keep debugging. Worst case scenario, the changes will have to be deployed again, but in many cases you'll have the resolution of the outage.
  • ssh to one server in the cluster that is experiencing the issue. Check the last entries in the php-fpm slowlog (located at /var/log/php7.4*-slowlog.log); if all the requests you see popping up are blocked in a specific function, that should give you a pointer to what isn't working: caches, databases, backend services (see the slowlog aggregation sketch in the previous section).
  • For databases, check the slow query dashboard on Logstash
  • For caches, check the memcached dashboards on Grafana.
  • For curl requests, you can check the envoy telemetry dashboard - set the origin cluster to the cluster where you're seeing latency (excluding local_port_XX which is pointing to the local appserver)
  • If this is a problem in the MW-on-k8s cluster specific to Wikifunctions, it may be due to abuse; in an emergency, you can follow the runbook to disable function execution there.
  • If none of the above works, escalate the problem to the wider team
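
To get a rough sense of how close a host is to worker exhaustion, compare the number of running workers with the configured ceiling (a sketch; the pool config path is an assumption based on the standard Debian php7.4-fpm layout):

# Running php-fpm processes on this host (includes the master process)
ps -C php-fpm7.4 --no-headers | wc -l
# Configured ceiling for the 'www' pool (path assumed from the Debian packaging)
grep -E '^pm.max_children' /etc/php/7.4/fpm/pool.d/www.conf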

Videoscalers

For historical reasons, videoscalers and jobrunners used to be the same cluster. At the time of this writing (2024-04-15) this is no longer true. Jobrunners have moved to MediaWiki on Kubernetes. For a while, as we fix docs and alerts, you might see out of date information or links.

Sometimes their performance is impacted by an overwhelming amount of video encodes. Quick diagnostic: many (100+) ffmpeg processes running on a videoscaler server, icinga checks timing out, etc.

You should also look at the overall health of the jobrunner server group (not split by jobrunner vs videoscaler).


You can log into the host and kill any remaining ffmpeg processes (sudo pkill ffmpeg). The job queue should automatically retry them later.
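
Before killing anything, it can help to see how many encodes are running and how long they have been going (a quick sketch):

# How many ffmpeg encodes are running right now?
pgrep -c ffmpeg
# The longest-running ones: elapsed time, CPU and command line
ps -C ffmpeg -o pid,etimes,pcpu,args --sort=-etimes | head -n 10
# If the host is overwhelmed, kill them; the job queue will retry the jobs later
sudo pkill ffmpeg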

Monitoring:

  • [1] look for Job run duration p99.
  • [2] Host profile for a videoscaler host during incident.

Tasks:

  • [3] April 12 2021 video uploads
  • [4] Videoscaler Overload Incident

scap proxy and canary

Proxy

A scap proxy is an intermediate rsync proxy between the deployment host and the rest of the production infrastructure.

You can find a list of them in hieradata/common/scap/dsh.yaml

Canary

A canary is one of the first hosts to have new code deployed via scap. It is checked by scap for its error rate, and scap auto-aborts the deployment if it is too high.

To list them: ssh cumin1001.eqiad.wmnet confctl select service=canary get

Service Ops

Adding a new server into production

  • Create a DNS patch to assign IP addresses to the new servers. This is usually done by dcops nowadays, but they might want your review for it. (example change)
  • Create a puppet patch that adds the servers with the right regexes in site.pp. Apply the spare::system puppet role. (example change)
  • Decide which role this server should have (appserver, API appserver, jobrunner,..). Use Netbox to search for the host and see which rack it is in. Try to balance server roles across both racks and rows.
  • Create a puppet patch that adds the proper role to the servers and adds them in conftool-data in the right section. Don't merge it yet. (example change)
  • Schedule Icinga downtimes for your new hosts for 1h. ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet
  • Merge the patch to add puppet roles to the new servers.
  • Force a puppet run via cumin. Some errors are normal in the first puppet run. ex: dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q'
  • Force a second puppet run via cumin. It should complete successfully.
  • Run downtime with force-puppet-run via cumin ex: sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet
  • Run a restart of all apache2 processes ex: sudo cumin mw24[20-51].codfw.wmnet 'systemctl restart apache2'
  • Watch all (new) Icinga alerts on the hosts turn green and make sure Apache does not have to be restarted. You can "reschedule next service check" in the Icinga web UI to speed things up. It is expected that the "not in dsh group" alert stays CRIT until the server is pooled below. Once all alerts besides that one are green (not PENDING and not CRIT) it is ok to go ahead.
  • Check for ongoing deployments. Wait if that is the case. You can use "jouncebot: now" on IRC, or check the Deployments page.
  • Run "scap pull" on new servers to ensure latest MediaWiki version deployed is present.
  • Give the server a weight with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1355.eqiad.wmnet set/weight=30
  • Pool the server with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet set/pooled=yes
  • Watch the Grafana Host Overview dashboard, select the server and see that it is getting traffic. (A quick confctl verification sketch follows this list.)
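
As a final sanity check, read back the conftool state for the new host (hostname illustrative, same confctl syntax as above):

# On cumin: confirm the weight and pooled state of the new server
sudo -i confctl select name=mw1353.eqiad.wmnet get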

Spreading application servers out across rows and racks

We aim to spread out application server roles (regular appserver, API appserver, etc) across both rows (ex. B) as well as racks (ex. B3) in each of the main data centers (currently eqiad and codfw). When an entire rack or entire row fails, this distribution of hosts minimizes the impact to any single role.

Removing old appservers from production (decom)

  • Identify the servers you want to decom in Netbox. The procurement ticket linked from there tells you the purchase date, so you can see how old they are.
  • Create a Gerrit patch that removes the servers from site.pp and conftool-data. (example change) but don't merge it yet.
  • Set the servers to 'pooled=no' and watch in Grafana how they stop serving traffic, temperature goes down etc. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw123[2-5].eqiad.wmnet' set/pooled=no
  • If needed, make and deploy any mediawiki-config changes
  • Use the downtime cookbook to schedule monitoring downtimes for the servers. Give a reason and link to your decom ticket. ex: [cumin1001:~] $ sudo cookbook sre.hosts.downtime -r decom -t T247780 -H 2 mw125[0-3].eqiad.wmnet.
  • If everything seems fine, set the servers to 'pooled=inactive' now. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw125[0-3].eqiad.wmnet' set/pooled=inactive
  • If you are sure, run the actual decom cookbook now. This step is destructive so you will have to reinstall servers to revert. ex: [cumin1001:~] $ sudo cookbook sre.hosts.decommission mw125[0-3].eqiad.wmnet -t T247780
  • Merge your prepared puppet change to remove them from site and conftool-data.
  • optional: Run puppet on Icinga and see the servers and services on them disappear from monitoring.
  • optional: Confirm in Netbox the state of the servers is "decommissioning" now.
  • Check for any other occurrences of the hostnames in the puppet repo. (A grep sketch follows this list.)
  • Check if any of the servers was a scap proxy (hieradata/common/scap/dsh.yaml). Remove if needed. (example change)
  • Hand over the decom ticket to dcops for physical unracking and the final steps in the server lifecycle.
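
To catch leftover references before handing the ticket over (a sketch run from an operations/puppet checkout; hostnames illustrative):

# Any remaining mentions of the decommissioned hosts in the puppet repo?
git grep -nE 'mw125[0-3]\.eqiad\.wmnet'
# Double-check the scap proxy list mentioned above
grep -n 'mw125' hieradata/common/scap/dsh.yaml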