
Application servers/Runbook


Useful dashboards

Apache

Testing config

  • Choose one of the mediawiki debug servers. Coordinate in #wikimedia-operations to ensure no one else is using the same debug server to test something else, such as another config change or a MediaWiki deployment.
  • Then, on that server:
    • Disable puppet: sudo disable-puppet 'insert reason'
    • Apply change locally under /etc/apache2/sites-enabled/
    • sudo apache2ctl restart
  • Test your change by making the relevant HTTP requests. See Debugging in production for how.
  • When you're done, sudo enable-puppet 'insert reason' (using the same reason string as before). A sketch of the whole sequence follows this list.
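Putting the steps above together, a minimal sketch of the whole session on a debug server (the site file name and reason string are just examples; use a real task ID or description):

sudo disable-puppet 'testing apache config change'
sudo vim /etc/apache2/sites-enabled/<site file>          # apply the change locally (placeholder file name)
sudo apache2ctl configtest                               # sanity-check the syntax first
sudo apache2ctl restart
# ...make the relevant HTTP requests (see Debugging in production)...
sudo enable-puppet 'testing apache config change'        # same reason string as before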

Deploying config

Consider listing any configuration update on the Deployments page: a bad configuration going live can easily result in a site outage.

  • Test your change in deployment-prep and make sure that it works as expected.
  • In the operations/puppet repository, make your change in the modules/mediawiki/files/apache/sites directory.
  • In the same commit, add one or more httpbb tests in the modules/profile/files/httpbb directory, asserting that your change works as you intend. (Consider automating the same checks you just performed by hand.)
    • For example, if you are adding or modifying a RewriteRule, please add tests covering some URLs that are expected to change.
  • On the deployment host, run all httpbb tests on an affected host. Neither your changes nor your new tests are in effect yet, so any test failures are unrelated. All tests are expected to pass -- if they don't, you should track down and fix the problem before continuing.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug.discovery.wmnet --https_port 4444
    Sending to mwdebug.discovery.wmnet...
    PASS: 147 requests sent to mwdebug.discovery.wmnet. All assertions passed.
    
  • Submit your change and tests to gerrit as a single commit.
  • Merge via Gerrit, then run the usual puppet-merge on puppetserver1001.
  • Enable/run puppet on the deployment host. This will deploy the new httpbb tests.
  • On the deployment host, re-run all httpbb tests. Your configuration change is not live yet, so your new test is expected to fail; this confirms the test actually exercises the change, and gives you a baseline to compare against once the change is deployed.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug.discovery.wmnet --https_port 4444
  • Once your change has reached mwdebug, re-run your tests against mw-debug. Your new tests verify that your intended change is functioning correctly, and re-running the old tests verifies that existing behavior wasn't inadvertently changed in the process. All tests are expected to pass -- if they don't, you should revert your change.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug.discovery.wmnet --https_port 4444
    Sending to mwdebug.discovery.wmnet...
    PASS: 148 requests sent to mwdebug.discovery.wmnet. All assertions passed.
    
  • Continue the scap deployment to the rest of MediaWiki_On_Kubernetes
  • On a cumin host, re-run all the httpbb hourly tests to verify everything is fine (see the journal check after this list):
systemctl restart httpbb_kubernetes_mw-api-ext_hourly.service
systemctl restart httpbb_kubernetes_mw-api-int_hourly.service
systemctl restart httpbb_kubernetes_mw-jobrunner_hourly.service
systemctl restart httpbb_kubernetes_mw-parsoid_hourly.service
systemctl restart httpbb_kubernetes_mw-web_hourly.service
  • Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.
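To check the outcome of one of the hourly runs restarted above, reading its journal should be enough (unit names as in the commands above):

sudo journalctl -u httpbb_kubernetes_mw-web_hourly.service -n 30 --no-pager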

Apache logs

You can find Apache's request log at /var/log/apache2/other_vhosts_access.log.
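For example, to watch requests for a particular URL arrive on this host (the path below is just an illustration):

sudo tail -f /var/log/apache2/other_vhosts_access.log | fgrep '/wiki/Special:Random'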

Memcached

Mcrouter never breaks™️, and Memcached never breaks™️ either. Except when they do.

MySQL

See Debugging in production#Debugging databases.

Envoy

Envoy is used for:

  • TLS termination: Envoy listens on port 443 and proxies the request to Apache listening on port 80
  • Services proxy: for proxying calls from MediaWiki to external services

It's a resilient service and should not usually fail. Some quick pointers:

  • Logs are under /var/log/envoy.
  • /var/log/envoy/syslog.log (or sudo journalctl -u envoyproxy.service) to see the daemon logs
  • Verify that the configuration is valid: sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate.
  • Envoy uses a hot restarter that allows seamless restarts without losing a request. Use systemctl reload envoyproxy.service unless you really know why that wouldn't work.
  • You can check the status of envoy and much other information at http://localhost:9631. Of specific utility is /stats, which returns current stats. Refer to the admin interface docs for details.

If you see an alert about runtime variables being set, you can check the runtime config via curl http://localhost:9631/runtime. Reloading envoy (which also resets the runtime config) should clear the alert within a few minutes.
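For example, to check the error counters and then clear a runtime-variable alert on an affected host (the grep pattern is only an illustration; endpoints as documented in the Envoy admin interface docs):

curl -s http://localhost:9631/stats | grep -E 'downstream_rq_(4|5)xx'   # request error counters
curl -s http://localhost:9631/runtime                                   # current runtime overrides
sudo systemctl reload envoyproxy.service                                # hot restart; also resets the runtime config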

PHP 8

PHP 8 is the interpreter we use for serving MediaWiki. This page collects resources about how to troubleshoot and fix some potential issues with it. For more general information about how we serve MediaWiki in production, refer to Application servers.

Logging from PHP

The php-fpm daemon sends its own logs via Rsyslog to Kafka/Logstash under type:syslog program:php7.4-fpm. These are also stored locally on each app server under /var/log/php7.4-fpm/.

The php-fpm daemon also maintains a "slow request" log, which can be found at /var/log/php7.4-fpm-www-7.4-slowlog.log.

The MediaWiki application sends its logs directly to Rsyslog at localhost:10514 (per wmf-config/logging.php); from there they are forwarded to Kafka/Logstash under type:mediawiki. To inspect these in transit, you can tail the network packets on any given appserver via sudo tcpdump -i any -l -A -s0 port 10514.

Any other bare syslog() calls in PHP, such as from php-wmerrors, also end up in Logstash under type:mediawiki. These don't go to the local Rsyslog port but rather to the kernel directly, and are then forwarded to Kafka/Logstash. You can inspect those on their way out via sudo tcpdump -i any -l -A -s0 port 8420. Note that this will include both MediaWiki's structured logs and PHP syslog calls.

Dashboards

Debugging procedures and tools

php7adm

php7adm is a tool that allows you to interact with the local php-fpm daemon to gather information on the running status of the application. Its usage is pretty simple:

$ php7adm [OPTION]

To see a list of available actions, just run the command without arguments:

$ php7adm 
Supported urls:

  /metrics         Metrics about APCu and OPcache usage
  /apcu-info       Show basic APCu stats
  /apcu-meta       Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
  /apcu-free       Clear all data from APCu
  /opcache-info    Show basic opcache stats
  /opcache-meta    Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
  /opcache-free    Clear all data from opcache

All data, apart from the /metrics endpoint, is reported in JSON format.
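For example, to eyeball opcache and APCu usage on the local host (jq is assumed to be installed; otherwise just read the raw JSON):

$ php7adm /opcache-info | jq .
$ php7adm /apcu-info | jq .
$ php7adm /metrics | head     # plain-text metrics, not JSON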

Low-level debugging

php-fpm is a prefork-style appserver, which means every child process serves just one request at a time, so attaching strace to an individual process should give you a lot of information about what is going on there. We still don't have an automated dumper of stack traces from php-fpm, but you can use quickstack for a quick peek at the stack traces, or gdb for more details.
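A minimal sketch of that workflow, assuming the worker process name is php-fpm7.4 (adjust to the installed version); <PID> is a busy worker picked from the first command's output:

ps -C php-fpm7.4 --sort=-pcpu -o pid,pcpu,etime,cmd | head    # find a busy worker
sudo strace -p <PID> -f -tt -T                                # follow its syscalls with timings
sudo gdb -p <PID> -batch -ex 'bt'                             # native backtrace (briefly stops the process)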

Response to common alerts

Average latency exceeded

This alert means something is currently very wrong, and MediaWiki is responding to clients at an unusually slow pace. This can be due to a number of reasons, but slow responses from all servers typically mean some backend system is responding slowly. Typical troubleshooting goes as follows:

  • Check the application server RED - k8s dashboard in the panels "mcrouter" and "databases" to quickly see if anything stands out
  • Check SAL for any deployments corresponding to the time of the alert or a few minutes earlier. If there is any, request a rollback while you keep debugging. Worst case scenario, the changes will have to be deployed again, but in many cases you'll have the resolution of the outage.
  • ssh to one server in the cluster that is experiencing the issue. Check the last entries in the php-fpm slowlog (located at /var/log/php7.4*-slowlog.log); a sketch for summarizing it follows this list. If all the requests you see popping up are blocked in a specific function, that should give you a pointer to what isn't working: caches, databases, backend services.
  • For databases, check the slow query dashboard on Logstash.
  • For caches, check the memcached dashboards on Grafana.
  • For curl requests, check the envoy telemetry dashboard - set the origin cluster to the cluster where you're seeing latency (excluding local_port_XX, which points to the local appserver).
  • If none of the above works, escalate the problem to the wider team
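A rough way to see which functions slow requests are stuck in, using the slowlog path from the Logging section above (the stack-frame format assumed here is php-fpm's usual "[0x...] function() file:line"):

sudo grep -hoP '^\[0x[0-9a-f]+\] \K[^ ]+' /var/log/php7.4-fpm-www-7.4-slowlog.log \
    | sort | uniq -c | sort -rn | head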

PHP7 rendering

This alert comes from trying to render a page on enwiki using php7 (not HHVM). Since the request goes through apache httpd, first check if apache is alerting as well, then look at opcache alerts. If there is a critical alert on opcache too, look at the corresponding section below. If only the php7 rendering is alerting, check the following:

  • What does the php-fpm log say? Any specific errors repeating right now?
 
$ tail -f /var/log/php7.4-fpm/error.log | fgrep -v '[NOTICE]'
Jun  0 00:00:00 server php7.4-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging
...

For example, if you see a lot of slow requests from a specific PID, it might be interesting to strace it. If some strange and unique error message is present, the opcache is probably corrupted. In that case, confirm by resetting the opcache and verifying that the problem goes away (a sketch follows). Check whether we have an open ticket about opcache corruption and record the occurrence there.
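A minimal sketch of that check, assuming the usual depool/pool conftool wrappers are present on the host:

$ sudo -i depool                            # take the server out of rotation first
$ php7adm /opcache-free                     # clear the opcache (see the php7adm section above)
$ tail -f /var/log/php7.4-fpm/error.log     # verify the strange error stops recurring
$ sudo -i pool                              # repool once things look healthy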

  • What can you see looking at non-200 responses coming from php-fpm in the apache log? Any trend? Anything that stands out?
# This will just show 5xx errors, nothing else.
 $ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
  • If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure
$ sudo -i depool

IMPORTANT: Remember to repool the server afterwards.

If this happens once, and on just one server, I suggest just restarting php-fpm

$ sudo -i /usr/local/sbin/restart-php7.4-fpm

and watch the logs/icinga to see if the issue has subsided. If the issue is on more than just one server, escalate to the SRE team responsible for the service.

PHP7 Too Busy, not enough idle workers

This means that we don't have enough idle workers in the mentioned cluster (api/appservers), which causes request queuing and therefore user-visible latency.

If the idle worker pool is exhausted, an incident affecting all wikis will ensue, with possible domino effects. It can happen due to a variety of factors, including bad MediaWiki releases, bad configuration, or problems with the services and backends we reach out to (memcache, databases, other APIs, etc.). It is also highly dependent on traffic load.

  • Check the application server RED dashboard in the panels "mcrouter" and "databases" to quickly see if anything stands out
  • Check SAL for any deployments corresponding to the time of the alert or a few minutes earlier. If there is any, request a rollback while you keep debugging. Worst case scenario, the changes will have to be deployed again, but in many cases you'll have the resolution of the outage.
  • ssh to one server in the cluster that is experiencing the issue. Check the last entries in the php-fpm slowlog (located at /var/log/php7.4*-slowlog.log). If all the requests you see popping up are blocked in a specific function, that should give you a pointer to what isn't working: caches, databases, backend services.
  • For databases, check the slow query dashboard on Logstash.
  • For caches, check the memcached dashboards on Grafana.
  • For curl requests, check the envoy telemetry dashboard - set the origin cluster to the cluster where you're seeing latency (excluding local_port_XX, which points to the local appserver).
  • If this is a problem in the MW-on-k8s cluster specific to Wikifunctions, it may be due to abuse there; in an emergency, you can follow the runbook to disable function execution there.
  • If none of the above works, escalate the problem to the wider team

Videoscalers

For historical reasons, videoscalers and jobrunners used to be the same cluster. At the time of this writing (2024-04-15) this is no longer true. Jobrunners have moved to MediaWiki on Kubernetes. For a while, as we fix docs and alerts, you might see out of date information or links.

Sometimes their performance is impacted by an overwhelming number of video encodes. Quick diagnostic: many (100+) ffmpeg processes running on a videoscaler server, icinga checks timing out, etc.

You should also look at the overall health of the jobrunner server group (not split by jobrunner vs videoscaler).


You can log into the host and kill any remaining ffmpeg processes (sudo pkill ffmpeg). The job queue should automatically retry them later.
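A quick way to confirm the diagnosis before (and after) killing the encodes:

pgrep -c ffmpeg                                             # how many encodes are running; 100+ suggests overload
ps -C ffmpeg -o pid,pcpu,etime,args --sort=-etime | head    # the longest-running ones
sudo pkill ffmpeg                                           # the job queue will retry the interrupted encodes later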

Monitoring:

  • [1] Look for Job run duration p99.
  • [2] Host profile for a videoscaler host during an incident.

Tasks:

  • [3] April 12 2021 video uploads
  • [4] Videoscaler Overload Incident