Application servers/Runbook

From Wikitech

Apache/nginx

TODO.

HHVM

TODO. However, we hope to be able to remove this soon!

Mcrouter

TODO.

PHP 7

PHP 7 is the interpreter we use for serving MediaWiki. This page collects resources on how to troubleshoot and fix some potential issues with it. For more general information about how we serve MediaWiki in production, see the page about Application servers.

Logging

Dashboards

Debugging procedures and tools

php7adm

php7adm is a tool that lets you interact with the local php-fpm daemon to gather information on the running status of the application. Its usage is simple:

$ php7adm [OPTION]

To see a list of available actions, just run the command without arguments:

$ php7adm 
Supported urls:

  /metrics         Metrics about APCu and OPcache usage
  /apcu-info       Show basic APCu stats
  /apcu-meta       Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
  /apcu-free       Clear all data from APCu
  /opcache-info    Show basic opcache stats
  /opcache-meta    Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
  /opcache-free    Clear all data from opcache

All output, apart from that of the /metrics endpoint, is reported in JSON format.
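Before clearing a cache or restarting, it can help to keep a copy of the current stats. The following is a minimal sketch, not an existing tool: the function name snapshot_php_info and the PHP7ADM variable are hypothetical, and it only calls the two read-only endpoints listed above.

```shell
# Sketch: save the output of the read-only JSON endpoints into a directory.
# PHP7ADM is a hypothetical override so the command can be swapped out.
PHP7ADM=${PHP7ADM:-php7adm}

snapshot_php_info() {
    local dir=$1 endpoint
    mkdir -p "$dir" || return 1
    # Only endpoints that show stats; the -free endpoints clear data,
    # and the -meta endpoints write their own dump files under /tmp.
    for endpoint in apcu-info opcache-info; do
        "$PHP7ADM" "/$endpoint" > "$dir/$endpoint.json" || return 1
    done
}

# Usage (not run here):
#   snapshot_php_info "/tmp/php-snapshot-$(date +%s)"
```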

Low-level debugging

php-fpm is a prefork-style appserver, which means that every child process serves just one request at a time. Attaching strace to an individual process should therefore tell you a lot about what that request is doing. We don't yet have an automated dumper of stack traces from php-fpm, but you can use quickstack as usual for a quick peek at the stack traces, or gdb for more details.
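To pick a worker to attach to, you first need its PID. The sketch below assumes php-fpm workers set their process title to "php-fpm: pool <name>" and that the pool is called www (as in the log lines on this page); the function names are hypothetical.

```shell
# Read "PID ARGS" lines on stdin and print the PIDs of php-fpm workers
# that belong to the given pool (default: www).
# Assumes workers set their process title to "php-fpm: pool <name>".
filter_fpm_workers() {
    local pool=${1:-www}
    awk -v pool="$pool" \
        '$2 == "php-fpm:" && $3 == "pool" && $4 == pool { print $1 }'
}

# List worker PIDs on this host for the given pool.
list_fpm_workers() {
    ps -eo pid=,args= | filter_fpm_workers "$@"
}

# Usage (not run here):
#   list_fpm_workers www
#   sudo strace -tt -T -s 256 -p <PID>   # timestamps, per-syscall duration
```

Matching on the first words of the process title, rather than grepping the whole command line, keeps the helper from matching its own awk process.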

Response to common alerts

PHP7 opcache health

This alert can arise in three different scenarios:

  1. The opcache is full
  2. The opcache hit ratio is too low
  3. The opcache has little free space

It's quite possible for multiple servers to get the same alert at the same time. Deployments are what use up opcache, so usage grows in step on servers that were last restarted before the same deployment. If you want to know more about what's going on, you can fetch the info yourself:

$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
...
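For a quick yes/no answer on the first scenario, the cache_full field shown above can be tested directly. This is a sketch that assumes jq is installed; opcache_not_full is a hypothetical helper name, and it only looks at the field visible in the sample output.

```shell
# Succeeds (exit 0) when the status JSON on stdin reports cache_full == false.
# Fails (exit 1) when cache_full is true or the field is missing.
opcache_not_full() {
    jq -e '.cache_full == false' > /dev/null
}

# Usage (not run here):
#   php7adm /opcache-info | opcache_not_full || echo "opcache is full"
```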

We should have a cron job checking for these conditions and doing the work for us sooner rather than later. Until then, you can fix the issue by safely restarting php-fpm:

$ sudo -i /usr/local/sbin/restart-php7.2-fpm

Be careful if you're restarting multiple servers this way: restart-php7.2-fpm depools the server, restarts php-fpm, then repools the server, so every server being restarted is out of rotation for a while. Never run the restart on more than 10% of the servers in a cluster at the same time.
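The 10% rule can be worked out mechanically when you have a list of hosts to restart. The sketch below only computes and prints the batches; the helper names are hypothetical, and how you run the restart on each host of a batch is left to your usual tooling. It assumes the list covers the whole cluster, so that 10% of the list equals 10% of the cluster.

```shell
# Largest number of hosts to restart at once: 10% of the cluster, rounded
# down, but at least 1 so that clusters under 10 hosts still make progress.
max_batch() {
    local total=$1
    local size=$(( total / 10 ))
    [ "$size" -lt 1 ] && size=1
    echo "$size"
}

# Read hostnames on stdin (one per line) and print them grouped into
# batches of at most the given size.
print_batches() {
    local size=$1 n=0 batch=1 host
    while IFS= read -r host; do
        [ -z "$host" ] && continue
        if [ "$n" -eq "$size" ]; then
            batch=$(( batch + 1 ))
            n=0
        fi
        echo "batch $batch: $host"
        n=$(( n + 1 ))
    done
}

# Usage (not run here), with a hypothetical hosts.txt:
#   print_batches "$(max_batch "$(wc -l < hosts.txt)")" < hosts.txt
```

Finish one batch, and check that its servers are pooled again, before starting the next.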

PHP7 rendering

This alert fires when a check fails to render a page on enwiki using PHP 7 (not HHVM). Since the request goes through Apache httpd, first check whether Apache is alerting as well. If it is, something else might be at play (for example, HHVM being stuck and using up all of httpd's connection slots). Then look at the opcache alerts. If there is also a critical alert on opcache, see the "PHP7 opcache health" section above. If only the PHP7 rendering check is alerting, check the following:


  • What does the php-fpm log say? Any specific errors repeating right now?
 
$ tail -f /var/log/php7.2-fpm/error.log | fgrep -v '[NOTICE]'
Jun  0 00:00:00 server php7.2-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging
...

For example, if you see a lot of slow requests from a specific PID, it might be worth running strace on it. If some strange and unusual error message is present, the opcache is probably corrupted. In that case, confirm by resetting opcache (php7adm /opcache-free) and verifying that the problem goes away. Check whether we have an open ticket about opcache corruption and record the occurrence there.
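To see whether the slow requests cluster on one worker, you can count the "executing too slow" warnings per child PID. This is a sketch that assumes the log lines follow the format shown above; slow_requests_by_pid is a hypothetical helper name.

```shell
# Count "executing too slow" warnings per child PID from log lines on stdin.
# Prints "COUNT PID" lines, busiest PID first.
slow_requests_by_pid() {
    grep 'executing too slow' \
        | sed -n 's/.*\] child \([0-9][0-9]*\),.*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (not run here):
#   tail -n 5000 /var/log/php7.2-fpm/error.log | slow_requests_by_pid
```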


  • What can you see in the non-200 responses that come from php-fpm in the Apache log? Any trend? Does anything stand out?
# This will just show 5xx errors, nothing else.
$ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
  • If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure:
$ sudo -i depool

IMPORTANT: remember to also repool it afterwards.
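One way to avoid forgetting the repool is to wrap the debugging session in a small function. This is only a sketch: run_depooled is a hypothetical helper, the depool command is the one shown above, and the repool command ("sudo -i pool") is an assumption that you should replace with whatever you normally use to repool.

```shell
# Sketch: run a command with the server depooled, then always repool.
DEPOOL_CMD=${DEPOOL_CMD:-"sudo -i depool"}
REPOOL_CMD=${REPOOL_CMD:-"sudo -i pool"}   # assumption, adjust as needed

run_depooled() {
    local status
    # Refuse to continue if depooling fails.
    $DEPOOL_CMD || return 1
    "$@"
    status=$?
    # Repool regardless of how the wrapped command exited.
    $REPOOL_CMD || echo "WARNING: repool failed, repool manually" >&2
    return "$status"
}

# Usage (not run here):
#   run_depooled sudo strace -tt -T -p <PID>
```

The wrapper returns the exit status of the wrapped command, so a failed debugging step is still visible after the repool.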


If this happens once, and on just one server, just restart php-fpm

$ sudo -i /usr/local/sbin/restart-php7.2-fpm

and watch the logs and Icinga to see whether the issue has gone away. If the issue affects more than one server, escalate to the SRE team responsible for the service.