Application servers/Runbook

From Wikitech

Apache

Testing config

  • Choose one of the mediawiki debug servers. Then, on that server:
    • Disable puppet: sudo puppet agent --disable 'insert reason'
    • Apply change locally under /etc/apache2/sites-enabled/
    • sudo apache2ctl restart
  • Test your change by making the relevant HTTP requests. See Debugging in production for how.
  • When you're done, sudo puppet agent --enable
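
The restart step above can be guarded with a syntax check. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and running apache2ctl configtest before the restart is an added precaution rather than part of the procedure described here.

```shell
# Hypothetical helpers for a mediawiki debug server.

# Step 1: keep puppet from reverting local edits while testing.
begin_apache_test() {
    if [ -z "$1" ]; then
        echo "usage: begin_apache_test 'reason'" >&2
        return 2
    fi
    sudo puppet agent --disable "$1"
}

# Step 2 (after editing files under /etc/apache2/sites-enabled/):
# restart only if the configuration parses.
restart_apache_if_valid() {
    if ! sudo apache2ctl configtest; then
        echo "Syntax check failed; not restarting Apache." >&2
        return 1
    fi
    sudo apache2ctl restart
}

# Step 3: hand the host back to puppet.
end_apache_test() {
    sudo puppet agent --enable
}
```

Splitting the procedure into three functions keeps the manual edit between steps 1 and 2, so puppet is already disabled when the file is changed.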

Deploying config

Consider scheduling any configuration update on the Deployments page: a bad configuration going live can easily result in a site outage.

  • Test your change in deployment-prep and make sure that it works as expected.
  • In the operations/puppet repository, make your change in the modules/mediawiki/files/apache/sites directory.
  • In the same commit, add one or more httpbb tests in the modules/profile/files/httpbb directory, asserting that your change works as you intend. (Consider automating the same checks you just performed by hand.)
    • For example, if you are adding or modifying a RewriteRule, please add tests covering some URLs that are expected to change.
  • On deploy1001, run all httpbb tests on an affected host. Neither your changes nor your new tests are in effect yet, so any test failures are unrelated. All tests are expected to pass -- if they don't, you should track down and fix the problem before continuing.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
    Sending to mwdebug1001.eqiad.wmnet...
    PASS: 99 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
    
  • Submit your change and tests to gerrit as a single commit.
  • Disable puppet across the affected mediawiki application servers.
    • Cumin can help in finding the precise set of hosts. For example, this is a recent query:
      cumin 'R:File = "/etc/apache2/sites-available/04-remnant.conf"' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/#/c/380774/"' -b 10
      
      In this case the change was a RewriteRule change in 04-remnant.conf; adapt the query each time to the file(s) modified by the Gerrit change.
  • Merge via Gerrit and run the usual puppet-merge on puppetmaster1001.
  • Go to one of the mwdebug servers and enable/run puppet. Apache will reload its configuration automatically; check that no error messages are emitted. Running apachectl -t after the puppet run helps verify that the new configuration is syntactically correct (this does not imply that it will work as intended).
    • Some Apache directive changes need a full restart, not just a reload, to take effect. These changes are very rare and are clearly indicated in Apache's documentation, so check beforehand. Simple RewriteRule changes require only an Apache reload.
  • On deploy1001, re-run all httpbb tests on an affected host. Your new tests verify that your intended change is functioning correctly, and re-running the old tests verifies that existing behavior wasn't inadvertently changed in the process. All tests are expected to pass -- if they don't, you should revert your change.
    rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
    Sending to mwdebug1001.eqiad.wmnet...
    PASS: 101 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
    
  • Depool another mediawiki application server that is taking traffic via confctl, then enable/run puppet on it. Verify again from deploy1001, by running httpbb, that everything is working as expected.
  • Repool the host mentioned above and verify in the Apache access logs that everything looks fine. If you want to be extra paranoid, you can check the host level metrics via https://grafana.wikimedia.org/d/000000327/apache-hhvm?orgId=1 and make sure that nothing is out of the ordinary.
  • Re-enable puppet across the appservers previously disabled via cumin.
  • Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.
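
The two httpbb runs in the procedure above (before and after the change) can be compared mechanically. This is a minimal sketch with hypothetical function names. It assumes the summary line always looks like the sample output shown above ("PASS: N requests sent to ..."), and it does not try to parse the output of a failing run.

```shell
# Print the request count from an httpbb summary read on stdin.
# Returns non-zero if no "PASS:" summary line is found.
httpbb_pass_count() {
    sed -n 's/^[[:space:]]*PASS: \([0-9][0-9]*\) requests sent to .*/\1/p' | grep .
}

# Succeed only if both saved runs passed and the second run sent more
# requests, i.e. the newly added tests were actually picked up.
# $1: file with the output of the first run, $2: output of the second run.
new_tests_ran() {
    before=$(httpbb_pass_count < "$1") || return 1
    after=$(httpbb_pass_count < "$2") || return 1
    [ "$after" -gt "$before" ]
}
```

To use it, redirect each httpbb run to a file and pass both files to new_tests_ran. A commit that only modifies existing tests will not increase the count, so treat a failure here as a prompt to look closer, not as proof of a problem.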


Nginx

TODO.

Mcrouter

TODO.

Envoy

Envoy is used for proxying calls from MediaWiki to external services. It's a resilient service and should not usually fail. Some quick pointers:

  • Logs are under /var/log/envoy.
  • Daemon logs are in /var/log/envoy/syslog.log (or run sudo journalctl -u envoyproxy.service).
  • Verify that the configuration is valid: sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate.
  • Envoy uses a hot restarter that allows seamless restarts without losing a request. Use systemctl reload envoyproxy.service unless you really know why that wouldn't work.
  • You can check the status of Envoy and a lot of other information under http://localhost:9631. Particularly useful is /stats, which returns the current stats. Refer to the admin interface docs for details.

If you see an error about runtime variables being set, reloading envoy should solve the alert in a few minutes.
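
The validation and reload commands from the pointers above can be chained so that a broken configuration is never reloaded. This is a minimal sketch: the function name is hypothetical, and it assumes the validation flag is spelled --mode validate, as in Envoy's command line documentation.

```shell
# Validate the Envoy configuration and reload only if validation succeeds.
# $1: optional path to the config file (defaults to /etc/envoy/envoy.yaml).
reload_envoy_if_valid() {
    config=${1:-/etc/envoy/envoy.yaml}
    if sudo -u envoy /usr/bin/envoy -c "$config" --mode validate; then
        # A reload uses the hot restarter, so no requests are lost.
        sudo systemctl reload envoyproxy.service
    else
        echo "Envoy config $config failed validation; not reloading." >&2
        return 1
    fi
}
```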

PHP 7

PHP 7 is the interpreter we use for serving mediawiki. This page collects resources on how to troubleshoot and fix some potential issues with it. For more general information about how we serve mediawiki in production, see the page about Application servers.

Logging

Dashboards

Debugging procedures and tools

php7adm

php7adm is a tool for interacting with the local php-fpm daemon to gather information on the running status of the application. Its usage is pretty simple:

$ php7adm [OPTION]

To see a list of available actions, just run the command without arguments:

$ php7adm 
Supported urls:

  /metrics         Metrics about APCu and OPcache usage
  /apcu-info       Show basic APCu stats
  /apcu-meta       Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
  /apcu-free       Clear all data from APCu
  /opcache-info    Show basic opcache stats
  /opcache-meta    Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
  /opcache-free    Clear all data from opcache

All data, apart from the /metrics endpoint, are reported in JSON format.

Low-level debugging

php-fpm is a prefork-style appserver, which means that every child process serves just one request at a time, so attaching strace to an individual process should give you a lot of information about what is going on there. We still don't have an automated dumper of stack traces from php-fpm, but you can use quickstack as usual for a quick peek at the stack traces, or gdb for more details.

Response to common alerts

PHP7 opcache health

This alert can arise in three different scenarios:

  1. The opcache is full
  2. The opcache hit ratio is too low
  3. The opcache has little free space

It's quite possible for multiple servers to raise the same alert at the same time. Deployments are what use up opcache, so usage moves in step on servers that were last restarted before the same deployment. If you want to know more about what's going on, you can fetch the info yourself:

$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
...
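
The three scenarios listed above can be checked from that JSON with a small helper. This is only a sketch under loud assumptions: the function names are hypothetical; the thresholds (10% free memory, 99% hit ratio) are placeholders and not the values the alert uses; and the fields memory_usage.free_memory, used_memory, wasted_memory and opcache_statistics.opcache_hit_rate are assumed to be present because they are part of PHP's opcache_get_status() output, which the fields shown above appear to mirror.

```shell
# Classify opcache health from five values:
#   $1 cache_full (true/false)  $2 free bytes  $3 used bytes
#   $4 wasted bytes             $5 hit rate in percent
# Exit status: 0 OK, 1 WARNING, 2 CRITICAL.
# MIN_FREE_PCT and MIN_HIT_RATE are hypothetical thresholds.
opcache_verdict() {
    awk -v full="$1" -v free="$2" -v used="$3" -v wasted="$4" -v hit="$5" \
        -v min_free="${MIN_FREE_PCT:-10}" -v min_hit="${MIN_HIT_RATE:-99}" '
    BEGIN {
        total = free + used + wasted
        if (full == "true") { print "CRITICAL: opcache is full"; exit 2 }
        if (hit + 0 < min_hit + 0) {
            printf "WARNING: hit ratio %.2f%% is below %s%%\n", hit, min_hit
            exit 1
        }
        if (total > 0 && 100 * free / total < min_free + 0) {
            printf "WARNING: only %.1f%% of opcache memory is free\n", 100 * free / total
            exit 1
        }
        print "OK"
    }'
}

# Fetch the values from the local php-fpm (field names are assumptions).
check_local_opcache() {
    # Word splitting of the jq output is intended here.
    set -- $(php7adm /opcache-info | jq -r \
        '"\(.cache_full) \(.memory_usage.free_memory) \(.memory_usage.used_memory) \(.memory_usage.wasted_memory) \(.opcache_statistics.opcache_hit_rate)"')
    if [ "$#" -ne 5 ]; then
        echo "could not read opcache info" >&2
        return 3
    fi
    opcache_verdict "$@"
}
```

Keeping the classification separate from the data fetch means the thresholds can be tried out with made-up numbers on any machine, without php7adm or jq.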

While we should have a cron checking for these conditions and doing the work for us sooner rather than later, you can still fix the issue by safely doing a restart of php-fpm:

$ sudo -i /usr/local/sbin/restart-php7.2-fpm

Be careful if you're restarting multiple servers this way, as restart-php7.2-fpm depools the server, restarts php-fpm, then repools the server. You should never run restart on more than 10% of the servers in a cluster at the same time.
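
The 10% rule above translates into a batch size. This is a minimal sketch with hypothetical function names: it only plans the batches and leaves running the restart to whatever tool you use (the cumin example earlier on this page passes a batch size with -b).

```shell
# Largest batch that stays within 10% of a cluster of $1 hosts.
# For clusters of fewer than 10 hosts this falls back to 1, which is
# more than 10%: restart those one at a time and watch the pool.
max_batch_size() {
    size=$(( $1 / 10 ))
    if [ "$size" -lt 1 ]; then
        size=1
    fi
    echo "$size"
}

# Read hostnames (one per line) on stdin and print one batch per line,
# with at most $1 hosts in each batch.
split_batches() {
    xargs -n "$1"
}
```

For example, max_batch_size 47 prints 4, so a cluster of 47 servers is restarted at most 4 hosts at a time.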

PHP7 rendering

This alert comes from trying to render a page on enwiki using php7 (not HHVM). Since the request goes through Apache httpd, first check whether Apache is alerting as well. If it is, something else might be at play (for example, HHVM being stuck and using up all of httpd's connection slots). Then look at opcache alerts. If there is a critical alert on opcache too, see the corresponding section above. If only the php7 rendering is alerting, check the following:


  • What does the php-fpm log say? Any specific errors repeating right now?
 
$ tail -f /var/log/php7.2-fpm/error.log | fgrep -v '[NOTICE]'
Jun  0 00:00:00 server php7.2-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging
...

For example, if you see a lot of slow requests from a specific PID, it might be interesting to strace it. If some strange and unique error message is present, the opcache is probably corrupted. In that case, confirm by resetting opcache and verifying that the problem goes away. Check whether we have an open ticket about opcache corruption and record the occurrence there.


  • What can you see looking at non-200 responses that come from php-fpm in the Apache log? Any trend? Does anything stand out?
# This will just show 5xx errors, nothing else.
 $ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
  • If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure
$ sudo -i depool

IMPORTANT: remember to also repool it afterwards.


If this happens once, and on just one server, the suggestion is to just restart php-fpm

$ sudo -i /usr/local/sbin/restart-php7.2-fpm

and watch the logs/Icinga to see whether the issue has subsided. If the issue is on more than just one server, escalate to the SRE team responsible for the service.
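
When looking for trends in the Apache log as described above, a per-status breakdown is often easier to read than raw lines. This is a minimal sketch with a hypothetical function name. It assumes each proxied request's log line contains fcgi://localhost/ followed by the three-digit status, which is what the fgrep pattern above relies on.

```shell
# Count php-fpm responses per HTTP status from Apache access-log lines
# read on stdin. Output: "<status> <count>", most frequent first.
fcgi_status_counts() {
    grep -oE 'fcgi://localhost/[0-9]{3}' \
        | cut -d/ -f4 \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'
}
```

For example: tail -n100000 /var/log/apache2/other_vhosts_access.log | fcgi_status_counts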

Service Ops

Adding a new server into production

  • Create a DNS patch to assign IP addresses to the new servers. This is usually done by dcops nowadays but they might want your review for it. (example change)
  • Create a puppet patch that adds the servers with the right regexes in site.pp. Apply the spare::system puppet role. (example change)
  • Create mcrouter certs, merge them in the private puppet repo on the puppetmaster (as of today puppetmaster1001).
  • Create a patch to add fake certs in the labs/private repo. Merge it. In the labs/private repo you also have to add the V+2 yourself, as there is no Jenkins. (example change)
  • Decide which role this server should have (appserver, API appserver, jobrunner, ...). Use Netbox to search for the host and see which rack it is in. Try to balance server roles across both racks and rows.
  • Create a puppet patch that adds the proper role to the servers and adds them in conftool-data in the right section. Don't merge it yet. (example change)
  • Disable puppet on Icinga to avoid Icinga alert spam. ex: [icinga1001:~] $ sudo puppet agent --disable <reason/ticket ID>
  • Schedule Icinga downtimes for your new hosts for 1h. ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet
  • Merge the patch to add puppet roles to the new servers.
  • Force a puppet run via cumin. Some errors are normal in the first puppet run. ex: dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q'
  • Force a second puppet run via cumin. It should complete successfully.
  • Re-enable puppet on icinga: [icinga1001:~] $ sudo puppet agent --enable
  • Run downtime with force-puppet-run via cumin ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet
  • Watch all (new) Icinga alerts on the hosts turn green, make sure Apache does not have to be restarted. You can "reschedule next service check" in the Icinga web UI to speed things up. It is expected that the "not in dsh group" alert stays CRIT until the server is pooled below. Once all alerts besides that one are green (not PENDING and not CRIT) it is ok to go ahead.
  • Check for ongoing deployments. Wait if that is the case. You can use "jouncebot: now" on IRC and/or the Deployment page on Wikitech wiki.
  • Run "scap pull" on the new servers to ensure the latest deployed MediaWiki version is present.
  • Give the server a weight with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1355.eqiad.wmnet set/weight=30
  • Pool the server with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet set/pooled=yes
  • Open the Grafana Host Overview dashboard, select the server and confirm that it is getting traffic.
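
The last two confctl steps in the list above always go together, so they can be wrapped to make sure they target the same host. This is a minimal sketch, to be run from a cumin host. The function name is hypothetical, and the default weight of 30 is simply the value used in the example above.

```shell
# Give a new server a weight and then pool it.
# $1: fully qualified hostname, $2: optional weight (default 30).
pool_new_host() {
    host=$1
    weight=${2:-30}
    if [ -z "$host" ]; then
        echo "usage: pool_new_host HOST [WEIGHT]" >&2
        return 2
    fi
    # Do not pool the host if setting the weight failed.
    sudo -i confctl select "name=$host" "set/weight=$weight" || return 1
    sudo -i confctl select "name=$host" set/pooled=yes
}
```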

Spreading application servers out across rows and racks

We are aiming to spread out application server roles (regular appserver, API appserver, etc) across both rows (ex. B) as well as racks (ex. B3) in each of the main data centers (currently eqiad and codfw).

Our new pattern to achieve this is to alternate between appserver and API appserver in each row, where odd-numbered hosts are appservers and even-numbered hosts are API appservers.

example:

mw1385 - appserver  - rack A5
mw1386 - API server - rack A5
mw1387 - appserver  - rack A5
mw1388 - API server - rack A5
..

In puppet's site.pp this results in a structure with regexes like this:

## DATACENTER: EQIAD
..
# Appservers
# Row A
..
# rack A5
node /^mw13(8[579]|91)\.eqiad\.wmnet$/ {
    role(mediawiki::appserver)
}
...
# rack A5
node /^mw13(8[68]|9[02])\.eqiad\.wmnet$/ {
    role(mediawiki::appserver::api)
}

# Row B
...

## DATACENTER: CODFW
..
# Appservers
# Row A
..
# rack A4

In this example rack A5 is split across the 2 roles and ideally the same pattern should repeat for each rack in each row in each datacenter.
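
Before merging a site.pp change like this, it is easy to check which hosts each regex actually matches. This is a minimal sketch that copies the two rack A5 regexes from the example above. The function and variable names are hypothetical, and grep -E is used here as a stand-in for Puppet's regex matching, which behaves the same for patterns this simple.

```shell
# The two rack A5 regexes from the site.pp example.
APPSERVER_RE='^mw13(8[579]|91)\.eqiad\.wmnet$'
API_RE='^mw13(8[68]|9[02])\.eqiad\.wmnet$'

# Print the role a hostname would get, or fail if neither regex matches.
role_for_host() {
    if printf '%s\n' "$1" | grep -Eq "$APPSERVER_RE"; then
        echo 'mediawiki::appserver'
    elif printf '%s\n' "$1" | grep -Eq "$API_RE"; then
        echo 'mediawiki::appserver::api'
    else
        echo "no role matches $1" >&2
        return 1
    fi
}
```

With these regexes, mw1385, mw1387, mw1389 and mw1391 resolve to the appserver role, and mw1386, mw1388, mw1390 and mw1392 to the API role, which is the odd/even alternation described above.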

Removing old appservers from production (decom)

  • Identify servers you want to decom in netbox. The procurement ticket linked from there tells you the purchase date to see how old they are.
  • Create a Gerrit patch that removes the servers from site.pp and conftool-data (example change), but don't merge it yet.
  • Set the servers to 'pooled=no' and watch in Grafana how they stop serving traffic, temperature goes down etc. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw123[2-5].eqiad.wmnet' set/pooled=no
  • Use the downtime cookbook to schedule monitoring downtimes for the servers. Give a reason and link to your decom ticket. ex: [cumin1001:~] $ sudo cookbook sre.hosts.downtime -r decom -t T247780 -H 2 mw125[0-3].eqiad.wmnet
  • If everything seems fine, set the servers to 'pooled=inactive' now. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw125[0-3].eqiad.wmnet' set/pooled=inactive
  • If you are sure, run the actual decom cookbook now. This step is destructive so you will have to reinstall servers to revert. ex: [cumin1001:~] $ sudo cookbook sre.hosts.decommission mw125[0-3].eqiad.wmnet -t T247780
  • Merge your prepared puppet change to remove them from site.pp and conftool-data.
  • Optional: Run puppet on Icinga and see the servers and services on them disappear from monitoring.
  • Optional: Confirm in Netbox that the state of the servers is now "decommissioning".
  • Create and merge a change in the puppet repo to remove the servers from DHCP config (and check for other occurrences of the hostnames).
  • Check if any of the servers was an mcrouter proxy (hieradata/common/mcrouter.yaml) or a scap proxy (hieradata/common/scap/dsh.yaml). Remove if needed. (example change)
  • Create and merge a change in the DNS repo to remove the production IPs and mgmt IPs while keeping the asset tag names for the mgmt interfaces.
  • Hand over the decom ticket to dcops for physical unracking and the final steps in the server lifecycle.
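
The checks for leftover hostnames in the list above (DHCP config, mcrouter.yaml, dsh.yaml) can be done in one pass over a checkout of the puppet repo. This is a minimal sketch with a hypothetical function name. It uses plain recursive grep, and the pattern is an extended regular expression, so a cumin-style range such as mw125[0-3] happens to work as a character class.

```shell
# List every remaining reference to the decommissioned hosts.
# $1: extended regex matching the hostnames
# $2: path to a repository checkout (defaults to the current directory)
# Returns non-zero when no reference is left.
find_host_references() {
    pattern=$1
    repo=${2:-.}
    grep -rnE -- "$pattern" "$repo"
}
```

For example, find_host_references 'mw125[0-3]\.eqiad\.wmnet' prints each match as file:line:text, and an empty result with a non-zero exit status means the hostnames are gone from the checkout.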