Traffic/Cookbooks
CDN
roll-restart-varnish.py
Purpose
Roll restart Varnish frontend based on parameters
How to Use
Roll restart Varnish frontend based on parameters.
Example usage:
cookbook sre.cdn.roll-restart-varnish --alias cp-text_codfw --reason 'Emergency restart' \
--grace-sleep 30 restart_daemons
cookbook sre.cdn.roll-restart-varnish --query 'A:cp-eqiad and not P{cp1001*}' --reason 'Emergency restart' \
--batchsize 2 restart_daemons --threads-limited 100000
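The examples above combine --batchsize and --grace-sleep. A reasonable mental model, which is an assumption rather than the cookbook's actual implementation, is that hosts are processed in consecutive batches of --batchsize hosts and the run waits --grace-sleep seconds between one batch and the next. The sketch below models only that scheduling. The helpers `plan_batches` and `rolling_run` and the cp200x host names are hypothetical, and the sleep function is injectable so the sketch runs without real hosts.

```python
import time
from typing import Callable, List, Sequence


def plan_batches(hosts: Sequence[str], batchsize: int) -> List[List[str]]:
    """Split hosts into consecutive batches of at most `batchsize` hosts."""
    if batchsize < 1:
        raise ValueError("batchsize must be >= 1")
    return [list(hosts[i:i + batchsize]) for i in range(0, len(hosts), batchsize)]


def rolling_run(
    hosts: Sequence[str],
    batchsize: int,
    grace_sleep: float,
    action: Callable[[List[str]], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Hypothetical runner: act on each batch, pausing `grace_sleep` seconds between batches."""
    batches = plan_batches(hosts, batchsize)
    for index, batch in enumerate(batches):
        action(batch)
        # No grace period is needed after the last batch.
        if index < len(batches) - 1:
            sleep(grace_sleep)


if __name__ == "__main__":
    done: List[List[str]] = []
    sleeps: List[float] = []
    hosts = ["cp2001", "cp2002", "cp2003", "cp2004", "cp2005"]  # hypothetical host names
    rolling_run(hosts, batchsize=2, grace_sleep=30, action=done.append, sleep=sleeps.append)
    print(done)    # [['cp2001', 'cp2002'], ['cp2003', 'cp2004'], ['cp2005']]
    print(sleeps)  # [30, 30]
```

With five hosts and --batchsize 2 this yields three batches and two grace periods; no pause follows the final batch.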
Flags
- --threads-limited
- Restart Varnish only if the variation of the varnish_main_threads_limited metric over the last 10 minutes is above the given threshold.
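The --threads-limited gate compares a 10-minute variation of a counter against a threshold. The sketch below shows one way to compute that variation from raw samples. It assumes that varnish_main_threads_limited behaves as a monotonically increasing counter sampled as (timestamp, value) pairs, and it treats any drop as a counter reset (for example after a Varnish restart), without Prometheus-style extrapolation. The real cookbook presumably queries a metrics backend instead. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)

WINDOW_SECONDS = 10 * 60  # "the last 10 minutes"


def counter_variation(samples: Sequence[Sample], now: float, window: float = WINDOW_SECONDS) -> float:
    """Increase of a counter within [now - window, now], tolerating counter resets."""
    in_window: List[Sample] = sorted(s for s in samples if now - window <= s[0] <= now)
    total = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter restarted from zero, so the new value is all increase.
        total += cur - prev if cur >= prev else cur
    return total


def should_restart(samples: Sequence[Sample], now: float, threshold: float) -> bool:
    """Hypothetical mirror of the --threads-limited gate: restart only above the threshold."""
    return counter_variation(samples, now) > threshold


if __name__ == "__main__":
    samples = [(0.0, 100.0), (300.0, 50_100.0), (600.0, 200_100.0)]
    print(counter_variation(samples, now=600.0))            # 200000.0
    print(should_restart(samples, 600.0, threshold=100_000))  # True
```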
transfer-purged-positions.py
Purpose
Transfer Kafka consumer offsets for purged topics.
This cookbook is useful when moving purged from one DC to another.
These are the steps performed by the cookbook:
- Depool the host (automatic)
- Disable Icinga notifications (automatic)
- Check that puppet-agent is disabled (pre_scripts)
- Transfer Kafka consumer offsets (_custom_action)
- Enable and run Puppet (puppet agent **must** be disabled first with Cumin) (_custom_action())
- Repool the host (automatic)
How to Use
Roll apply configuration using puppet-agent for purged
This cookbook is useful when moving purged consumers from one Kafka DC to another and positions must be synced across consumer groups.
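Offsets are not comparable between two Kafka clusters, so a consumer position has to be translated rather than copied. One generic technique works through message timestamps, in the spirit of Kafka's offsetsForTimes lookup. It takes the timestamp of the last message consumed in the source cluster and picks the earliest destination offset whose timestamp is at or after it. Whether this cookbook uses that technique is an assumption. The sketch works on in-memory (offset, timestamp) lists instead of a real Kafka client, and its function names are hypothetical.

```python
import bisect
from typing import List, Optional, Sequence, Tuple

Record = Tuple[int, int]  # (offset, timestamp in ms), ordered by offset


def offset_for_time(partition: Sequence[Record], timestamp_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= timestamp_ms (like Kafka's offsetsForTimes).

    Assumes timestamps never decrease as offsets grow, which holds for
    log-append time but not necessarily for producer-set create time.
    """
    timestamps: List[int] = [ts for _, ts in partition]
    index = bisect.bisect_left(timestamps, timestamp_ms)
    if index == len(partition):
        return None  # no message that new yet; a caller might fall back to the log end
    return partition[index][0]


def translate_position(source: Sequence[Record], committed_offset: int,
                       destination: Sequence[Record]) -> Optional[int]:
    """Map a committed offset in `source` to a starting offset in `destination`."""
    # A committed offset points at the *next* message to read,
    # so the last consumed message is committed_offset - 1.
    consumed = [ts for off, ts in source if off == committed_offset - 1]
    if not consumed:
        return None
    return offset_for_time(destination, consumed[0])


if __name__ == "__main__":
    src = [(0, 1000), (1, 2000), (2, 3000)]
    dst = [(10, 900), (11, 1900), (12, 2500), (13, 3100)]
    print(translate_position(src, 2, dst))  # 12
```

Choosing "at or after" the last consumed timestamp can replay a few messages with equal timestamps. For a purge stream, re-reading a handful of purges is usually preferable to skipping some.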
Example usage(s):
cookbook sre.cdn.transfer-purged-positions \
--alias cp-text_codfw \
--dc-to codfw \
--reason 'move purged back to codfw' \
--puppet-reason 'TXXXXXX'
Flags
- --puppet-reason
- Puppet reason. Must match the reason used to disable Puppet via Cumin.
- --dc-to
- Name of the datacenter to switch to. Must be one of the datacenters accepted by the cookbook.
roll-upgrade-ats.py
Purpose
Upgrade ATS on CDN nodes
- Depool
- Install the specified trafficserver version
- Restart trafficserver
- Repool
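The install step targets one exact package version. With apt, an exact-version install uses the package=version syntax (for example trafficserver=9.2.0-1wm1). The sketch below builds such a command and rejects suspicious version strings before they reach a shell. It assumes that the cookbook installs the package through apt-get in this way, and the helper name is hypothetical.

```python
import re
import shlex
from typing import List

# Permissive sanity check for Debian-style versions: optional epoch ("2:"),
# then a version starting with a digit. Not a full Debian policy parser.
_VERSION_RE = re.compile(r"^(\d+:)?[0-9][A-Za-z0-9.+~-]*$")


def pinned_install_command(package: str, version: str) -> List[str]:
    """Build a non-interactive apt-get command that installs an exact package version."""
    if not _VERSION_RE.match(version):
        raise ValueError(f"unexpected version string: {version!r}")
    return ["apt-get", "-y", "install", f"{package}={version}"]


if __name__ == "__main__":
    cmd = pinned_install_command("trafficserver", "9.2.0-1wm1")
    print(shlex.join(cmd))  # apt-get -y install trafficserver=9.2.0-1wm1
```

Returning an argument list instead of a single string keeps the version out of shell parsing entirely; `shlex.join` is used only for display.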
How to Use
Roll upgrade Apache Traffic Server based on parameters.
Example usage:
cookbook sre.cdn.roll-upgrade-ats \
--alias cp-text_codfw \
--reason '9.2.0 upgrade' \
--version '9.2.0-1wm1'
Flags
- --version
- Exact trafficserver package version to install (for example 9.2.0-1wm1).
roll-restart-reboot-ncredir.py
Purpose
NCRedir roll operations cookbook.
How to Use
Cookbook to perform a rolling reboot/restart of NCRedir
Usage example:
cookbook sre.cdn.roll-restart-reboot-ncredir --reason "Rolling reboot to pick up new kernel" reboot
cookbook sre.cdn.roll-restart-reboot-ncredir --reason "Rolling restart to pick up new OpenSSL" restart_daemons
run-puppet-restart-varnish.py
Purpose
Swap port 80 from Varnish to HAProxy, stopping Varnish first and then running Puppet.
This cookbook is useful when managing both Varnish and HAProxy configuration and an ordered restart/reload is needed. It was created to swap the daemon listening on port 80 from Varnish to HAProxy, but it can be used as a template for other tasks too.
These are the steps performed by the cookbook:
- Depool the host (automatic)
- Disable Icinga notifications (automatic)
- Check that puppet-agent is disabled (pre_scripts)
- Stop Varnish service (_custom_action)
- Enable and run Puppet (puppet agent **must** be disabled first with Cumin) (_custom_action())
- Start Varnish service (_custom_action())
- Test that ports (80 and 443) are open (post_action)
- Repool the host (automatic)
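The post_action step checks that ports 80 and 443 are open before the host is repooled. The sketch below performs that kind of check with a plain TCP connect from Python's standard library. The function name is hypothetical, and the real cookbook may run the check differently, for example from the host itself.

```python
import socket
from typing import Dict, Iterable


def ports_open(host: str, ports: Iterable[int] = (80, 443), timeout: float = 3.0) -> Dict[int, bool]:
    """Return {port: bool} depending on whether a TCP connection to host:port succeeds."""
    result: Dict[int, bool] = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                result[port] = True
        except OSError:
            # Refused, unreachable, or timed out (socket.timeout is an OSError subclass).
            result[port] = False
    return result


if __name__ == "__main__":
    # Self-test against a throwaway local listener instead of a real cache host.
    with socket.socket() as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen()
        port = listener.getsockname()[1]
        print(ports_open("127.0.0.1", [port])[port])  # True
```

Repooling only when every port reports True keeps a host whose daemons failed to come back out of rotation.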
How to Use
Roll apply configuration using puppet-agent for both HAProxy and Varnish
Example usage(s):
cookbook sre.cdn.run-puppet-restart-varnish \
--alias cp-text_codfw \
--reason 'Let HAProxy manage port 80' \
--puppet-reason 'TXXXXXX' \
--grace-sleep 1200
Flags
- --puppet-reason
- Puppet reason. Must match the reason used to disable Puppet via Cumin.
roll-reboot.py
Purpose
Depool, unmonitor, and reboot instances one-by-one.
How to Use
Reboot CP nodes in the CDN
Example usage:
cookbook sre.cdn.roll-reboot --alias 'cp-text_ulsfo' --reason 'Kernel update' --task-id T123456
cookbook sre.cdn.roll-reboot --alias 'cp-text_ulsfo' --reason 'Kernel update' --task-id T123456 --grace-sleep 1200
roll-upgrade-haproxy.py
Purpose
Upgrade HAProxy on CDN nodes
How to Use
Roll upgrade HAProxy based on parameters.
Example usage:
cookbook sre.cdn.roll-upgrade-haproxy --alias cp-text_codfw --reason '2.6.9 upgrade' \
--grace-sleep 30 restart_daemons
cookbook sre.cdn.roll-upgrade-haproxy --query 'A:cp-eqiad and not P{cp1001*}' --reason '2.6.9 upgrade' \
--batchsize 2 restart_daemons
upgrade-varnish.py
Purpose
Upgrade/downgrade Varnish on the given cache host between major releases.
- Set Icinga/Alertmanager downtime
- Depool
- Disable puppet (unless invoked with --hiera-merged)
- Wait for admin to merge hiera puppet change (unless invoked with --hiera-merged)
- Remove packages
- Re-enable puppet and run it to upgrade/downgrade
- Run a test request
- Repool
- Remove Icinga/Alertmanager downtime
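Two of the steps above depend on --hiera-merged. The sketch below models only the resulting order of steps as a list of names. The step labels paraphrase the list above, and the function itself is hypothetical rather than taken from the cookbook.

```python
from typing import List


def upgrade_varnish_steps(hiera_merged: bool) -> List[str]:
    """Ordered steps of the upgrade/downgrade procedure, given the --hiera-merged flag."""
    steps = ["set Icinga/Alertmanager downtime", "depool"]
    if not hiera_merged:
        # Without --hiera-merged, Puppet is disabled here and the run waits
        # for an admin to merge the Hiera change.
        steps += ["disable Puppet", "wait for Hiera change to be merged"]
    steps += [
        "remove packages",
        "re-enable and run Puppet",
        "run a test request",
        "repool",
        "remove Icinga/Alertmanager downtime",
    ]
    return steps


if __name__ == "__main__":
    for step in upgrade_varnish_steps(hiera_merged=True):
        print(step)
```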
How to Use
Usage example:
cookbook sre.hosts.upgrade-varnish --hiera-merged "Upgrading varnish -- TXXXXXX" cp3030.esams.wmnet
Flags
- host
- FQDN of the host to act upon.
- --downgrade
- Downgrade varnish instead of upgrading it
- --hiera-merged
- Pass this flag, followed by the Puppet disable message, if Hiera has already been updated and Puppet is already disabled on the host with that message.
DNS
roll-restart-reboot-wikimedia-dns.py
Purpose
Rolling restart of Wikimedia DNS services or full reboot.
Whether restarting the services or rebooting the entire host, the typical decommissioning logic is preserved, and systemd unit ordering should safeguard these units as well.
How to Use
Rolling restart of Wikimedia DNS services
Example usage:
cookbook sre.dns.roll-restart-reboot-wikimedia-dns --alias wikidough-codfw --reason "Scheduled maintenance" restart_daemons
cookbook sre.dns.roll-restart-reboot-wikimedia-dns --query 'A:wikidough-eqiad and not P{doh1001*}' --reason "Scheduled maintenance" --task-id "T12345" --ignore-restart-errors --batchsize 2 --grace-sleep 90 restart_daemons
cookbook sre.dns.roll-restart-reboot-wikimedia-dns --query 'A:wikidough-eqiad and not P{doh1001*}' --reason "Scheduled maintenance" --task-id "T12345" --batchsize 2 reboot
roll-restart-haproxy.py
Purpose
Rolling service restart of HAProxy, which runs on the A:dnsbox hosts and provides internal recursor services.
How to Use
Rolling restart of HAProxy on <TODO>.
Example usage:
cookbook sre.dns.roll-restart-haproxy --query '<TODO> and not P{dns1004*}' --reason "Scheduled maintenance" reboot
roll-restart-reboot-durum.py
Purpose
Rolling service restart or reboot of durum, the Wikimedia DNS check service.
This is based on the roll-restart-reboot-wikimedia-dns cookbook, adapted for durum.
How to Use
Rolling restart of durum.
Example usage:
cookbook sre.dns.roll-restart-reboot-durum --query 'A:durum-eqiad and not P{durum1001*}' --reason "Scheduled maintenance" reboot
cookbook sre.dns.roll-restart-reboot-durum --query 'A:durum-eqiad and not P{durum1001*}' --reason "Scheduled maintenance" --task-id "T12345" --ignore-restart-errors --batchsize 2 --grace-sleep 90 restart_daemons
roll-restart-ntp.py
Purpose
Rolling restart of ntpsec.service on the DNS hosts identified by A:dnsbox.
How to Use
Rolling restart of ntpsec.service on the DNS hosts.
This cookbook performs rolling restarts of ntpsec.service on the DNS hosts. Since Puppet intentionally no longer manages these restarts, this cookbook takes care of them and sets sane automatic defaults for the batches and sleep intervals.
Note that an alert is in place for ntp.conf: if the file is modified and ntpsec.service is not restarted to pick up the changes, the alert fires. The fix is to restart ntpsec.service, which should now be done through this cookbook.
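The alert described above amounts to comparing the time ntp.conf last changed with the time ntpsec.service last started. The sketch below shows that comparison. It assumes both times are available as Unix epoch seconds: on a real host they could come from the file's modification time and from systemd's ActiveEnterTimestamp property for the unit. The actual alert may be implemented differently, and the function names are hypothetical.

```python
import os
import tempfile


def config_newer_than_service(config_mtime: float, service_started_at: float) -> bool:
    """True if the config changed after the service started, i.e. a restart is pending."""
    return config_mtime > service_started_at


def needs_restart(config_path: str, service_started_at: float) -> bool:
    """Compare a config file on disk with the service start time (epoch seconds)."""
    return config_newer_than_service(os.stat(config_path).st_mtime, service_started_at)


if __name__ == "__main__":
    # A temporary file stands in for ntp.conf.
    with tempfile.NamedTemporaryFile() as conf:
        mtime = os.stat(conf.name).st_mtime
        print(needs_restart(conf.name, service_started_at=mtime - 60))  # True: edited after start
        print(needs_restart(conf.name, service_started_at=mtime + 60))  # False: restarted since
```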
Example usage:
cookbook sre.dns.roll-restart-ntp --alias 'A:dnsbox' --task-id T12345 --reason 'Restarting ntp service'
cookbook sre.dns.roll-restart-ntp --alias 'A:dnsbox' --task-id T12345 --reason 'Restarting ntp host' --grace-sleep 900
admin.py
Purpose
Cookbook for GeoDNS pool/depool of a site.
How to Use
Cookbook for GeoDNS pool/depool of a site.
Pool or depool a site for GeoDNS. By default, it will act on a given site for all services (text-addrs, upload-addrs, etc.) unless a service is manually specified via --service.
Usage examples:
cookbook sre.dns.admin depool eqiad # [depools eqiad for everything]
cookbook sre.dns.admin pool magru # [pools magru for everything]
cookbook sre.dns.admin --service upload-addrs -- depool codfw # [depool codfw for upload-addrs]
cookbook sre.dns.admin depool esams --service text-addrs text-next # [depool esams for text*]
Flags
- action
- The kind of action to perform (pool, depool, or show).
- valid arguments are: pool, depool, show
- site
- The site/DC on which to perform the action.
- -s, --service
- The service in the site/DC on which the action should be performed.
- -r, --reason
- An optional reason for the action.
- -t, --task-id
- An optional Phabricator task ID to log the action.
- -f, --force
- If passed, do not prompt for confirmation before acting (default: prompt).
- --emergency-depool-policy
- If passed, override the depool threshold and ignore all depool safety checks
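By default the action covers every service of the site, --service narrows it down, and --emergency-depool-policy bypasses the depool safety checks. The sketch below models that service selection and adds a safety check that refuses to depool when fewer than a minimum number of pooled sites would remain for a service. The service names come from the examples above. The shape of the safety rule, its default of two sites, and all function names are assumptions; the real policy is defined in the cookbook.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set


def select_services(all_services: Sequence[str], requested: Optional[Sequence[str]]) -> List[str]:
    """All services of the site, unless specific ones were requested via --service."""
    if not requested:
        return list(all_services)
    wanted = set(requested)
    unknown = wanted - set(all_services)
    if unknown:
        raise ValueError(f"unknown services: {sorted(unknown)}")
    return [s for s in all_services if s in wanted]


def check_depool(pooled: Dict[str, Set[str]], site: str, services: Iterable[str],
                 min_pooled: int = 2, emergency: bool = False) -> None:
    """Raise if depooling `site` would leave a service with fewer than `min_pooled` sites.

    `pooled` maps each service to the set of sites currently pooled for it.
    """
    if emergency:
        return  # emergency policy: skip all depool safety checks
    for service in services:
        remaining = pooled[service] - {site}
        if len(remaining) < min_pooled:
            raise RuntimeError(f"refusing to depool {site} for {service}: "
                               f"only {len(remaining)} site(s) would remain")


if __name__ == "__main__":
    services = select_services(["text-addrs", "text-next", "upload-addrs"], ["upload-addrs"])
    check_depool({"upload-addrs": {"eqiad", "codfw", "esams"}}, "codfw", services)
    print(f"ok to depool codfw for {services}")
```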
roll-reboot.py
Purpose
Rolling reboot of DNS hosts identified by the cumin alias A:dnsbox.
How to Use
Rolling reboot of DNS hosts.
This cookbook performs rolling reboots of the DNS hosts, referred to by the Cumin alias A:dnsbox. This covers both the recursive and authoritative DNS hosts, since a given DNS box serves both roles.
Example usage:
cookbook sre.dns.roll-reboot --alias 'A:dnsbox' --task-id T12345 --reason 'Restarting DNS host'
cookbook sre.dns.roll-reboot --alias 'A:dnsbox' --task-id T12345 --reason 'Restarting DNS host' --grace-sleep 900
roll-restart.py
Purpose
Rolling service restart of pdns-recursor and HAProxy on A:dnsbox
How to Use
Rolling restart of pdns-recursor and HAProxy on A:dnsbox.
Example usage:
cookbook sre.dns.roll-restart --query 'A:dnsbox and not P{dns1004*}' --reason "Scheduled maintenance" restart_daemons
Flags
- service
- The service to restart
- valid arguments are: authdns-.*