Incidents/20150806-poolcounter
Summary
A firewall change was merged for the server helium, which serves as both the Bacula director and a poolcounter. The ferm service failed to start. As soon as the iptables rules were loaded, the conntrack table filled up, and packets and connections to the poolcounter were dropped. This caused an API outage (cached content was unaffected).
Timeline
- ~ 20:29 UTC - gerrit:229054 is merged to complete T104996 and apply ferm rules on helium; ferm rules exist for both Bacula and poolcounter
- ~ 20:31 UTC - Daniel runs puppet on helium and sees ferm fail to start with error [1]
- ~ 20:32 UTC - Icinga starts to report Socket timeouts for Apache HTTP and HHVM rendering
- ~ 20:33 UTC - Daniel runs a manual script, kept on hand in case ferm fails, to flush all iptables rules [2], then disables the puppet agent
- ~ 20:40 UTC - Brandon starts looking at helium, connects via mgmt and finds [3] shortly after
- ~ 20:42 UTC - Daniel reverts gerrit:229054
- ~ 20:43 UTC - Daniel re-enables the puppet agent and attempts a run, which fails because helium cannot complete DNS lookups; packets are still being dropped
- ~ 20:43 UTC - Brandon manually rmmod's all the iptables kernel modules
- ~ 20:45 UTC - Icinga RECOVERies start showing up
Conclusions
- avoid the issue with the NOTRACK target in ferm
- if ferm fails and the conntrack table fills up, rmmod the iptables kernel modules; flushing all tables is not enough
- poolcounter is a SPOF (T105378)
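The conntrack exhaustion noted above can, in principle, be spotted by comparing the kernel's tracked-connection count with its limit. The following is a minimal sketch and was not part of the incident response: the function names and the 80% default threshold are assumptions made for illustration. The two /proc files named in the usage comment are exposed by the kernel's nf_conntrack module when it is loaded.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the incident).

# conntrack_pct COUNT MAX: print the integer percentage of the table in use.
conntrack_pct() {
    count=$1
    max=$2
    # Refuse a missing or non-positive maximum to avoid dividing by zero.
    if [ -z "$max" ] || [ "$max" -le 0 ]; then
        return 1
    fi
    echo $((count * 100 / max))
}

# conntrack_status COUNT MAX [THRESHOLD]: print a one-line status.
# THRESHOLD defaults to 80 (an assumed value, tune as needed).
conntrack_status() {
    pct=$(conntrack_pct "$1" "$2") || { echo "UNKNOWN: no conntrack limit"; return 3; }
    threshold=${3:-80}
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: conntrack table ${pct}% full"
        return 1
    fi
    echo "OK: conntrack table ${pct}% full"
}

# Usage on a host with nf_conntrack loaded:
#   conntrack_status \
#       "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

Passing the two numbers as arguments, rather than reading /proc inside the function, keeps the arithmetic testable on machines where the module is not loaded.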
Actionables
- stop a single poolcounter server from being a SPOF (T105378)
- detect failing ferm restarts (T108303)
- fix gerrit:228137
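One possible way to approach detecting a failed ferm start is to scan log lines for the two signatures recorded in excerpts [1] and [3]. The sketch below is an illustration only and is not the check tracked in T108303: the function name, the output strings, and the exit codes are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and report the two failure
# signatures seen in this incident. Returns 0 if neither is found, 2 otherwise.
scan_firewall_log() {
    input=$(cat)
    status=0
    # Signature from the puppet-agent lines in excerpt [1].
    if printf '%s\n' "$input" | grep -q 'Could not start Service\[ferm\]'; then
        echo "CRITICAL: ferm failed to start"
        status=2
    fi
    # Signature from the kernel line in excerpt [3].
    if printf '%s\n' "$input" | grep -q 'nf_conntrack: table full'; then
        echo "CRITICAL: conntrack table full, packets dropped"
        status=2
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: no ferm or conntrack failures found"
    fi
    return "$status"
}

# Usage (the log path is an assumption and varies by distribution):
#   scan_firewall_log < /var/log/syslog
```

Matching on log text is fragile compared with querying the service state directly, so this is best read as a starting point for a monitoring check rather than a finished one.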
[1]
Aug 6 20:29:51 helium kernel: [2422180.176691] ip6_tables: (C) 2000-2006 Netfilter Core Team
Aug 6 20:29:52 helium kernel: [2422181.102840] x_tables: ip_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium kernel: [2422181.123525] x_tables: ip6_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Failed to call refresh: Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:
[2]
#!/bin/sh
# removes all iptables rules
# https://en.wikipedia.org/wiki/Tear_down_this_wall!
echo "flushing all iptables rules.."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
echo "done"
[3]
Aug 6 20:30:32 helium kernel: [2422220.736092] nf_conntrack: table full, dropping packet.