Incident documentation/20150806-poolcounter

From Wikitech

20150806-poolcounter

Summary

A firewall change was merged for the server helium, which serves as both the Bacula director and a poolcounter. The ferm service failed to start, and as soon as iptables rules were loaded the conntrack table filled up. Packets and connections to the poolcounter were dropped, which caused an API outage (cached content was unaffected).

Timeline

  1. ~ 20:29 UTC - gerrit:229054 is merged to complete T104996 and apply ferm rules on helium; ferm rules exist for both Bacula and poolcounter
  2. ~ 20:31 UTC - Daniel runs puppet on helium and sees ferm fails to start with error [1]
  3. ~ 20:32 UTC - Icinga starts to report Socket timeouts for Apache HTTP and HHVM rendering
  4. ~ 20:33 UTC - Daniel runs the manual script [2] that was in place to flush all iptables rules in case ferm fails, then disables the puppet agent
  5. ~ 20:40 UTC - Brandon starts looking at helium, connects via mgmt and finds [3] shortly after
  6. ~ 20:42 UTC - Daniel reverts gerrit:229054
  7. ~ 20:43 UTC - Daniel re-enables the puppet agent and attempts a run, which fails because DNS lookups on helium fail; packets are still being dropped
  8. ~ 20:43 UTC - Brandon manually rmmod's all the iptables kernel modules
  9. ~ 20:45 UTC - Icinga RECOVERies start showing up

Conclusions

- avoid the issue with the NOTRACK target in ferm
- if ferm fails and the conntrack table fills up, rmmod the kernel modules; flushing all tables is not enough
- poolcounter is a SPOF (T105378)
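
The kernel messages in [1] state that the NOTRACK target is only valid in the raw table, not in filter. As a minimal sketch of what that placement looks like at the iptables level, the helper below only prints commands and applies nothing. The function name `notrack_rules` and the port passed to it are hypothetical, and this is not the ferm configuration from the reverted change.

```shell
#!/bin/sh
# Print (do not apply) iptables commands that exempt one TCP port from
# connection tracking. NOTRACK rules belong in the raw table.
# notrack_rules and its port argument are illustrative names, not part
# of the original change.
notrack_rules() {
    port="$1"
    for cmd in iptables ip6tables; do
        # Packets arriving for the service
        echo "$cmd -t raw -A PREROUTING -p tcp --dport $port -j NOTRACK"
        # Replies leaving the host from the service
        echo "$cmd -t raw -A OUTPUT -p tcp --sport $port -j NOTRACK"
    done
}
```

Calling `notrack_rules 7531` (the port number here is only an example) prints four lines that can be reviewed before anything is loaded on a host.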

Actionables

  1. stop a poolcounter server failure from being a SPOF (T105378)
  2. detect failing ferm restarts (T108303)
  3. fix gerrit:228137
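
One possible way to approach the second actionable is to look for the puppet error seen in [1]. The following is a rough sketch under that assumption, not the check tracked in T108303. The function name `ferm_start_failed` is hypothetical, and the log file path is left to the caller.

```shell
#!/bin/sh
# Print log lines showing that puppet could not start ferm.
# Exit status follows grep: 0 if at least one line matched, non-zero otherwise.
# ferm_start_failed is an illustrative name; the match string is taken
# from the puppet-agent messages quoted in [1].
ferm_start_failed() {
    logfile="$1"
    grep -F 'Could not start Service[ferm]' "$logfile"
}
```

A monitoring wrapper could run `ferm_start_failed` against the system log and alert when it exits 0.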


[1]

Aug 6 20:29:51 helium kernel: [2422180.176691] ip6_tables: (C) 2000-2006 Netfilter Core Team
Aug 6 20:29:52 helium kernel: [2422181.102840] x_tables: ip_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium kernel: [2422181.123525] x_tables: ip6_tables: NOTRACK target: only valid in raw table, not filter
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Failed to call refresh: Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:
Aug 6 20:29:52 helium puppet-agent[10048]: (/Stage[main]/Ferm/Service[ferm]) Could not start Service[ferm]: Execution of '/etc/init.d/ferm start' returned 1:

[2]


#!/bin/sh
# removes all iptables rules
# https://en.wikipedia.org/wiki/Tear_down_this_wall!
echo "flushing all iptables rules.."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
echo "done"

[3]

Aug  6 20:30:32 helium kernel: [2422220.736092] nf_conntrack: table full, dropping packet.
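
The message in [3] appears once the conntrack table is already full. A hedged sketch of an earlier warning is to compare the kernel's `nf_conntrack_count` and `nf_conntrack_max` values under `/proc/sys/net/netfilter`. The function names and the default threshold of 80 percent are assumptions made for illustration.

```shell
#!/bin/sh
# Integer percentage of the conntrack table in use.
conntrack_pct() {
    count="$1"
    max="$2"
    echo $(( $count * 100 / $max ))
}

# Compare current usage against a threshold (default 80, an arbitrary choice).
# The directory argument exists so the check can be pointed at test data;
# by default it reads the kernel's live counters.
check_conntrack() {
    dir="${1:-/proc/sys/net/netfilter}"
    threshold="${2:-80}"
    if [ ! -r "$dir/nf_conntrack_count" ]; then
        # The counters only exist while the conntrack module is loaded
        echo "OK: conntrack counters not present"
        return 0
    fi
    pct=$(conntrack_pct "$(cat "$dir/nf_conntrack_count")" "$(cat "$dir/nf_conntrack_max")")
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: conntrack table ${pct}% full"
        return 1
    fi
    echo "OK: conntrack table ${pct}% full"
}
```

Running `check_conntrack` with no arguments reads the live values; a non-zero exit status signals that usage has reached the threshold.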