Portal:Cloud VPS/Admin/Alerts


Alerts that can be sent to WMCS-team (currently to WMCS-bots):


  • Nova-Fullstack (labnet) - launches a "full" test of instance creation
  • nova-network (labnet) - handles dynamic NAT and the networking gateway
  • nova-api (labnet) - main API gateway for interacting with nova (creation, deletion, etc.)
  • nova-scheduler (labcontrol) - schedules and launches instances
  • nova-compute - handles setup and teardown of instances on the hypervisor
  • nova-conductor - DB broker for nova components other than nova-api


  • glance-api-http (control) - image management for instances


  • projects and users
    • check-novaobserver-membership - Make sure 'novaobserver' has 'observer' everywhere
    • check-novaadmin-membership - Make sure 'novaadmin' has 'projectadmin' and 'user' everywhere
    • check-keystone-projects - Verify service projects
  • services
    • keystone-http-${auth_port} - admin API port availability (little context)
    • keystone-http-${public_port} - public API port (little context)
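
The two keystone-http checks above are port availability checks. How the production check is implemented is not stated on this page; the following is only a minimal Python sketch of a TCP reachability probe. The function name `port_is_open` and its parameters are hypothetical and chosen for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative only: this checks that something accepts connections on
    the port, not that the service behind it answers API requests correctly.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A caller would pass the keystone host and the value of `${auth_port}` or `${public_port}` for the deployment, for example `port_is_open(host, auth_port)`.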


  • check_designate_api_process: API service for DNS changes
  • designate-api-http: external monitoring of the API
  • check_designate_sink_process
  • check_designate_central_process
  • check_designate_mdns
  • check_designate_pool-manager


  • nfsd-exports - sets up /etc/export.d/ files for instances in cloud
  • interfaces - saturation in/out
  • ldap - there is a scheme to use LDAP for groups without making the entire system an LDAP client.
  • secondary - checks specific to the 'secondary' Toolforge DRBD/NFSd cluster



  • tools-proxy - reverse proxy for all web tools
  • tools-checker-self - reverse proxy for the actual check running. This is currently used to monitor Toolforge from production Icinga.
  • tools-checker-ldap - without LDAP, Toolforge crumbles.
  • tools-checker-labs-dns-private - verify resolution for internal DNS from within Toolforge
  • tools-checker-nfs-home - NFS /home test (this is really a subpath of a single export shared by project and home)
  • tools-checker-grid-start-trusty - starting and running a process on grid
  • tools-checker-etcd-flannel - etcd is the backend for flannel which is our networking overlay for k8s
  • tools-checker-etcd-k8s - etcd is the persistent data store for k8s itself
  • tools-checker-k8s-node-ready - checks whether k8s thinks workers (nodes) are healthy


Ceph Cluster Health

Global Ceph cluster health state.

Icinga status check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Ceph+Cluster+Health

Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/cloudvps-ceph-cluster

Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/


  • 0 - Healthy
  • 1 - Unhealthy (The cluster is currently degraded, but there should be no interruption in service.)
  • 2 - Critical (The cluster is in a critical state; it is very likely that there are non-functioning services or inaccessible data.)
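
As a rough illustration of the status values above, the Python sketch below maps them to short descriptions. It assumes that 0, 1 and 2 correspond to the `HEALTH_OK`, `HEALTH_WARN` and `HEALTH_ERR` strings reported by Ceph; that correspondence is an assumption and is not stated on this page. All names in the block are hypothetical.

```python
from enum import IntEnum


class CephHealthStatus(IntEnum):
    """The three status values listed above."""
    HEALTHY = 0
    UNHEALTHY = 1
    CRITICAL = 2


# Assumption: the numeric values line up with Ceph's health strings.
HEALTH_STRINGS = {
    "HEALTH_OK": CephHealthStatus.HEALTHY,
    "HEALTH_WARN": CephHealthStatus.UNHEALTHY,
    "HEALTH_ERR": CephHealthStatus.CRITICAL,
}

DESCRIPTIONS = {
    CephHealthStatus.HEALTHY: "Healthy",
    CephHealthStatus.UNHEALTHY: "Unhealthy: degraded, no service interruption expected",
    CephHealthStatus.CRITICAL: "Critical: services or data are likely unavailable",
}


def status_from_health(health: str) -> CephHealthStatus:
    """Translate a Ceph health string into one of the status values."""
    try:
        return HEALTH_STRINGS[health.strip()]
    except KeyError:
        raise ValueError(f"unrecognised health string: {health!r}") from None


def describe(status: int) -> str:
    """Return the short description for a numeric status (0, 1 or 2)."""
    return DESCRIPTIONS[CephHealthStatus(status)]
```

For example, `describe(status_from_health("HEALTH_WARN"))` gives the description for status 1.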

Next steps: Connect to one of the Ceph mon hosts and identify the cause

cloudcephmon1001:~$ sudo ceph health detail
cloudcephmon1001:~$ sudo ceph -s
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 6d)
   mgr: cloudcephmon1002(active, since 6w), standbys: cloudcephmon1003, cloudcephmon1001
   osd: 24 osds: 24 up (since 6d), 24 in (since 6d)

   pools:   1 pools, 256 pgs
   objects: 5.46k objects, 21 GiB
   usage:   87 GiB used, 42 TiB / 42 TiB avail
   pgs:     256 active+clean

   client:   18 KiB/s wr, 0 op/s rd, 3 op/s wr
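
When triaging, the health line and the OSD counts are usually the first things to read in the output above. The Python sketch below pulls them out of the plain-text `ceph -s` output. It assumes the line layout matches the sample shown here, and the function name `parse_ceph_status` is hypothetical. For scripting, `ceph -s --format json` is likely a more robust source than parsing text.

```python
import re

# Excerpt of the sample output shown above.
SAMPLE = """\
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 6d)
   osd: 24 osds: 24 up (since 6d), 24 in (since 6d)
"""


def parse_ceph_status(text: str) -> dict:
    """Extract the health string and OSD counts from `ceph -s` text output.

    Fields that cannot be found are returned as None.
    """
    result = {"health": None, "osds_total": None, "osds_up": None, "osds_in": None}

    health = re.search(r"^\s*health:\s*(\S+)", text, re.MULTILINE)
    if health:
        result["health"] = health.group(1)

    # Matches e.g. "osd: 24 osds: 24 up (since 6d), 24 in (since 6d)".
    osd = re.search(r"^\s*osd:\s*(\d+) osds:\s*(\d+) up.*?(\d+) in", text, re.MULTILINE)
    if osd:
        total, up, in_ = (int(g) for g in osd.groups())
        result.update(osds_total=total, osds_up=up, osds_in=in_)

    return result
```

Comparing `osds_up` and `osds_in` against `osds_total` gives a quick hint as to whether a degraded state is caused by OSDs that are down or out.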


   * https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Infrastructure
   * https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin