Monitoring Discussion. This meeting was held on 2020-04-11.

Scope

Started by discussing which metrics we need, and whether metrics alone are enough
jeh: prometheus node-exporter is installed on all VMs today, which also reports the same data we have in shinken today
andrew: leveraging prod architecture, pros and cons
arturo: we may not want to follow prod a lot
brooke: retention. can we decide in prometheus what to retain for more or less time?
jeh: can use time or size based storage retention policy
arturo: retention is directly related to storage capacity

Note: modules/prometheus/manifests/server.pp sets $storage_retention = '730h' (about 1 month). This is the default retention in the prometheus puppet module, and it is what we are using in tools-prometheus.
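
To make the link between retention and storage capacity concrete, here is a rough sizing sketch. It uses the rule of thumb from the Prometheus storage documentation (disk space = retention in seconds * ingested samples per second * bytes per sample, with roughly 1 to 2 bytes per sample). The function names and the fleet numbers in the usage example are assumptions for illustration only, not measurements from our deployment.

```python
def parse_hours(retention: str) -> int:
    """Parse an hour-based retention string such as '730h' into hours.

    Only the 'h' unit is handled in this sketch.
    """
    if not retention.endswith("h"):
        raise ValueError("this sketch only understands values like '730h'")
    return int(retention[:-1])


def estimate_disk_bytes(retention: str,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Estimate disk usage with the Prometheus rule of thumb.

    bytes_per_sample defaults to 2.0, the upper end of the documented
    1-2 bytes per sample average.
    """
    retention_seconds = parse_hours(retention) * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 500 VMs, 1000 series each, scraped every 60 s.
    samples_per_second = 500 * 1000 / 60
    needed = estimate_disk_bytes("730h", samples_per_second)
    print(f"approx. {needed / 1e9:.1f} GB for 730h of retention")
```

Doubling the retention period doubles the estimate, which is why the retention setting and the disk size have to be chosen together.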

bd808: multitenancy, a prometheus instance per project
andrew: does prometheus even support multitenancy?
brooke: somehow yes, by using labels
andrew: security concerns with multitenancy? or only organizational concerns?
brooke: not today, but we need to keep security in mind
bd808: log aggregation is scarier
andrew: central prometheus server vs per project prometheus server
jeh: network scoping, security groups, etc
arturo: prometheus proxy
jeh: push gateway from prometheus server: not very smart for dynamic environments like VMs being created and destroyed
brooke:
- scope1: immediate need to shut down shinken. We can shut it down today and not lose much
- scope2: centralized & multi-tenant service
andrew: imagine a cloud project admin wanting a simple grafana dashboard with prometheus metrics.
jeh: a local prometheus server allows for custom, per-project alerts. And then a central grafana
brooke: we apparently are leaning towards prometheus
arturo: replacing shinken with prometheus+alertmanager could be a good experiment before introducing any cloud-wide solution
brooke: alertmanager outgoing alerts? smtp server?
jeh: yes, email + [..] How do we do it with shinken today?
brooke: let's make a task to replace shinken with prometheus+alertmanager. Alert: only for us for now.
brooke: Jason, would you be willing to handle the initial shinken replacement task?
jeh: sure. What about security groups?
andrew: probably update every security group out there. Few projects use shinken. Initial change only in the tools project.
andrew: share info with krenair
jeh: we already have a prometheus server in the tools project. Would it make sense to just extend it with alertmanager?
andrew: yeah, why not.
arturo: make this an OKR for proper credits for jason
brooke: maybe search/create an objective to relate all things together. Also, epic phab task https://phabricator.wikimedia.org/T194333
jeh: what do we call it? Use prometheus openstack integration to auto discover VMs (node exporter, puppet alerts on day 1, expand from there).
jeh: initial notifications by email + IRC bots?
brooke: legit!
brooke: next step, replace toolschecker, because shinken couldn't generate pages.
andrew: prometheus in English means forethinker
arturo: what about monitoring-infra
jeh: create new openstack project, add new prometheus server, update existing server groups (and new project template), configure prometheus openstack-sd-config to scrape vms, configure alert manager to email wmcs-team and notify cloud-feed IRC
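
As a rough illustration of the "configure prometheus openstack-sd-config to scrape vms" step, the sketch below builds a scrape job as a Python dict and prints it as JSON. The field names under openstack_sd_configs and the __meta_openstack_* labels come from the Prometheus configuration documentation. Everything else (the function name, the job name, the region, the endpoint URL, the user name, the project name and the 9100 node-exporter port) is a placeholder assumption, not our actual setup.

```python
import json


def openstack_scrape_job(region: str,
                         identity_endpoint: str,
                         username: str,
                         project_name: str,
                         node_exporter_port: int = 9100) -> dict:
    """Build one Prometheus scrape job that discovers VMs via OpenStack."""
    return {
        "job_name": "node",  # hypothetical job name
        "openstack_sd_configs": [{
            "role": "instance",            # discover Nova instances
            "region": region,
            "identity_endpoint": identity_endpoint,
            "username": username,
            "password": "<secret>",        # placeholder, never commit real credentials
            "domain_name": "default",      # assumption
            "project_name": project_name,
            "all_tenants": True,           # look at instances in every project
            "port": node_exporter_port,
        }],
        "relabel_configs": [
            # Keep the owning project as a label so metrics can be filtered per project.
            {"source_labels": ["__meta_openstack_project_id"],
             "target_label": "project"},
            # Use the instance name instead of ip:port as the instance label.
            {"source_labels": ["__meta_openstack_instance_name"],
             "target_label": "instance"},
        ],
    }


if __name__ == "__main__":
    job = openstack_scrape_job(
        region="example-region",                         # hypothetical
        identity_endpoint="https://keystone.example/v3",  # hypothetical
        username="prometheus-reader",                     # hypothetical
        project_name="metrics-infra",
    )
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

With discovery like this, VMs that are created or destroyed would be picked up or dropped on the next refresh without editing target lists by hand. The project label is also one way to get the label-based separation between projects mentioned earlier in the discussion.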

brooke: metrics-infra, is shorter!

arturo: shinken project deprecation phab task: https://phabricator.wikimedia.org/T236547
arturo: an epic phab task with many subtasks: https://phabricator.wikimedia.org/T194333