Portal:Cloud VPS/Admin/notes/Monitoring
Monitoring Discussion. This meeting was held on 2020-04-11.
Scope
- Opened by discussing which metrics we need, and whether metrics alone are enough
- jeh: prometheus node-exporter is installed on all VMs today, and it reports the same data we have in shinken today
- andrew: leveraging prod architecture, pros and cons
- arturo: we may not want to follow prod too closely
- brooke: retention. can we decide in prometheus what to retain for more or less time?
- jeh: we can use a time-based or size-based storage retention policy
- arturo: retention is directly related to storage capacity
modules/prometheus/manifests/server.pp: $storage_retention = '730h', <--- default retention in the prometheus puppet module, which is what we are using in tools-prometheus (about 1 month)
- bd808: multitenancy, a prometheus instance per project
- andrew: does prometheus even support multitenancy?
- brooke: somehow yes, by using labels
- andrew: security concerns with multitenancy? or only organizational concerns?
- brooke: not today, but we need to keep security in mind
- bd808: log aggregation is scarier
- andrew: central prometheus server vs per project prometheus server
- jeh: network scoping, security groups, etc
- arturo: prometheus proxy
- jeh: push gateway from prometheus server: not very smart for dynamic environments like VMs being created and destroyed
- brooke:
- scope1: immediate need to shut down shinken. We can shut it down today and not lose much
- scope2: centralized & multi-tenant service
- andrew: imagine a cloud project admin wanting a simple grafana dashboard with prometheus metrics.
- jeh: a local prometheus server allows for custom, per-project alerts. And then a central grafana
- brooke: we apparently are leaning towards prometheus
- arturo: replacing shinken with prometheus+alertmanager could be a good experiment before introducing any cloud-wide solution
- brooke: alertmanager outgoing alerts? smtp server?
- jeh: yes, email + [..] How do we do it with shinken today?
- brooke: let's make a task to replace shinken with prometheus+alertmanager. Alerts: only for us for now.
- brooke: Jason, would you be willing to handle the initial shinken replacement task?
- jeh: sure. What about security groups?
- andrew: probably update every security group out there. Few projects use shinken. Initial change only in the tools project.
- andrew: share info with krenair
- jeh: we already have a prometheus server in the tools project. Would it make sense to just extend it with alertmanager?
- andrew: yeah, why not.
- arturo: make this an OKR so that jason gets proper credit
- brooke: maybe search/create an objective to relate all things together. Also, epic phab task https://phabricator.wikimedia.org/T194333
- jeh: what do we call it? Use prometheus openstack integration to auto discover VMs (node exporter, puppet alerts on day 1, expand from there).
- jeh: initial notifications by email + IRC bots?
- brooke: legit!
- brooke: next step, replace toolschecker, because shinken couldn't generate pages.
- andrew: prometheus in English means forethinker
- arturo: what about monitoring-infra?
- jeh: create new openstack project, add new prometheus server, update existing server groups (and new project template), configure prometheus openstack-sd-config to scrape vms, configure alert manager to email wmcs-team and notify cloud-feed IRC
- brooke: metrics-infra is shorter!
- arturo: shinken project deprecation phab task: https://phabricator.wikimedia.org/T236547
- arturo: an epic phab task with many subtasks: https://phabricator.wikimedia.org/T194333
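
Editor's sketch on the retention discussion above (jeh and arturo: retention vs. storage capacity). This was not part of the meeting. It applies the rule of thumb from the Prometheus storage documentation (disk is roughly retention seconds times ingested samples per second times bytes per sample, at about 1 to 2 bytes per sample) to the '730h' default quoted from the puppet module. The VM count, series per VM and scrape interval are made-up illustrative numbers, not measurements from Cloud VPS, and the function names are hypothetical.

```python
# Rough disk estimate for a Prometheus retention window.
# Rule of thumb (Prometheus storage docs):
#   disk ~= retention_seconds * samples_per_second * bytes_per_sample


def retention_days(retention_hours: float) -> float:
    """Convert a retention value in hours (e.g. '730h') to days."""
    return retention_hours / 24


def estimated_disk_bytes(
    retention_hours: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the 1-2 byte range
) -> float:
    """Estimate the on-disk size needed to keep `retention_hours` of samples."""
    return retention_hours * 3600 * samples_per_second * bytes_per_sample


RETENTION_HOURS = 730  # from $storage_retention = '730h' quoted in the notes

# Hypothetical load, for illustration only:
# 500 VMs, 1000 time series per VM, one scrape every 60 seconds.
SAMPLES_PER_SECOND = 500 * 1000 / 60

if __name__ == "__main__":
    days = retention_days(RETENTION_HOURS)
    gigabytes = estimated_disk_bytes(RETENTION_HOURS, SAMPLES_PER_SECOND) / 1e9
    print(f"retention: {days:.1f} days")
    print(f"estimated disk: {gigabytes:.1f} GB")
```

Under these assumed numbers the estimate scales linearly: doubling the retention time or the number of scraped VMs doubles the disk needed, which is the relationship arturo pointed at. A size-based retention policy inverts the calculation by fixing the disk budget and letting the retained time window vary.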