User:Taavi/Loki notes 2.0
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.
notes for deploying Grafana Loki as log storage for Toolforge.
previous attempt: User:Taavi/Loki notes (2021)
deployment plan
- solve TODOs below with a lima-kilo deployment
- deploy loki/alloy to toolsbeta
- deploy loki/alloy to tools
- can start with specific nodes/tools to ensure the scale doesn't cause issues
- move toolforge jobs logs to query loki
- deprecate file logging
loki deployment
- Separate loki deployments for tool log data and for infrastructure data
- ingress-nginx access logs probably out of scope for this
- TODO: what about buildservice build logs?
- storage: s3 onto ceph, Help:Object storage user guide
- sizing
- assuming a single tool produces max 10M/day (which is probably totally overkill, but that's better than the opposite), and we have ~2k active tools[1]
- => this is about 20G/day
- the docs say monolithic mode is suitable for up to about 20G/day, which is right at this estimate, so for some future-proofing let's just go with the simple scalable mode.
- per above assumptions 20G/day * 14 days (which seems like a good retention period as a starter) = 280G total log volume.
- loki indexes and such will add some overhead, but not a ton
- we should still stay clearly under half a terabyte with all this
- TODO: can ceph/rados enforce per-bucket limits?
- TODO: rate limits
- alloy's stage.limit for per-pod (or per-worker+tool) limiting
- also a global per-tool limit in loki?
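The sizing arithmetic above can be re-run with different inputs using a small sketch like the one below. The function name and the overhead parameter are made up for illustration; the 10M/day, ~2k tools and 14 day figures are the assumptions from the notes, and the index overhead is left at zero because the notes do not quantify it.

```python
def estimate_volume_gb(per_tool_mb_per_day, active_tools, retention_days, overhead=0.0):
    """Return (daily volume, retained volume) in decimal gigabytes.

    overhead is a hypothetical fraction for loki indexes and similar
    (0.25 would mean 25% extra); the notes only say "not a ton".
    """
    daily_gb = per_tool_mb_per_day * active_tools / 1000
    total_gb = daily_gb * retention_days * (1 + overhead)
    return daily_gb, total_gb


# Inputs taken from the notes: 10M/day per tool, ~2k active tools, 14 days.
daily, total = estimate_volume_gb(10, 2000, 14)
print(f"{daily:.0f}G/day, {total:.0f}G retained")  # 20G/day, 280G retained
```

Even with a guessed 25% overhead the retained volume comes to 350G, which is still under the half-terabyte ceiling mentioned above.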
ingestion
- grafana alloy (since promtail got deprecated :()
- Run as a daemonset, read pod logs from the node's local filesystem
- Alloy supports streaming logs from the K8s api, but that seems a bit too expensive when this also works
- Alloy seems to support feeding things into multiple lokis natively based on some matching
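As a rough illustration of the matching that would decide which loki a log line goes to, here is a Python sketch of the decision logic only. It is not Alloy configuration. It assumes the standard kubelet log layout (/var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log) and assumes that tool workloads can be recognised by a "tool-" namespace prefix; the target names "tools" and "infra" are placeholders.

```python
from pathlib import PurePosixPath

# Assumption: tool workloads live in namespaces named tool-<toolname>.
TOOL_NAMESPACE_PREFIX = "tool-"


def namespace_from_pod_log_path(path):
    """Extract the namespace from a kubelet pod log path.

    The directory after "pods" is <namespace>_<pod>_<uid>. Namespaces cannot
    contain underscores, so splitting on the first one is safe.
    """
    parts = PurePosixPath(path).parts
    pod_dir = parts[parts.index("pods") + 1]
    return pod_dir.split("_", 1)[0]


def loki_target(path):
    """Pick the (placeholder) loki deployment for a given pod log file."""
    namespace = namespace_from_pod_log_path(path)
    return "tools" if namespace.startswith(TOOL_NAMESPACE_PREFIX) else "infra"


print(loki_target("/var/log/pods/tool-example_job-abc_1234/job/0.log"))
print(loki_target("/var/log/pods/kube-system_coredns-xyz_5678/coredns/0.log"))
```

In a real deployment the same split would be expressed in Alloy's own configuration; the sketch is only meant to pin down what "based on some matching" could mean here.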
querying
- This needs a bit more planning
- Initial interface is probably replacing toolforge jobs logs
- Should jobs-api talk to loki directly? or go through some other custom service?
- Description of phab:T127367 wants a separate service for this
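Whichever service ends up doing the querying, it would most likely go through Loki's HTTP API. The sketch below builds a query_range request and flattens the JSON response into log lines, without doing any network I/O. The base URL, the function names and the namespace label (and its tool-<name> value) are assumptions; the actual labels depend on how alloy ends up labelling streams.

```python
import json
from urllib.parse import urlencode


def build_query_range_url(base_url, tool, start_ns, end_ns, limit=100):
    """Build a /loki/api/v1/query_range URL for one tool's logs.

    Assumption: streams carry a namespace label of the form tool-<name>.
    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    logql = f'{{namespace="tool-{tool}"}}'
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "forward",
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


def extract_lines(response_body):
    """Flatten a query_range "streams" response into lines ordered by time."""
    data = json.loads(response_body)
    entries = []
    for stream in data["data"]["result"]:
        # Each value is a [timestamp in ns as a string, log line] pair.
        for timestamp, line in stream["values"]:
            entries.append((int(timestamp), line))
    entries.sort()
    return [line for _, line in entries]
```

Keeping the URL building and response parsing separate from the HTTP call would make it easy to put this behind either jobs-api or a separate service, which is the open question above.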