Portal:Toolforge/Admin/Logs Service
Documentation of components and common admin procedures for the Logs Service. It is currently embedded as part of the jobs CLI (see Portal:Toolforge/Admin/Jobs Service).
Components
- Logs API (source code): main entry point for clients (users access it through the jobs CLI)
- Tools Loki (source code, under components/logging): ingests and stores the pod logs for the Logs API to retrieve later.

Alerts
List of alerts: https://prometheus.svc.toolforge.org/tools/alerts?search=logs
Runbooks: Category:LogsApiRunbooks.
- Dashboard from the cloud UI: https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive&q=project%3D~^%28tools%7Ctoolsbeta%29
- Dashboard from the prod UI: https://alerts.wikimedia.org/?q=team%3Dwmcs&q=project%3D~%28tools%7Ctoolsbeta%29
Dashboards
https://grafana-rw.wmcloud.org/d/kcAb-KUSe/logs-service-overview
Main phabricator board
https://phabricator.wikimedia.org/project/board/539/
Administrative tasks
Starting a service
Logs API
This lives in Kubernetes, behind the API gateway. To start it you can try redeploying it; to do so, follow Portal:Toolforge/Admin/Kubernetes/Components#Deploy (the component is logs-api).
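The linked page is authoritative for the deploy procedure; as a hedged sketch, assuming the wmcs.toolforge.component.deploy cookbook and these flag names (verify both there before running):
# cookbook name and flags are an assumption, check Components#Deploy; run from a host with the wmcs cookbooks set up
cookbook wmcs.toolforge.component.deploy --cluster-name tools --component logs-api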
You can monitor whether it's coming up with the usual k8s commands:
root@tools-k8s-control-9:~# kubectl get all -n logs-api
NAME                            READY   STATUS    RESTARTS        AGE
pod/logs-api-7b956f999f-k9htb   2/2     Running   1 (4d21h ago)   4d21h
pod/logs-api-7b956f999f-q6dws   2/2     Running   1 (4d21h ago)   4d21h

NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/logs-api   ClusterIP   10.111.193.102   <none>        8443/TCP   4d21h

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/logs-api   2/2     2            2           4d21h

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/logs-api-7b956f999f   2         2         2       4d21h
Tools loki log ingestion
This is also a k8s component; follow Portal:Toolforge/Admin/Kubernetes/Components#Deploy (the component is logging).
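Under the same cookbook assumption as above, a hedged sketch of the deploy plus rollout checks with plain kubectl (resource names are the ones listed in the output below):
# cookbook name and flags are an assumption, check Components#Deploy
cookbook wmcs.toolforge.component.deploy --cluster-name tools --component logging
# then, from a k8s control node, wait for the three Loki roles to roll out
kubectl -n loki rollout status statefulset/loki-tools-write
kubectl -n loki rollout status statefulset/loki-tools-backend
kubectl -n loki rollout status deployment/loki-tools-read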
root@tools-k8s-control-9:~# kubectl get all -n loki
NAME                                  READY   STATUS    RESTARTS   AGE
pod/loki-tools-backend-0              1/1     Running   0          43d
pod/loki-tools-backend-1              1/1     Running   0          43d
pod/loki-tools-backend-2              1/1     Running   0          43d
pod/loki-tools-read-78fcb9b8f-jtlqk   1/1     Running   0          43d
pod/loki-tools-read-78fcb9b8f-px9f7   1/1     Running   0          43d
pod/loki-tools-read-78fcb9b8f-tpdv5   1/1     Running   0          43d
pod/loki-tools-write-0                1/1     Running   0          43d
pod/loki-tools-write-1                1/1     Running   0          43d
pod/loki-tools-write-2                1/1     Running   0          43d

NAME                                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/loki-tools-backend                     ClusterIP   10.106.142.0     <none>        3100/TCP,9095/TCP   111d
service/loki-tools-backend-headless            ClusterIP   None             <none>        3100/TCP,9095/TCP   111d
service/loki-tools-memberlist                  ClusterIP   None             <none>        7946/TCP            111d
service/loki-tools-query-scheduler-discovery   ClusterIP   None             <none>        3100/TCP,9095/TCP   111d
service/loki-tools-read                        ClusterIP   10.102.254.64    <none>        3100/TCP,9095/TCP   111d
service/loki-tools-read-headless               ClusterIP   None             <none>        3100/TCP,9095/TCP   111d
service/loki-tools-write                       ClusterIP   10.109.223.201   <none>        3100/TCP,9095/TCP   111d
service/loki-tools-write-headless              ClusterIP   None             <none>        3100/TCP,9095/TCP   111d

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-tools-read   3/3     3            3           111d

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-tools-read-78fcb9b8f   3         3         3       111d

NAME                                  READY   AGE
statefulset.apps/loki-tools-backend   3/3     111d
statefulset.apps/loki-tools-write     3/3     111d
Note that there's also a daemonset (called alloy) that starts a pod on each Kubernetes worker to gather the logs and send them to the Loki service:
root@tools-k8s-control-9:~# kubectl get all -n alloy
NAME              READY   STATUS    RESTARTS   AGE
pod/alloy-25n87   1/1     Running   0          23h
pod/alloy-26dl4   1/1     Running   0          20h
...
pod/alloy-zn5v7   1/1     Running   0          23h
pod/alloy-zz9jc   1/1     Running   0          22h

NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/alloy   74        74        74      74           74          <none>          111d
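If the alloy pods need a restart (for example after a configuration change), a hedged sketch with plain kubectl from a control node:
# rolling restart of every alloy pod
kubectl -n alloy rollout restart daemonset/alloy
kubectl -n alloy rollout status daemonset/alloy
# check that a specific worker is running its alloy pod (the node name here is hypothetical)
kubectl -n alloy get pods -o wide --field-selector spec.nodeName=tools-k8s-worker-nfs-1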
Stopping a service
Logs API
This is a simple deployment; you can just delete it and recreate it.
TBD: add commands
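Until that is filled in, a hedged sketch with plain kubectl, using the namespace and deployment name shown above (scaling to zero is gentler than deleting, since the deployment object itself is managed by the component deploy):
# stop: scale down to zero replicas
kubectl -n logs-api scale deployment logs-api --replicas=0
# start again (or simply redeploy the logs-api component as described above)
kubectl -n logs-api scale deployment logs-api --replicas=2
kubectl -n logs-api rollout status deployment/logs-api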
Tools loki log ingestion
Loki deployment
TBD: it's a complicated app with three different components: write, backend and read. Read is a regular deployment, so you can delete it and recreate it later; write and backend are stateful sets, which you can try to delete too.
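A hedged sketch of what that could look like with plain kubectl, using the names from the output above (scaling a stateful set to zero keeps its persistent volume claims around, and redeploying the logging component should restore anything you delete):
# read path: a regular deployment
kubectl -n loki scale deployment loki-tools-read --replicas=0
# write and backend paths: stateful sets
kubectl -n loki scale statefulset loki-tools-write --replicas=0
kubectl -n loki scale statefulset loki-tools-backend --replicas=0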
Alloy
TBD: daemonsets have no replica count to scale down, so you probably have to either delete/recreate the daemonset or patch its node selector so it matches no nodes.
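A hedged sketch of both options (the "parked" label is arbitrary and hypothetical):
# option 1: delete the daemonset; a redeploy of the logging component should recreate it
kubectl -n alloy delete daemonset alloy
# option 2: park it behind a node selector that matches no node
kubectl -n alloy patch daemonset alloy -p '{"spec":{"template":{"spec":{"nodeSelector":{"parked":"true"}}}}}'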
Checking all components are alive
You can check the dashboard (linked above) for a high-level view.
Logs API
TBD
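In the meantime, a hedged sketch of basic checks with kubectl from a control node:
# both replicas should be Ready and the rollout complete
kubectl -n logs-api get pods
kubectl -n logs-api rollout status deployment/logs-api
# peek at recent logs from all containers for errors
kubectl -n logs-api logs deployment/logs-api --all-containers --tail=50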
Tools loki log ingestion
TBD
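In the meantime, a hedged sketch: Loki exposes its standard /ready endpoint on the HTTP port (3100), reachable with a port-forward from a control node (assuming curl is available there):
# all pods should be Running and Ready
kubectl -n loki get pods
# readiness through the read path; should print "ready"
kubectl -n loki port-forward svc/loki-tools-read 3100:3100 &
sleep 2
curl -s http://localhost:3100/ready
kill %1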
