Portal:Toolforge/Admin/Monthly meeting/2025-10-22
Attendees (shuffled)
- Seyram Komla Sapaty
- Filippo Giunchedi
- David Caro
- Andrew Bogott
- Alexandros Kosiaris
- Bryan Davis (bd808)
- Taavi Väänänen
- Raymond Ndibe
- Francesco Negri
Notes
k8s upgrade workgroup progress
- T372697 [infra,k8s] Upgrade Toolforge Kubernetes to version 1.31
- tv: Has not had time yet to get into the upgrade of k8s itself
- Fn: is the idea that both teams will get involved in the current upgrades, or will only the tools-platform team do them for now? Or could it be voluntary?
- decided to make it voluntary for now, and discuss later/offline
- dc: No updates on replacing Kyverno with Kubernetes ValidatingAdmissionPolicies (VAPs); see the sketch after this list
- It’s not a blocker for the upgrade (yet)
- Ab: is production running a single k8s version, or are different clusters on different versions?
- Ak: right now it is 1.31; the clusters are not locked in step, upgrades can happen per cluster, and two versions are supported at any given time
- Ab: is the version jump an issue, or does it not matter?
- Ak: currently the process requires a full cluster rebuild, but the process will change soon
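For context on the VAP note above, here is a minimal sketch (not discussed in the meeting) of what a ValidatingAdmissionPolicy replacing a simple Kyverno rule could look like. The policy shown (rejecting privileged containers) and every name in it are made-up examples, not the actual Toolforge policies; the field layout follows the admissionregistration.k8s.io/v1 API that went GA in Kubernetes 1.30. The script only prints a manifest that could be fed to `kubectl apply -f -`.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: a ValidatingAdmissionPolicy (VAP) standing in for a
simple Kyverno "no privileged containers" rule. Not the actual Toolforge policy
set. Requires a cluster serving admissionregistration.k8s.io/v1 (k8s >= 1.30)."""
import json

# The policy: validation is a CEL expression evaluated by the API server itself,
# so no external admission webhook (i.e. no Kyverno deployment) is involved.
policy = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingAdmissionPolicy",
    "metadata": {"name": "deny-privileged-containers"},  # hypothetical name
    "spec": {
        "failurePolicy": "Fail",
        "matchConstraints": {
            "resourceRules": [
                {
                    "apiGroups": [""],
                    "apiVersions": ["v1"],
                    "operations": ["CREATE", "UPDATE"],
                    "resources": ["pods"],
                }
            ]
        },
        "validations": [
            {
                "expression": (
                    "object.spec.containers.all(c,"
                    " !has(c.securityContext) ||"
                    " !has(c.securityContext.privileged) ||"
                    " c.securityContext.privileged == false)"
                ),
                "message": "privileged containers are not allowed",
            }
        ],
    },
}

# A binding is needed for the policy to take effect; a real one would be scoped
# (e.g. to tool namespaces) via spec.matchResources instead of applying everywhere.
binding = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingAdmissionPolicyBinding",
    "metadata": {"name": "deny-privileged-containers-binding"},
    "spec": {
        "policyName": "deny-privileged-containers",
        "validationActions": ["Deny"],
    },
}

# Emit a v1 List so the output can be piped to `kubectl apply -f -`.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [policy, binding]}, indent=2))
```

The appeal of VAPs is that the CEL validations run inside the API server, so there is no separate admission webhook to deploy and keep compatible across cluster upgrades.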
Push to deploy beta
- dc: Doing another pass at the MVP/stable feature set (Push-to-deploy MVP)
- dc: Minor fixes/features see changelog https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog
Sustainability score
- Sk: reached out for feedback with a deadline of end of October; will follow up today with a reminder
NFS server update
- Filippo is cautiously optimistic that the tools NFS server upgrade is helping with workers getting stuck in ‘D’ state (see the sketch below) https://phabricator.wikimedia.org/T404584 🎉
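As an aside (not from the meeting): ‘D’ state is uninterruptible sleep, i.e. a process blocked in the kernel waiting on I/O, which on Toolforge workers usually means hung NFS. A minimal sketch of how one might list such processes straight from /proc on a worker node; the function name is made up.

```python
#!/usr/bin/env python3
"""Minimal diagnostic sketch: list processes in uninterruptible sleep ('D'),
which on an NFS client typically indicates hung I/O. Uses only /proc."""
import os


def d_state_processes():
    """Yield (pid, comm) for every process currently in 'D' state."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            # Process exited while we were scanning; skip it.
            continue
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces
        # or parentheses, so locate it between the first '(' and the last ')'.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2 : stat.rindex(")") + 3]
        if state == "D":
            yield int(entry), comm


if __name__ == "__main__":
    stuck = list(d_state_processes())
    for pid, comm in stuck:
        print(f"{pid}\t{comm}")
    print(f"{len(stuck)} process(es) in D state")
```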
TOM (toolforge on metal) update
- Still in the planning/documentation stage:
- Top-level task https://phabricator.wikimedia.org/T407296
- Identify candidate tools https://phabricator.wikimedia.org/T407502
- Persistent volumes first?
- That would probably add a lot of complexity, although we do want those eventually and want to move away from NFS completely
- Should we just use Cloud-VPS NFS servers?
- DC: I would start with a separate instance of NFS.
- AB: We can have a physical host (or Ganeti VM) running the NFS server, but still have the data on Ceph. So we don’t have storage tied to a single physical server.
- AK: We probably don’t need a Ganeti cluster, because the puppetization of the control plane allows collocating etcd and workloads (effectively using the control plane nodes as workers too)
Side question: does NFS here cover scratch space, home dirs, and clouddumps?
- The complexity for NFS comes mostly from user and tool home dirs; scratch and dumps should be easy to just use/replicate
Logs/Loki
- DC: Rate limit changes this week: we’re losing fewer logs but we’re using more disk space