Toolforge Workgroup meeting 2023-02-21 notes

Participants

Andrew Bogott
Arturo Borrero Gonzalez
Bryan Davis
David Caro
Nicholas Skaggs
Seyram Komla Sapaty
Taavi Väänänen
Raymond Ndibe

Agenda

Toolforge API gateway - https://phabricator.wikimedia.org/T329443
cert-manager for Toolforge k8s certificates https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager/
new container image for mariadb: https://phabricator.wikimedia.org/T254636
K8s upgrade, Taavi prepping for it?
Which buildpacks to allow for the beta: https://phabricator.wikimedia.org/T330102
Next Toolforge meetings, cadence is right?
Discussion: Adopt an upgrade policy / cadence for Toolforge Kubernetes https://phabricator.wikimedia.org/T325196

Notes

Thanks Nicholas for capturing notes!

Toolforge API gateway https://phabricator.wikimedia.org/T329443

TV: We’re building lots of new APIs. Proposal: build a simple nginx proxy to act as a gateway and simplify access.

DC: Uses TLS certs? How would it work?

TV: Authenticate the user with their k8s cert; we have cert-manager

DC: How will backend API know which user?

TV: Use an HTTP header

ABG: Move complexity for user auth out of API

TV: Also centralizes user authentication in one location for later
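A minimal sketch of how a backend API might read the user identity that the gateway passes along, assuming the gateway has already verified the client certificate. The header name `X-Toolforge-User` and the helper names are hypothetical; the notes only say that an HTTP header would be used.

```python
# Hypothetical header name: the notes only say "use an HTTP header".
USER_HEADER = "X-Toolforge-User"


class MissingUserError(Exception):
    """Raised when the request carries no user identity from the gateway."""


def user_from_headers(headers: dict[str, str]) -> str:
    """Return the user name the gateway placed in the request headers.

    HTTP header names are case-insensitive, so names are compared lowercased.
    """
    wanted = USER_HEADER.lower()
    for name, value in headers.items():
        if name.lower() == wanted and value.strip():
            return value.strip()
    raise MissingUserError(f"no {USER_HEADER} header on request")
```

A backend written this way trusts the header completely, so it would only be safe if the API is reachable solely through the gateway.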

AB: Lots of load balancers and proxies. Can we consolidate? What features are needed? Can they be added to the existing toolforge gateway?

TV: Tools can’t do HTTPS termination. Proxy can’t do TLS termination without exposing the service publicly. Simpler to do it as proposed, in the k8s cluster.

ABG: Future for toolforge front proxy is to move into k8s.

TV: Yes, move to haproxy

ABG: This is likely far in the future

BD: urlproxy does more than the k8s/grid split, but the other bits could be moved I suppose. There are more things that I am hoping to add in Q4 ;)

BD: Looking to utilize urlproxy/domainproxy for CSP header compliance moving forward, so it has more value beyond just gridengine

DC: Writing this in python or golang?

TV: It’s all in nginx, not a special service

DC: Might be nice to split that into another code base, maybe even using sockets to communicate, to reduce the network overhead

FN: Any alternatives to using nginx? Envoy?

TV: No grand vision. But consider in the future to not be limited to single use cases

BD: I think the prod replacement for restbase as a proxy layer is being done with envoy?

cert-manager for Toolforge k8s certificates https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager/

K8s is dropping support for signing server certificates. It can’t sign certificates for webhooks, for example. One solution is to use cert-manager to deploy certificates for those things. This blocks upgrades to 1.22.

ABG: Thank you for this work Taavi. This is the last blocker for 1.22? TV: Yes it is. Can be done fairly soon.

ABG: Within the repo, you can explore the examples. Comes with watchers that will reload certs as needed.

new container image for mariadb: https://phabricator.wikimedia.org/T254636

ABG: Brand new container, with no version attached. First time making a container using the new approach of a separate config map. Might be the last container image before buildpacks? Interesting exercise to add the new image: exercised all the scripts to create and inject the image into the docker registry. Had a previous discussion about what the rule is for introducing container images. Patch written by Bryan, reviewed by Taavi and Kunal, so a nicely agreed change. Unblocks grid migration!

FN: What’s the policy for updates? Does it autoupdate? Need manual rebuild?

BD: Same as all containers today. Requires manual rebuild. No automation. Still hopes for a better replacement post-buildpacks?

FN: Tomorrow new major version of mariadb comes out. Would a rebuild trigger it?

BD: Only if debian would adopt the new version

FN: So it pulls from debian?

BD: Yes, stays in sync with debian stable.

TV: bullseye is 10.5, bookworm maybe 10.11?

FN: Was hoping for the opposite of magically jumping versions 🙂

BD: In the gerrit patch, we debated putting version numbers on things. Trying to make this a user container, not a stable software base. Mariadb commands should be stable

ABG: If you want to know the new process for getting images out: https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config. Taavi wrote something to sync this with wikitech

TV: First time trying this, but can automatically update wikitech with a bot and some lua. https://wikitech.wikimedia.org/wiki/Module:Toolforge_images/data.json, https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Choosing_the_execution_runtime, https://wikitech.wikimedia.org/wiki/Module:Toolforge_images

K8s upgrade, Taavi prepping for it?

ABG: Getting rid of old cert api. See discussion above on cert-manager

TV: Nothing to add

Which buildpacks to allow for the beta: https://phabricator.wikimedia.org/T330102

DC: Not looking for a decision. Just sharing the decision request and raising awareness. However, feel free to ask questions now or later

TV: What will workflows look like? What’s the difference between images, and how will they be chosen?

DC: Looking for how best to support multi-stack. For single-stack applications, when the builder image runs, each buildpack has a detection script that detects and runs the correct things. The default way is to use the autodetection within buildpacks. There is no way to force a buildpack. If a buildpack’s detection fails, it won’t run / apply. For multi-stack, trying to see how buildpacks respond to mixed-language projects. It can be slow but seems to work. This proposal is only for the beta.
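The autodetection DC describes could be sketched roughly as follows. This is a simplified illustration, not how the builder image is implemented: real buildpacks ship their own detection scripts, and the buildpack names and marker files below are assumptions chosen for the example.

```python
# Hypothetical marker files per buildpack; real buildpacks run their own
# detection scripts rather than consulting a shared table like this one.
DETECT_RULES = {
    "python": ("requirements.txt", "pyproject.toml"),
    "php": ("composer.json",),
    "nodejs": ("package.json",),
}


def detect_buildpacks(project_files: set[str]) -> list[str]:
    """Return the buildpacks whose detection passes, in rule order.

    A buildpack whose detection fails is simply left out; nothing here
    lets the caller force a buildpack that did not detect.
    """
    return [
        name
        for name, markers in DETECT_RULES.items()
        if any(marker in project_files for marker in markers)
    ]
```

A single-stack project matches one buildpack, while a mixed-language project matches several, which is the multi-stack case under discussion.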

BD: One way to make a decision: pick some tools to adopt buildpacks and back into it. Then pick the packs those tools need. Stashbot is a good example. It’s Python3, and I’m happy to move stashbot over and be an early adopter. Generally python and php are the most common containers. But those may not be the users who most want to adopt things. Rather it might be those stuck on the grid.

DC: Not all buildpacks allow apt-get install. I think only some unofficial buildpacks allow that. We’ll have to see those use cases. Tried with my tool in python3, and was able to build and run it.

ABG: Beta this quarter?

DC: That was the plan, but it’s being delayed. Should be ready next quarter depending on how many buildpacks to support.

Next Toolforge meetings, cadence is right?

ABG: Is this a good format? Is this a good cadence? One meeting a month.

FN: So far we seem to have the right amount of items to discuss

TV: I like this current format and these are very useful

AB: The current system seems to be working well

DC: I find it useful. I would also welcome things like demos that show how to do things or how things work.

Discussion: Adopt an upgrade policy / cadence for Toolforge Kubernetes https://phabricator.wikimedia.org/T325196

NS: What’s the most important thing to focus on? Seems like cadence isn’t it

DC: Proposed something on the ticket as something to strive for, not something to adopt right away

TV: Also unsure about focusing on upgrade or tooling, but much of the upgrade is unblocked / ready. Should do the upgrade first.

DC: That’s more or less my proposal. Invest in upgrade tooling, but keep upgrading.

FN: Worth trying to write down a more specific plan about things that will happen / that we want to happen? Break down the actions we can take and then prioritize the tasks we commit to before taking an upgrade.

NS: Agree. I don’t understand enough to know if there’s something in between changing everything (that would allow us to upgrade easily without breaking commitments) that can be targeted before the next upgrade

DC: Start a specific effort to continue (or start if it's not the goal) working towards easy repeatable toolforge deployments, so we can redeploy toolforge environments easily (lima-kilo or otherwise, I'd try to use something we can use later to redeploy toolforge if needed, so maybe terraform+helm+lima-kilo as glue, or even cookbooks if needed for orchestration).

ABG: Unsure how to avoid kubernetes deprecating things. Upgrades usually require mitigations and changes. E.g., pod security policies: we knew they could drop this and they did. Also certs. No way to avoid it. Less about cadence, and more about doing work. Not sure how to avoid. Maybe over time stabilize, at least the core components. But as we use / grow k8s, we’ll adopt more things (e.g. tekton, flux, etc). This will introduce more lifecycle management.

DC: We can’t avoid this. But we can try detecting them early and allocating time for it. Being able to test on its own deployment, toolsbeta. Being able to dedicate time. Having a cadence could ensure allocated time to check and do work. Goal isn’t to upgrade every 4 months. But rather to ensure we are tracking early and investing in work early so we don’t get caught out and haven’t allocated time.

ABG: We have perfect examples. Pod security policies go away in 1.25. Major problem in 4 versions. So how do we handle that / plan time for that? Ideas about Open Policy Agent, research, dedicate time

NS: Seems like perhaps dedicating some time, defining problems, and then taking the pod security policy issue as an opportunity to learn by doing. Kubernetes will continue to be complex, and we don’t yet have a perfect answer for how to handle it.
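The idea of detecting removals early, raised in this discussion, could be sketched as a small check of the APIs in use against a table of known removals. This is only an illustration: the two entries are my reading of the removals mentioned in these notes (the old cert API before 1.22, pod security policies in 1.25), and the function names and input format are hypothetical.

```python
# Known removals, keyed by (apiVersion, kind), with the Kubernetes version
# in which the API is no longer served. Only the two cases discussed in
# the notes are listed; a real table would need to be kept up to date.
REMOVED_IN = {
    ("certificates.k8s.io/v1beta1", "CertificateSigningRequest"): (1, 22),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
}


def parse_version(text: str) -> tuple[int, int]:
    """Turn a string such as '1.22' into a comparable (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def upgrade_blockers(
    target: str, in_use: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return the APIs in use that are already removed at the target version."""
    wanted = parse_version(target)
    return [
        resource
        for resource in in_use
        if resource in REMOVED_IN and REMOVED_IN[resource] <= wanted
    ]
```

Run against a test deployment such as toolsbeta well ahead of an upgrade, a check like this would show which removals need time allocated.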