Toolforge Workgroup meeting 2023-05-02 notes

Participants

Andrew Bogott
Arturo Borrero Gonzalez
Bryan Davis
David Caro
Nicholas Skaggs
Seyram Komla Sapaty
Taavi Väänänen
Raymond Ndibe
Francesco Negri

Agenda

Meeting facilitator rotation
Toolforge stability issues https://phabricator.wikimedia.org/T335009
Creating APIs from swagger definitions:
- Toolforge builds api - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api/-/tree/first_commit?ref_type=heads
- Toolforge secrets api - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-secrets-api/-/tree/kubecon_between_sessions_push?ref_type=heads
Continuous deployment experiments - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/argocd_helm/README.md?ref_type=heads
Toolforge: consider replacing admission controllers with an existing policy admin project https://phabricator.wikimedia.org/T335131
May 8th buildpack beta

Notes

Meeting facilitator rotation

ABG: Open call to anyone who wishes to take over hosting duties, or to rotate them.

FN: Good to rotate, doesn’t need to be strict. Happy to help in the future. Use a volunteer basis for next time?

ABG: Ok, plan to ask for volunteers each time. Arturo will stay as default if there are no volunteers.

Toolforge stability issues https://phabricator.wikimedia.org/T335009

ABG: Last time we scaled down the grid. Did this cause issues?

AB: Two problems. Yes, scaled down, but also NFS locked up

DC: When NFS went down, some of the jobs started to get stuck and queue up. This increased load on the nodes. Not sure if the grid can be scaled down again. As of now, no processes seem stuck. On the k8s side, some k8s worker nodes got stuck on NFS too.

BD: Handful of pods stuck on an iowait loop. Hard to pinpoint cause.

ABG: NFS problems are usually created by the clients. Server might have a problem, but clients get stuck. I remember in the past there were different NFS client mount options to change behavior. This was introduced long ago, so client behavior was modified then. We should review the client options again.

BD: Had many periods in the past where NFS hiccups caused large cascading problems. Maybe the story is more that we’ve been lucky for a few years.

ABG: Did huge migration from hardware to VMs, a couple weeks of instability was a small price for that.

TV: https://phabricator.wikimedia.org/T257945 says that it was blocked on getting off stretch. We could try newer NFS version now

ABG: Let’s talk again next time and see what happens over the next month.

AB: Would like this to be transitional during the conversion. Don’t know why it happened in the first week, and then not. Agree with Taavi to look at package upgrades.

Creating APIs from swagger definitions

DC: During breaks in kubecon, played around with generating things with swagger. Tried to use it for build-api, seems to be ok. Some open questions, but seems nice. Has anyone ever used it before? Any questions?

TV: Found code kind of hard to follow and read

DC: Generated code is often not easy to read. Perhaps we should be clear about what’s written by hand versus what’s generated and can be ignored.

ABG: Is there any tool to generate code for python from a software definition?

DC: Should be. Personally wrote a tool long ago for python.

ABG: Would be useful to try generating one with python and see if it’s easier to follow / understand

DC: Used golang mostly for k8s api. Maybe with the client that doesn’t need golang

ABG: Investigated for toolforge jobs. Open phab ticket from back then to implement missing swagger definition. Would like swagger definition for jobs API. Won’t make sense to generate code for jobs framework, since it works

DC: Might still make sense. It generates the boilerplate. Might still make sense for parameter validation, etc.

ABG: One of the issues can be small inconsistencies between the API and the client side.

DC: Yes, it could help with validating those things
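
A minimal sketch of the kind of parameter validation a shared definition makes possible, so that server and client reject the same malformed requests. The schema fragment and field names below are hypothetical and are not taken from the real jobs API; generated code would normally do this against the full swagger definition.

```python
# Hypothetical schema fragment for a job-creation request; the field
# names are illustrative and not taken from the real jobs API.
JOB_SCHEMA = {
    "required": ["name", "command"],
    "properties": {
        "name": {"type": "string"},
        "command": {"type": "string"},
        "cpu": {"type": "integer"},
    },
}

PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate(payload, schema):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for field in schema["required"]:
        if field not in payload:
            errors.append(f"missing required field: {field}")
    for field, value in payload.items():
        prop = schema["properties"].get(field)
        if prop is None:
            errors.append(f"unknown field: {field}")
            continue
        expected = PYTHON_TYPES[prop["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and expected is not bool:
            errors.append(f"wrong type for {field}: expected {prop['type']}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {field}: expected {prop['type']}")
    return errors
```

Running the same check on both sides is one way to catch the small API/client inconsistencies mentioned above before a request is sent.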

FN: Played with this to generate a client based on swagger definition. Never shipped it, so limited experience. Need to find out how much work is saved versus cost to use codegen.

TV: If we control both client and server and single language, maybe not needed. If we want others to use it, then consistent versioning and clients for multiple languages is useful.

ABG: People asking to interact with jobs framework with API. But didn’t want to offer API as-is. Would love to be able to present a library and API for developers use.

BD: Toolhub has python code to generate spec, so things stay in sync. Worth looking into. Write server, spit out openapi spec, then feed into a generator that writes the client for you. Generated clients can be nice. The bootstrapping / boilerplate part is nice to have done.

BD: https://swagger.io/blog/api-development/automatically-generating-swagger-specifications-wi/ code generation is one of the main points of the OpenAPI specification
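
A rough sketch of the "write server, spit out openapi spec" direction BD describes: the server-side route table is the source of truth and the spec is derived from it, so the two cannot drift apart. This is not Toolhub's code; the route table, paths and function name are hypothetical, and a real project would more likely rely on its web framework's spec generator.

```python
import json

# Hypothetical route table for an imagined secrets API; none of these
# paths or fields come from the actual toolforge-secrets-api repo.
ROUTES = [
    {"method": "get", "path": "/secrets", "summary": "List secrets", "params": []},
    {
        "method": "get",
        "path": "/secrets/{name}",
        "summary": "Get one secret",
        "params": [{"name": "name", "in": "path", "type": "string"}],
    },
]


def build_openapi_spec(title, version, routes):
    """Build an OpenAPI 3.0 document (as a dict) from the route table."""
    paths = {}
    for route in routes:
        operation = {
            "summary": route["summary"],
            "parameters": [
                {
                    "name": p["name"],
                    "in": p["in"],
                    # OpenAPI requires path parameters to be marked required.
                    "required": p["in"] == "path",
                    "schema": {"type": p["type"]},
                }
                for p in route["params"]
            ],
            "responses": {"200": {"description": "OK"}},
        }
        paths.setdefault(route["path"], {})[route["method"]] = operation
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": paths,
    }


spec = build_openapi_spec("Example secrets API", "0.1.0", ROUTES)
# The serialized document is what would be fed to a client generator.
spec_json = json.dumps(spec, indent=2, sort_keys=True)
```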

ABG: Feeling of the room is yes, let’s use swagger and OpenAPI

FN: What’s the next thing that needs to be deployed / would be used by users? Would the secrets API be exposed?

DC: Two ideas. The two repos generated are those ideas. Secrets is new, nothing existing. Not blocking anything, so could push it more. Define and use it as the first implementation. But also toolforge build service API. Will have to build next quarter, could also be a good candidate. It’s complicated, might be harder. Might be interesting because tekton exports libraries for golang. Can use tekton objects directly. Potentially one of those two.

ABG: Agree. Prioritization wise, does this make sense to do now or later?

TV: Of the two, secrets feels simpler and safer to try.

Continuous deployment experiments

DC: Try to play with continuous deployment for toolforge. Not actively working on it, but played with it a bit.

ABG: gitops is taking over for serious k8s deployment. One tool is argocd, the other is flux. Gitlab has gitops integration. In the past, there was a k8s agent. Gitlab is migrating to flux. Embedding flux into the workflow?

DC: Wasn’t gitlab using tekton?

ABG: I knew they were using a gitlab k8s agent. Then moved to flux. Maybe tekton for pipelines but not integration?

ABG: Gitops for toolforge. The most challenging part feels like the docker image building side of things, especially for toolforge. Kubecon session demos often skip the container image building step. You start with an image in a registry ready to go

DC: Is that because there’s a problem building images in gitlab?

ABG: Need to figure out registry, credentials, pipeline, tagging, security, merge requests, etc. To get from the git repository on your laptop to a magical thing that builds in a container.

DC: I think PAWS has this sorted out on github

TV: Thinking about how to improve workflow and automating image building and deployment. Building an image and applying it to a cluster are two different problems. Most projects only focus on one of them.

TV: Automating image build can likely happen without adopting a tool. But probably not a good idea to do manually.

ABG: but PAWS doesn't have to build a docker container image, no? it uses upstream?

TV: PAWS uses custom images via github actions.

BD: PAWS has a whole set of custom images -- https://github.com/toolforge/paws/tree/main/images

BD: PAWS build automation is in github actions like https://github.com/toolforge/paws/blob/main/.github/workflows/renderer.yaml

FN: What did you actually do? Looking at the repo. Is something running?

DC: Have a local cluster via lima-kilo. Deploy yaml, and get argo to start up and see UI

DC: Demoing the UI of what you get

DC: For visualizing, my partner has used lens. Not open source, but likely a similar tool. https://k8slens.dev/

FN: There is also https://k9scli.io/ which is a more limited command-line alternative :) And this one is open source https://github.com/derailed/k9s

ABG: Thoughts on priority for this?

DC: Would love to see it done. Priority is harder. We have more and more APIs and are adding more. Would be nice to have

TV: Automating image builds is something we should explore soon. But adopting gitops tools is less important. Cookbooks are doing the same thing at the moment. Since there’s only two clusters and not bootstrapping clusters often, less of a deal

FN: If you don’t have a clean and reproducible system it’s harder. As we do it, should make deployment easy and reproducible first, then adopting a tool is a consequence

DC: Agree that’s the first step. Publishing helm charts. Even having everything in one repo and manually deploying would help.

TV: /me points to https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Kubernetes_component_workflow_improvements

DC: Need to figure out how to tag the images
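
One possible tagging scheme, sketched under assumptions: derive the tag from the branch name and the short commit id, so every automated build maps back to a commit. Nothing here was decided in the meeting; the function names and the registry host used in the usage note are hypothetical.

```python
import re


def image_tag(branch, commit_sha, short_len=8):
    """Derive a tag such as 'main-1a2b3c4d' from a branch and commit."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"not a hex commit id: {commit_sha!r}")
    # Image tags may contain only [A-Za-z0-9_.-], so replace the rest.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch).strip("-.")
    return f"{safe_branch or 'detached'}-{commit_sha[:short_len]}"


def image_reference(registry, name, tag):
    """Join registry, image name and tag into a pullable reference."""
    return f"{registry}/{name}:{tag}"
```

For example, image_tag("feature/argocd_helm", "deadbeefcafe") gives "feature-argocd_helm-deadbeef", which image_reference("registry.example.org", "toolforge/builds-api", ...) turns into a full reference.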

Toolforge: consider replacing admission controllers with an existing policy admin project

ABG: At kubecon, saw folks using kyverno. Similar to the custom admission controllers we have. Saw several sessions about using kyverno in demos and referring to it. More than OPA gatekeeper. As with everything, competing solutions, etc. Surprised by the simplicity. Created a ticket to discuss. By adopting a tool we could drop the custom admission controllers. For example, enforcing NFS mounts, bring your own container, etc.

DC: https://www.openpolicyagent.org/ https://www.kubewarden.io/ (mentioned all as alternatives)

FN: "in just a few lines of yaml" is the claim of so many tools, that in the end you always get too many lines of yaml :D

ABG: One of the demos was using the same enforcement of the registry as we have. We have a bunch of code to do that, and the demo was using 2 lines of yaml. So 1000 LoC could be replaced with 2 lines of yaml + a kyverno deployment.
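
For context, a simplified sketch of the core check behind registry enforcement: reject any pod whose images do not come from a trusted registry. This is neither the existing Toolforge admission controller nor a kyverno policy; the registry prefix and function names are hypothetical, and a real controller also has to handle the admission webhook protocol, TLS and so on.

```python
# Hypothetical registry prefix; substitute the one the cluster trusts.
ALLOWED_REGISTRY = "registry.example.org/"


def disallowed_images(pod, allowed_prefix=ALLOWED_REGISTRY):
    """Return the images in a pod manifest not pulled from allowed_prefix."""
    spec = pod.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    return [c["image"] for c in containers if not c["image"].startswith(allowed_prefix)]


def review(pod):
    """Mimic an admission decision: (allowed, message)."""
    bad = disallowed_images(pod)
    if bad:
        return False, "images from untrusted registries: " + ", ".join(bad)
    return True, ""
```

The rule itself is small; most of the hand-written code in a custom controller is the plumbing around it, which is the part a policy engine takes over.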

FN: I agree it's a very good idea to use something like this

TV: Corresponds well with pod security policy updates

ABG: pod security admission is the replacement for pod security policies. There’s a one to one match between them. Most of the policies we express now we should be able to translate.

DC: There’s a few things we won’t be able to do in the future. It disallows mutation; if we need that, use something else.

TV: PSP that can drop privileges, etc. This isn’t declared in the yaml manifests.

ABG: When we have to address this because of PSP going away, let’s consider changing.

May 8th buildpack beta

NS: Build service is “out” and readying for a wider announcement

TV: Is there any documentation yet?

SS: Have quickstart guide: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service/Quickstart#Known_Limitation/Common_Issues. Monday will be tight, but possible

DC: Focusing on docs this week

FN: What tasks are being focused on this week?

DC: https://phabricator.wikimedia.org/project/view/6508/

TV: Just looking at the docs. Starts from having something I can already run. But not sure how I can do that

BD: No sshd. Ubuntu upstream base image. Quickstart doesn’t mention any of those differences. It should, otherwise people will be left to figure them out on their own.

DC: There’s a task to create that page / content. Several differences versus regular images. Images could change in the future. Answering where’s my home, etc. Until it’s finalized, it’ll be in flux

TV: How confident are you this will be usable by someone unfamiliar with it by Monday?

DC: For a simple web application with no db and no storage: put the git URL there, it figures out it’s a flask app, builds the image, runs it and done. What might be needed is to add a procfile entry for port 8000, but that’s it. Anything beyond that will need more documentation. If there’s specific prose you want to see, let us know.

TV: Main concern: would rather delay the announcement than ship with incomplete documentation.

TV: Happy to try and share feedback. Can share source code

Action items

David - Keep pushing secrets swagger projects and share
- Test if python client can be generated
Experiment with building container images automatically, maybe gitlab (NEEDS phabricator ticket)
Decided to change admission controls as part of the PSP deprecation
David / Komla - finish docs for build service
David / Taavi - try one or more of Taavi’s projects using the build service