Toolforge Workgroup meeting 2023-05-02 notes
Attendees
- Andrew Bogott
- Arturo Borrero Gonzalez
- Bryan Davis
- David Caro
- Nicholas Skaggs
- Seyram Komla Sapaty
- Taavi Väänänen
- Raymond Ndibe
- Francesco Negri
Agenda
- Meeting facilitator rotation
- Toolforge stability issues https://phabricator.wikimedia.org/T335009
- Creating APIs from swagger definitions
- Continuous deployment experiments - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/argocd_helm/README.md?ref_type=heads
- Toolforge: consider replacing admission controllers with an existing policy admin project https://phabricator.wikimedia.org/T335131
- May 8th buildpack beta
Meeting facilitator rotation
ABG: Open call to anyone who wishes to take over hosting duties, or to rotate them.
FN: Good to rotate, doesn’t need to be strict. Happy to help in the future. Use a volunteer basis for next time?
ABG: Ok, plan to ask for volunteers each time. Arturo will stay as default if there are no volunteers.
Toolforge stability issues https://phabricator.wikimedia.org/T335009
ABG: Last time we scaled down the grid. Did this cause issues?
AB: Two problems. Yes, scaled down, but also NFS locked up.
DC: When NFS went down, some of the jobs started to get stuck and queue up. This increased load on the nodes. Not sure if the grid can be scaled down again. As of now, no processes seem stuck. On the k8s side, some k8s worker nodes got stuck on NFS too.
BD: Handful of pods stuck on an iowait loop. Hard to pinpoint cause.
ABG: NFS problems are usually created by the clients. Server might have a problem, but clients get stuck. I remember in the past there were different NFS client mount options to change behavior. This was introduced long ago, so client behavior was modified then. We should review the client options again.
BD: Had many periods in the past where NFS hiccups caused large cascading problems. Maybe the story is more that we’ve been lucky for a few years.
ABG: Did huge migration from hardware to VMs, a couple weeks of instability was a small price for that.
TV: https://phabricator.wikimedia.org/T257945 says that it was blocked on getting off stretch. We could try a newer NFS version now.
ABG: Let’s talk again next time and see what happens over the next month.
AB: Would like this to be transitional during the conversion. Don’t know why it happened in the first week, and then not. Agree with Taavi to look at package upgrades.
Creating APIs from swagger definitions
DC: During breaks in kubecon, played around with generating things with swagger. Tried to use it for build-api, seems to be ok. Some open questions, but seems nice. Has anyone ever used it before? Any questions?
TV: Found code kind of hard to follow and read
DC: Since this is generated code, it’s often not easy to read. It may help to be clear about what’s written by hand versus what’s generated and can be ignored.
ABG: Is there any tool to generate Python code from a swagger definition?
DC: Should be. Personally wrote a tool long ago for python.
ABG: Would be useful to try generating one with python and see if it’s easier to follow / understand
DC: Used golang mostly for k8s api. Maybe with the client that doesn’t need golang
ABG: Investigated for toolforge jobs. Open phab ticket from back then to implement missing swagger definition. Would like swagger definition for jobs API. Won’t make sense to generate code for jobs framework, since it works
DC: Might still make sense. It generates the boilerplate. Might still make sense for parameter validation, etc.
ABG: One of the issues can be small inconsistencies between the API and the client side.
DC: Yes, it could help with validating those things
FN: Played with this to generate a client based on swagger definition. Never shipped it, so limited experience. Need to find out how much work is saved versus cost to use codegen.
TV: If we control both client and server and use a single language, maybe not needed. If we want others to use it, then consistent versioning and clients for multiple languages are useful.
ABG: People asking to interact with jobs framework with API. But didn’t want to offer API as-is. Would love to be able to present a library and API for developers use.
BD: Toolhub has python code to generate spec, so things stay in sync. Worth looking into. Write server, spit out openapi spec, then feed into a generator that writes the client for you. Generated clients can be nice. The bootstrapping / boilerplate part is nice to have done.
BD: https://swagger.io/blog/api-development/automatically-generating-swagger-specifications-wi/ code generation is one of the main points of the OpenAPI specification
ABG: Feeling of the room is yes, let’s use swagger and OpenAPI.
FN: What’s the next thing that needs to be deployed / would be used by users? Would the secrets API be exposed?
DC: Two ideas. The two repos generated are those ideas. Secrets is new, nothing existing. Not blocking anything, so could push it more. Define and use it as the first implementation. But also toolforge build service API. Will have to build next quarter, could also be a good candidate. It’s complicated, might be harder. Might be interesting because tekton exports libraries for golang. Can use tekton objects directly. Potentially one of those two.
ABG: Agree. Prioritization wise, does this make sense to do now or later?
TV: Of the two, secrets feels simpler and safer to try.
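As a rough illustration of the parameter validation and API/client consistency points raised above, here is a minimal Python sketch. The spec fragment, the `/secrets` path, its parameters, and the `validate_params` helper are all hypothetical and do not describe any actual Toolforge API. A real setup would more likely rely on a code generator or an existing validation library than on hand-written checks like these.

```python
# Hypothetical OpenAPI-style fragment for an imagined secrets endpoint.
# None of these names come from an actual Toolforge API.
SPEC = {
    "paths": {
        "/secrets": {
            "post": {
                "parameters": [
                    {"name": "name", "required": True, "schema": {"type": "string"}},
                    {"name": "ttl", "required": False, "schema": {"type": "integer"}},
                ]
            }
        }
    }
}

# Map OpenAPI primitive type names to Python types.
TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_params(spec, path, method, params):
    """Return a list of human-readable problems; an empty list means valid."""
    declared = spec["paths"][path][method]["parameters"]
    errors = []
    known = set()
    for p in declared:
        name = p["name"]
        known.add(name)
        if name not in params:
            if p.get("required", False):
                errors.append(f"missing required parameter: {name}")
            continue
        type_name = p["schema"]["type"]
        expected = TYPES[type_name]
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        wrong_type = not isinstance(value, expected)
        bool_as_int = expected is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            errors.append(f"parameter {name} should be {type_name}")
    # Anything the spec does not declare is reported as well.
    for name in params:
        if name not in known:
            errors.append(f"unknown parameter: {name}")
    return errors
```

Because the same definition could drive both the server-side check and a generated client, a mismatch such as a renamed parameter would surface as a validation error rather than as a silent inconsistency.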
Continuous deployment experiments
DC: Try to play with continuous deployment for toolforge. Not actively working on it, but played with it a bit.
ABG: gitops is taking over for serious k8s deployment. One is argocd, the other is flux. Gitlab has gitops integration. In the past, there was a k8s agent. Gitlab is migrating to flux. Embedding flux into the workflow?
DC: Wasn’t gitlab using tekton?
ABG: I knew they were using a gitlab k8s agent. Then moved to flux. Maybe tekton for pipelines but not integration?
ABG: Gitops for toolforge. The most challenging part feels like the docker image building side of things, especially for toolforge. Kubecon session demos often skip the container image building step. You start with an image in a registry ready to go.
DC: Is that because there’s a problem building images in gitlab?
ABG: Need to figure out registry, credentials, pipeline, tagging, security, merge requests, etc. To get from the git repository on your laptop to a magical thing that builds in a container.
DC: I think PAWS has this sorted out on github
TV: Thinking about how to improve workflow and automating image building and deployment. Building an image and applying it to a cluster are two different problems. Most projects only focus on one of them.
TV: Automating image build can likely happen without adopting a tool. But probably not a good idea to do manually.
ABG: but PAWS doesn't have to build a docker container image, no? it uses upstream?
TV: PAWS uses custom images via github actions.
BD: PAWS has a whole set of custom images -- https://github.com/toolforge/paws/tree/main/images
BD: PAWS build automation is in github actions like https://github.com/toolforge/paws/blob/main/.github/workflows/renderer.yaml
FN: What did you actually do? Looking at the repo. Is something running?
DC: Have a local cluster via lima-kilo. Deploy yaml, and get argo to start up and see UI
DC: Demoing the UI of what you get
DC: For visualizing, my partner has used lens. Not open source, but likely a similar tool. https://k8slens.dev/
FN: There is also https://k9scli.io/ which is a more limited command-line alternative :) And this one is open source https://github.com/derailed/k9s
ABG: Thoughts on priority for this?
DC: Would love to see it done. Priority is harder. We have more and more APIs and are adding more. Would be nice to have
TV: Automating image builds is something we should explore soon. But adopting gitops tools is less important. Cookbooks are doing the same thing at the moment. Since there are only two clusters and we don’t bootstrap clusters often, it’s less of a big deal.
FN: If you don’t have a clean and reproducible system it’s harder. As we do it, should make deployment easy and reproducible first, then adopting a tool is a consequence
DC: Agree that’s the first step. Publishing helm charts. Even having everything in one repo and manually deploying would help.
TV: /me points to https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Kubernetes_component_workflow_improvements
DC: Need to figure out how to tag the images
Toolforge: consider replacing admission controllers with an existing policy admin project
ABG: At kubecon, saw folks using kyverno. Similar to the custom admission controllers we have. Saw several sessions using kyverno in demos and referring to it. More than OPA gatekeeper. As with everything, competing solutions, etc. Surprised by the simplicity. Created a ticket to discuss. By adopting a tool we could drop our custom admission controllers. For example, enforcing NFS mounts, bring your own container, etc.
DC: https://www.openpolicyagent.org/ https://www.kubewarden.io/ (mentioned all as alternatives)
FN: "in just a few lines of yaml" is the claim of so many tools, that in the end you always get too many lines of yaml :D
ABG: One of the demos was using the same enforcement of the registry as we have. And we have a bunch of things to do that. And the demo was using 2 lines of yaml. Roughly 1000 lines of code versus 2 lines of yaml plus a kyverno deployment.
FN: I agree it's a very good idea to use something like this
TV: Corresponds well with pod security policy updates
ABG: pod security admission is the replacement for pod security policies. There’s a one to one match between them. Most of the policies we express now we should be able to translate.
DC: There are a few things we won’t be able to do in the future. It disallows mutation; if we need that, use something else.
TV: PSP that can drop privileges, etc. This isn’t declared in the yaml manifests.
ABG: When we have to address this because of PSP going away, let’s consider changing.
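For context on the registry enforcement mentioned above, the core of such a check is small. The following Python sketch only mimics the decision logic and is not the code of the existing admission controllers or of any policy tool. The registry prefix and the `disallowed_images` and `review` helpers are hypothetical.

```python
# Hypothetical registry prefix, used here only for illustration.
ALLOWED_REGISTRY = "registry.example.org/"


def disallowed_images(pod, allowed_prefix=ALLOWED_REGISTRY):
    """Return the images in a pod manifest that do not come from the allowed registry."""
    spec = pod.get("spec", {})
    # Check regular and init containers alike.
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    return [c["image"] for c in containers if not c["image"].startswith(allowed_prefix)]


def review(pod):
    """Mimic an admission decision as a tuple of (allowed, message)."""
    bad = disallowed_images(pod)
    if bad:
        return False, "images not from allowed registry: " + ", ".join(bad)
    return True, ""
```

A policy engine would express the same rule declaratively and take care of the webhook plumbing, which is where most of the custom code tends to go.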
May 8th buildpack beta
NS: Build service is “out” and readying for a wider announcement
TV: Is there any documentation yet?
SS: Have quickstart guide: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service/Quickstart#Known_Limitation/Common_Issues. Monday will be tight, but possible
DC: Focusing on docs this week
FN: What tasks are being focused on this week?
TV: Just looking at the docs. Starts from having something I can already run. But not sure how I can do that
BD: No sshd. Ubuntu upstream base image. Quickstart doesn’t mention any of those differences, so people will have to figure them out on their own.
DC: There’s a task to create that page / content. Several differences versus regular images. Images could change in the future. Answering where’s my home, etc. Until it’s finalized, it’ll be in flux.
TV: How confident are you this will be usable by someone unfamiliar with it by Monday?
DC: For a simple web application with no db and no storage: put the git URL there, it figures out it’s a flask app, builds the image, runs it and done. What might be needed is to add a Procfile entry for port 8000, but that’s it. Anything beyond that will need more documentation. If there’s specific prose you want to see, let us know.
TV: Would rather delay the announcement than ship with incomplete documentation.
TV: Happy to try and share feedback. Can share source code
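To make the "simple web application" case concrete, here is a minimal sketch of the kind of app being discussed. It is not taken from the quickstart. The notes mention a flask app; this sketch uses only the Python standard library as a stand-in so that it has no dependencies. The `PORT` override, the file name, and the Procfile entry shown in the comment are assumptions.

```python
import os
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Smallest possible WSGI application: answer every request with plain text."""
    body = b"Hello from a tool\n"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def get_port(env=os.environ):
    """Port to listen on. 8000 is the value from the notes; the PORT override is an assumption."""
    return int(env.get("PORT", "8000"))


def serve():
    """Run the app until interrupted.

    A hypothetical Procfile entry would start the process, for example:
        web: python -c "import app; app.serve()"
    """
    with make_server("0.0.0.0", get_port(), app) as server:
        server.serve_forever()
```

Anything beyond this shape, such as a database or persistent storage, falls into the territory the notes say still needs documentation.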
Action items
- David - Keep pushing secrets swagger projects and share
- Test if python client can be generated
- Experiment with building container images automatically, maybe gitlab (NEEDS phabricator ticket)
- Decided to change admission controls as part of the PSP deprecation
- David / Komla - finish docs for build service
- David / Taavi - try one or more of Taavi’s projects using the build service