Portal:Toolforge/Admin/Monthly meeting/2023-01-31


Toolforge Workgroup meeting 2023-01-31 notes

Participants

  • Andrew Bogott
  • Arturo Borrero Gonzalez
  • Bryan Davis
  • David Caro
  • Nicholas Skaggs
  • Seyram Komla Sapaty
  • Taavi Väänänen
  • Vivian Rook
  • Kunal Mehta

Agenda

Notes

  • Thanks Nicholas for capturing notes!

pywikibot @ toolforge kubernetes

AB: Requires a Python container image for pywikibot, and bootstrapping a venv as well. Bryan compared running pywikibot on the grid vs how it can be run in toolforge jobs. This will improve once buildpacks are present / possible. Could create a specific pywikibot image; a docker pywikibot image could happen today via https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/. Requires thinking about maintenance and deprecation policies for old pywikibot source code.
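As a rough sketch of what such an image could be, it is little more than a pip install on top of an existing Toolforge Python base image. The base image name below is an assumption, the toollabs-images repo uses its own templating and build scripts rather than a bare Dockerfile, and the unpinned install is only illustrative:

    # Hypothetical pywikibot image for Toolforge (sketch only, not the repo's real build format)
    FROM docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
    # Install a tagged pywikibot release system-wide; nightly stays out of scope
    RUN apt-get update && \
        apt-get install --yes --no-install-recommends python3-pip && \
        pip3 install --no-cache-dir pywikibot && \
        rm -rf /var/lib/apt/lists/*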

NS: Do we really need to wait for buildpacks? Can we do it now? https://phabricator.wikimedia.org/T249787

TV: 2 different use cases with 2 different solutions. Custom scripts -> buildpacks.

DC: Buildpacks, once allowed, will unlock custom scripts. What about directing users to a shell, rather than logging in to a bastion? Could allow running a shell on a pywikibot environment directly.

DC: Can run a shell directly on a container: open a shell on a container image that has pywikibot, then run commands. The idea was to remove the login to bastions and instead connect directly to a shell.

Legoktm: What overlap exists with PAWS? PAWS is easier than setting up toolforge tools. Should we redirect some use cases there? Putting pywikibot into a container in the toollabs-images repo shouldn't be more work than PAWS.

DC: Question about PAWS. Can you start a shell from it? Give access to scripts in the shell? (Answer: yes)

BD: PAWS doesn't currently provide a solution for long-running jobs. Pywikibot use cases seem to need repeated runs. Current solutions are difficult.

FN: Can we simplify things? Adopt a more heroku-style one-liner that allows for scheduled pywikibot runs.

AB: Two ideas. 1. Create a pywikibot container. 2. Bootstrap a pywikibot venv. This should be doable today. Might be technical debt? But doable.

Legoktm: Pywikibot is staying; we adapt to a container, but leave it to the pywikibot team to deal with backwards compatibility.

TV: pywikibot + container will unlock a simple one-liner for toolforge jobs
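For context, the kind of jobs-framework one-liner being hoped for might look roughly like this; the tf-pywikibot image alias is hypothetical and does not exist yet, and the pywikibot command and schedule are only examples:

    toolforge-jobs run pwb-daily \
        --command "pwb.py touch -page:Project:Sandbox" \
        --image tf-pywikibot \
        --schedule "0 3 * * *"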

AB: Would it be valid for users to run a different version of pywikibot on every execution?

BD: That’s what happens today with the shared checkout and grid engine. There’s no guarantee that the same or compatible version will launch when your script runs

AB: So create a docker image, upgrade it every month, and see what happens? We'll have this until buildpacks arrive.

Legoktm: There are two flavors of the shared pywikibot install, stable (tagged releases) and nightly. And nightly is usually stable enough for people to use (especially those from the SVN era where you pulled every day).

AB: Should both be replicated inside docker?

Legoktm: No, just do stable. If they want to use nightly, they are on their own.

DC: Is the proposal to build images in docker and make scripts?

AB: First, create a pywikibot image. Second, bootstrap a Python venv for use without the docker image. It's optional. We could document the steps and leave it to users, or create some kind of shortcut. The first step is the easy part: create an image and make it available for toolforge jobs.

DC: This seems to overlap a lot with toolforge cli and toolforge build service. Could be confusing?

AB: So just the docker image?

DC: Yes, but then think about how the build service and cli interact. Figure out how to make it work

BD: There are some conflicting requirements here. Nicholas is trying to stay focused on shutting down the grid, but in pursuit of that there are multiple projects going on. Worried we might be creating extra changes. We know some of these are intermediate solutions. Be mindful of asking the community to change.

Andrew: We could say: don't migrate off grid engine yet. I continue to think there's low-hanging fruit that can move today. If this keeps happening, should we pause the migration?

TV: are buildpacks really the correct solution for people who just want to run a script included with the pywikibot source code?

DC: Is that PAWS instead?

FN: PAWS + something for scheduling (either inside PAWS or separate)

BD points to https://blog.jupyter.org/introducing-jupyter-scheduler-f9e82676c388

Andrew: Continuous things should be cron jobs. Otherwise, yes, PAWS.

TV: But do you need a custom container if you just need to run something packaged?

Andrew: But we can also provide custom code, even if it's not from the user. Buildpacks still facilitate that. Would be nice to have only a single workflow for containers.

AB: The beta phase is happening this quarter; perhaps we do nothing, then?

FN: Would the user experience be different between a build-service-built image we provide vs a non-build-service image (a docker image, like above)?

Andrew: Small thing, but even telling users to stop their tool and use a different image contributes to migration fatigue.

VR: The basic problem is that toolforge is too rigid in what it allows; as such, deps and pywikibot versions can cause problems for people. Meanwhile PAWS is deemed more flexible in this regard?

TV: If we’re just changing one thing, we could automate it

AB: So are we going to wait for buildpacks?

Andrew: Proposal: No code, some docs. Recommend waiting

DC: Goals for the beta do not include supporting pywikibot. The beta will allow you to run buildpacks, but delivering a pywikibot image won't be part of it.

AB: python buildpack?

DC: Upstream python buildpack, custom still in discussion

AB: If the python buildpack is included, and the tool account has the pywikibot code, that will work, right?

DC: In theory yes, but it's not tested / supported for the beta. You can do it, but it's not a goal of the beta.

T325196: Toolforge Kubernetes upgrade cadence

NS: will create a proposal. Nobody seems to have strong opinions.

FN: Can we do upgrades back to back?

TV: In theory yes, but you need to test and verify

AB: Toolforge uses complex k8s APIs that change often: the Ingress API, the PodSecurityPolicy API that is now deprecated, etc.

FN: What are we trying to optimize for? What’s the objective?

Andrew: Staying on supported versions to get security support

FN: Could do once a year if that's the objective. So what's the most time-efficient way to stay on a supported version?

TV: Yearly upgrade would keep you in support

T296729: Support ASGI on Toolforge

Legoktm: Looked into ASGI support. Should be simple to implement in the current framework: add 'generic' support and allow the user to specify a command. That's all we need if the container has python and the user creates a venv. No one replied on phabricator, so didn't implement it. Could be as simple as an hour of work using the 'generic' idea.

DC: Be careful in how we allow specifying the command. Concern about buildpacks and breaking things later. Upstream uses a Procfile and reads the command from it. Do we want to stay compatible with upstream?

BD: webservice has a semi-standard way to do it today. Can we keep it?

BD: webservice can specify a startup command for generic containers. We don't have a python + generic container today. If you use a golang container, you have to specify the command. If you enabled the same for python, it would allow you to do something similar. It's a new thing. If we know that it will align with buildpacks, then yes. If we're unsure, then I would be less convinced.

DC: Upstream has standardized on Procfiles. Everyone uses them.
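For reference, the upstream convention is a Procfile at the root of the source tree mapping a process type to its start command. A minimal ASGI example might look like the line below; the app:app module path and the choice of uvicorn are placeholders, not a supported Toolforge interface today:

    web: uvicorn app:app --host 0.0.0.0 --port 8000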

TV: Much prefer waiting for buildpacks rather than adding special cases to the webservice code.

BD: We can also wait to see if anyone besides Fastly gives +1 to his proposal to make this a wishlist task

Legoktm: Already technically possible using the golang image or similar via hacks. So maybe if we want to wait, use the hack. It's not officially supported but will work today.
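The hack alluded to is, roughly, abusing one of the existing generic images to launch an arbitrary command. The invocation below is an unverified sketch under that assumption; the image name, venv path, and app module are placeholders, not a supported recipe:

    webservice --backend=kubernetes golang111 start \
        "$HOME/www/python/venv/bin/uvicorn app:app --host 0.0.0.0 --port 8000"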

lima-kilo https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo

  • Future use cases? Next steps?

AB: Created a few months ago. Problem statement: developing a bunch of code for toolforge requires a local k8s on a laptop. Toolforge isn't a standard k8s deployment; it requires custom components, RBAC config, etc. Getting it right is hard. Previously lots of READMEs and scripts were required to create an environment. Impossible to replicate the setup between developers. Inspired by the mediawiki-vagrant project to have a consistent setup. Translated shell scripts into ansible playbooks. Ansible is flexible enough to translate across OSes and environments. Could standardize how we make a local k8s deployment for working with toolforge. There's an extension for vagrant to read a playbook and create a VM with it.
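On that last point, the standard Vagrant Ansible provisioner can already do roughly this. A minimal sketch, where the box name and playbook path are placeholders and not necessarily how lima-kilo is actually wired up:

    Vagrant.configure("2") do |config|
      config.vm.box = "debian/bullseye64"
      # Run the playbook inside the freshly created VM
      config.vm.provision "ansible_local" do |ansible|
        ansible.playbook = "playbook.yml"
      end
    end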

FN: Really like the idea. Did you try to compress everything into a helm package and let everyone manage any k8s they want? I've seen examples of projects using helm one-liners on any k8s cluster. Could remove complexity.
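The kind of one-liner FN is describing is presumably something like the following, where the chart location is purely hypothetical (no such published Toolforge chart exists today):

    helm install toolforge oci://example.org/charts/toolforge \
        --namespace toolforge --create-namespace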

AB: A big helm chart won't work. Toolforge requires physical mounts, creating directories on physical machines, etc. For example, canonical directories are assumed for tools, using hardcoded paths. Not easy to do with helm.

DC: What if instead of working around those limitations we change toolforge to allow for more configurability? We are consolidating towards helm. A helm package would help. Ideally only minimal setup would be required.

AB: Part of lima-kilo is generating RBAC config. That requires an LDAP directory we don't have in a local deployment. So the project tried to fake everything maintain-kubeusers is doing, get to the basics of toolforge, and replicate that on the laptop. The intention isn't to replicate production toolforge in a local setting. Maybe have an LDAP directory.

VR: While helm is not my favorite, a lot of these problems are resolved (mostly) with helm in PAWS. The local setup is deployed "the same" as the production setup, though they still behave a little differently in some areas. The work is ongoing, but it is largely about removing things that aren't typical k8s things.

DC: Do we need to change lima-kilo when webservice changes? Are we adding external dependencies on lima-kilo in other projects? Can we evolve it somehow? For example, can we evolve it to deploy in eqiad or codfw, rather than deploying locally?

TV: Long-term goals are good. lima-kilo is useful; thank you.

FN: I think lima-kilo is pragmatically very useful right now yeah

Action items

  • None in particular. Nonetheless, the debates were useful and valuable.