Toolforge Workgroup meeting 2023-07-11 notes

Participants

Andrew Bogott
Arturo Borrero Gonzalez
Bryan Davis
David Caro
Taavi Väänänen
Raymond Ndibe
Slavina Stefanova
Francesco Negri

Agenda

K8s infra rough edges (prompted by https://phabricator.wikimedia.org/T340844)
Expand the toolforge build service workgroup to include toolforge related work (see https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_Build_Service)
Second round of buildservice
Toolforge deploy status

Notes

K8s infra rough edges

DC: Magnus found some things not working as expected in toolforge-jobs, which led to some bugs being fixed

DC: Are the quotas big enough?

TV: Quotas not implemented https://phabricator.wikimedia.org/T333979

TV: we are not fixing bugs in older projects

ABG: there is a ticket somewhere about using GitOps to automate quotas, that could be a more interesting project to work on https://phabricator.wikimedia.org/T324558

TV: that could be implemented in maintain-kube-users

ABG: where do you define the quotas? In a yaml or a config map?

TV: the defaults could be hardcoded in code

ABG: if you are injecting into k8s via Helm, why not inject all quotas (both defaults and overrides) using Helm charts, without using maintain-kube-users?

TV: you cannot configure Helm to automatically create an object in all namespaces including new ones

DC: we need the separation between defaults and overrides, otherwise it’s hard to understand what is being overridden
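
A minimal sketch of the defaults-plus-overrides separation discussed above. It is not the actual maintain-kube-users code: the quota keys, values, tool name, and function names are all hypothetical, chosen only to show how keeping the two layers apart makes it easy to see what is being overridden.

```python
from typing import Dict, List

# Hypothetical defaults, hardcoded in code as suggested in the discussion.
# Keys and values are illustrative assumptions, not Toolforge's real quotas.
DEFAULT_QUOTA: Dict[str, str] = {
    "limits.cpu": "2",
    "limits.memory": "8Gi",
    "pods": "10",
}

# Hypothetical per-tool overrides, stored separately from the defaults
# (for example loaded from a YAML file or a ConfigMap).
OVERRIDES: Dict[str, Dict[str, str]] = {
    "example-tool": {"limits.memory": "16Gi"},
}


def effective_quota(tool: str) -> Dict[str, str]:
    """Return the defaults with the tool's overrides applied on top."""
    quota = dict(DEFAULT_QUOTA)  # copy, so the defaults are never mutated
    quota.update(OVERRIDES.get(tool, {}))
    return quota


def overridden_keys(tool: str) -> List[str]:
    """List the quota keys whose value differs from the default for this tool."""
    return sorted(
        key
        for key, value in OVERRIDES.get(tool, {}).items()
        if DEFAULT_QUOTA.get(key) != value
    )
```

With this layout, a tool that has no entry in the overrides simply gets the defaults, and `overridden_keys("example-tool")` answers the "what is being overridden" question directly.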

BD: we have some precedents in the past, in the grid engine there was a “magic directory” to change the quotas of individual tools

DC: what about not fixing bugs

TV: we don’t follow phab projects and don’t really prioritize bug reports, and they get lost

FN: what about having a community wishlist for the Toolforge community, to help us prioritize.

ABG: we were discussing something related in the WMCS team, about how we handle other tasks apart from the quarterly goals. The moment I was focusing on a quarterly goal I stopped paying attention to other activities. I know I’m at the center of many projects, e.g. the jobs framework, and I could follow up on some of the bugs and requests, but I don’t have the time. I cannot work on many major projects at the same time; also, moving from network tasks to Toolforge is a massive context switch.

TV: we should consider having a triage for all new issues. Does it need immediate attention or can it wait until later? About the idea of a community survey: don’t we ask there about issues and requests?

BD: traditionally we had a free-form question on the annual survey, like “what’s the most important thing WMCS should be working on?”

ABG: I believe the missing features and bugs are already in Phabricator

DC: I agree, but the last survey was very long and maybe people dropped out before finishing it. It could still indicate which phab tickets we should look at more closely. A long time ago we used to do a round of meetings with community people (probably in person)?

AB: We sometimes had hackathon sessions asking “what do you think is wrong?”

DC: Maybe we could have “community hours” when the community can show up

BD: I can’t speak directly to resourcing, but I made a conscious decision when I was a manager of WMCS not to put out a broad question to the community, because having a giant list and no one to work on those was demotivating. It’s not a bad idea to ask, but it’s something to be careful about. Consider if there’s likely an ability to change direction if the community asks for a different direction

TV: These proposals are about “asking once in a while”, but this is not ideal when we’re talking about “feature X is broken”. We should have a way to notice it right away and not in three months.

ABG: Replying to Bryan: maybe we just don’t have enough manpower to work on everything. It’s another prioritization exercise, how do we do that?

FN: We could separate urgent bugs and feature requests. Are we getting bug reports quickly enough, and are we looking at them?

AB: we’re changing lots of things and until we have a product that’s complete and working, it’s ok to say “when the new product is done, then is the time to fix the bugs”

DC: I think we still want to have a look from time to time. In a similar way to what we did for the build service, we could look at all toolforge tasks every two weeks and choose the ones to work on.

BD: There’s a thing that I used to do and I don’t do as much anymore. There is a shared phabricator search “WMCS extended backlog”, I used to look at this every day, and it shows things that are vaguely related to Toolforge by “last touched time”. Maybe we could start a practice of doing a periodic triage.

FN: it would be nice to have someone like Karen to help with these activities

DC: afaik we won’t have anyone in the near future

BD: we also had contractors doing part-time support, for 6 months it was cool to have Chico around, and his job was to look at IRC, etc., doing both first-tier support and escalating to Arturo or others. It started because some questions similar to these were being discussed in the team.

DC: that was mentioned in the last team meeting

AB: Maybe we can steer Komla into doing more of that role. Actively engaging with Toolforge users is the way to know what’s going on, and I’d encourage Komla to do more development work.

DC: In the meantime, I’ll try to update a few things in the Toolforge Working Group, and set up some kind of proposal. It could be just me going through the tasks once a week, and then we rotate the role. Similar to what we do right now with the Build Service, to make sure we select a list of tasks we can actually work on instead of having a long list of things.

Second round of buildservice

DC: envvars to store DB credentials

DC: vacations incoming, better wait until after the vacations

BD: it is OK to wait until after the vacations

Toolforge deploy status

ABG: I’m not sure everyone is aware of everything going on and of the changes in how to deploy Toolforge components. Perhaps we can take a few minutes to review the status of things and what we are doing next. Thank you David, you did amazing work integrating CI/CD, which will allow us to introduce GitOps and fancier things in the future.

DC: Diagram of the current flow

ABG: environments are basically free, we could have a dedicated env for lima-kilo. Often we have a conflict with the local one (used for development without lima-kilo)

DC: I think that would be ok

ABG: The other question is, what about the artifacts? What if we need to roll back one version? Is the previous version still in the repo? How is the cleanup working?

DC: As of right now, in Harbor we are not deleting the “production-type” images, we are only deleting the “merge request-type” images. Right now we have all the tags from the previous “production” commits. I was thinking of setting up some process to only keep the last 5 or something, but I haven’t done it yet.

ABG: It happened to me that I was using a chart image and it disappeared from the Harbor repo. This is not a big deal; if it happens a lot, we can rethink the cleanup frequency.

DC: You can always re-run the publish job in a merge request; that will rebuild the image and chart and publish them again. The garbage collection runs weekly and cleans up the old images that don’t have a tag anymore. There is also a policy in each project that, in theory, keeps the last 5 “merge request” images. Maybe something happened; we can investigate and tweak the policy.
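
A minimal sketch of the retention rule described above: keep every “production” image and only the newest 5 “merge request” images. This is an illustration of the rule, not Harbor’s API or the actual policy configuration; the `Image` type, the `kind` labels, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Image:
    tag: str
    kind: str       # hypothetical labels: "production" or "merge-request"
    pushed_at: int  # push time as a Unix timestamp


def images_to_delete(images: List[Image], keep: int = 5) -> List[Image]:
    """Return the merge-request images that fall outside the newest `keep`.

    Production images are never selected for deletion.
    """
    merge_request_images = sorted(
        (image for image in images if image.kind == "merge-request"),
        key=lambda image: image.pushed_at,
        reverse=True,  # newest first
    )
    return merge_request_images[keep:]
```

Passing a different `keep` value is one way to experiment with the cleanup frequency question raised by ABG; applying the same function to production images would correspond to the “only keep the last 5 or something” idea, which DC notes has not been set up yet.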

ABG: Also related to the deployment step: you mentioned SSHing to a control node and running a command. I think the cookbook works with the new setup; you just need a different URL and command.

DC: We can change the default params in the cookbook to use the new deployment values

TV: We can make the deploy repo the default.

DC: We still have to figure out secret management, but the only one that needs secrets is the build service at the moment. The builds API will need Harbor authentication soon; Raymond is working on it.

DC: you can re-run the build until the merge request is merged. Once an image is published it cannot be rewritten, and you need to push another commit.

BD: the toolforge-deploy repo needs a LICENSE file

DC: we can use GPLv3

BD: it’s a good habit to put a README and LICENSE whenever you create a new repo

ABG: remember that most of the new repos are now in GitLab

Action points

David Caro to write a proposal on how to triage Toolforge-related tasks