Toolforge Workgroup meeting 2023-04-04 notes
Attendees:
- Andrew Bogott
- Arturo Borrero Gonzalez
- Bryan Davis
- David Caro
- Nicholas Skaggs
- Seyram Komla Sapaty
- Taavi Väänänen
- Raymond Ndibe
Agenda:
- Allow NFS on the buildpack-based tools or not
- Toolforge stability issues https://phabricator.wikimedia.org/T333922
DC: Building the Toolforge build service, with buildpacks. Originally planned to build without NFS support and migrate away from NFS at the same time, which implies building infrastructure to remove the need for NFS. Over time that original idea has morphed into adopting replacements incrementally. As of now there is no place to put logs, for example; Loki / rsyslog and other ideas exist, but nothing is set up yet. So, do we want to go slow, with multiple migrations as NFS needs are replaced, or do everything in one migration?
TV: Limit NFS needs in buildpacks if possible. Can we limit to just logs for example?
ABG: There’s an admission controller for every Toolforge tool, right? It’s agnostic to the container. How do buildpacks affect this? NFS config is injected outside of buildpacks.
TV: When a pod is created, you can specify whether the pod should have NFS mounts or not. So the question for buildpacks is whether it should specify them or not.
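TV's point above can be sketched as a pod spec that only carries NFS volumes when asked to. This is a minimal illustration with hypothetical server and path names, not the real config Toolforge's admission controller injects:

```python
# Sketch: a pod spec that includes NFS volumes only when requested.
# Server name and paths are hypothetical placeholders.
def build_pod_spec(tool_name: str, mount_nfs: bool) -> dict:
    """Return a minimal pod spec dict; NFS volumes only when mount_nfs is True."""
    spec = {
        "containers": [{
            "name": tool_name,
            "image": f"tool-{tool_name}:latest",
            "volumeMounts": [],
        }],
        "volumes": [],
    }
    if mount_nfs:
        # Mirror the tool's NFS home at the same path it has today.
        spec["volumes"].append({
            "name": "home",
            "nfs": {"server": "nfs.example.invalid",  # hypothetical server
                    "path": f"/data/project/{tool_name}"},
        })
        spec["containers"][0]["volumeMounts"].append(
            {"name": "home", "mountPath": f"/data/project/{tool_name}"})
    return spec
```

Whether buildpack-built pods call this with `mount_nfs=True` is exactly the "should, not could" question below.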
DC: It’s a question of “should” not “could”
BD: Glad people are thinking about how to limit user fatigue and impact. What are our priorities? If we add more requirements for buildpacks, will that slow down our implementation? If we delay to build out secrets, and log management, and durable storage, etc…
AB: Some future storage wouldn’t have to be dealt with by the user, right? As of now, logs are just magically written. If the location moved to logstash, we would tell them, but users wouldn’t have to do anything, right?
DC: In most scenarios yes, but not all tools log to stdout.
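For tools that currently write their own log files to NFS, the migration DC describes would mean switching to stdout so a cluster-level collector (Loki, rsyslog, etc.) can pick the logs up. A minimal sketch of that change, using only the stdlib (the tool name is hypothetical):

```python
import logging
import sys

def make_stdout_logger(name: str) -> logging.Logger:
    """Logger that writes to stdout instead of a per-tool file on NFS."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Previously this might have been logging.FileHandler("~/mytool.log");
    # a stream handler on stdout lets the platform collect logs instead.
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Tools already logging to stdout need no change; the edge cases TV mentions below are the ones managing their own `.log` / `.err` files.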
BD: Some tools show their own logs today, which would break them
AB: Having synced things, most tools have a .err, .log, etc
TV: We can automate. There will be edge cases, but shouldn’t be too many
DC: Secrets might be easiest / most transparent for users. Upload into k8s, mount it. Some kind of home volume, mounted in the same path as NFS today as well?
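DC's idea of uploading a secret into k8s and mounting it where the file lives under NFS today could look roughly like the fragment below. All names (secret name, `.secrets` path) are hypothetical placeholders, not an agreed design:

```python
# Sketch: pod-spec fragment mounting a k8s Secret at the tool's usual path,
# so tool code keeps reading the same file it read from NFS.
def secret_volume(tool_name: str, secret_name: str) -> dict:
    """Return volume + mount entries exposing a Secret under the tool's home."""
    return {
        "volumes": [{
            "name": "tool-secrets",
            "secret": {"secretName": secret_name},
        }],
        "volumeMounts": [{
            "name": "tool-secrets",
            # Same path the file had under NFS, so the change is transparent.
            "mountPath": f"/data/project/{tool_name}/.secrets",
            "readOnly": True,
        }],
    }
```

Because the mount path stays the same, this is the "transparent for users" property that makes secrets the easier of the two migrations.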
ABG: If we have that level of transparency, could we do that next year without breakage?
DC: Yes, for those two. Logs harder. Cross-tool NFS sharing would be broken. /scratch could still be there
AB: For the most part, tools write to the directory mounted for them. Most shouldn’t notice / care that it changes to private
BD: Magnus does lots of cross-reads
AB: Ok with saying we don’t support that
BD: Taking the opposing view, we’re here to serve the wikis. Arbitrarily breaking lots of tools isn’t helping.
AB: It’s more that power users are power users. They should be able to be responsive as well. Exploring edge cases needs more attention. Lots of users that followed the tutorial we wouldn’t want to break. Not proposing to break tools and throw them away.
BD: I mention this as it’s not a good assumption someone like Magnus would be willing to maintain / update tools that would be broken.
DC: Buildpacks aren’t mandatory. Not sure if we would ever force adoption of buildpacks. Existing tools don’t have to break.
TV: Buildpacks will require everyone to re-think their tools anyway
ABG: We’re not breaking anything. This is additive. We’re adding something. Not reducing previous functionality.
BD: Yes, and removing the job grid. We believe buildpacks are the last blocker to that removal. Scoped to k8s, yes. Scoped to all of toolforge, maybe not
TV: We’re pushing people off the grid before the replacement is ready
AB: Want users to be using buildpacks soon. Need user input. Want a beta soon, can call it beta. Would rather have a finished product, but don’t want to delay users trying this. Want to see users using and finding value in the product / idea. Going to be incremental no matter what as we can’t anticipate all needs. That might mean NFS will be with us for a while.
TV: If we want to release buildpacks soon, could we release without logging or secrets support?
AB: Seems like a bad idea as we won’t know what’s going on. I like the thinking though!
DC: One of the things being unblocked by buildpacks is multi-stack support. We might want to delay, but really we don’t want to delay migrating from the grid.
ABG: Buildpacks with NFS support would be more incremental. Insert one extra command into the workflow, right? `toolforge build`. Just one more command seems like a reasonably incremental change to the user workflow. What is the drawback / problem with this incremental approach?
DC: Users will have to do some small code changes. And put them into git :-) So changes would go beyond just running a command.
TV: Don’t see incremental new features as bad. But immediately dropping a feature and replacing it is tiring.
FN: When could we have a replacement for logs / secrets? Planning a second changeover off NFS in 6 or 12 months might be ok.
DC: Secrets I think would be easier; logs might be more complicated. I don’t know enough to estimate right away. Nicholas might be able to clarify how much time we will be able to dedicate to it.
AB: From an engineering standpoint, NFS is terrible. But from maintenance / user standpoint NFS is fine right? It just mostly works. Not as anxious about phasing out NFS. Would like a proper log solution and secrets solution. But it’s not to eliminate NFS, but rather to provide better features. Is there a need to get rid of NFS beyond desire? A unique thing we offer is collaboration. NFS promotes sharing. We would lose this in more containerized solutions.
DC: NFS might help us get away from LDAP. Though it’s not been as painful lately. That’s further off.
ABG: Don’t think containerized solution for logs / source code would prevent sharing. For example. Logging solution with RBAC, could allow sharing and access to others.
AB: NFS isn’t essential for that, yes. But NFS gives it for free. Don’t love NFS, but it’s an upside for NFS.
ABG: NFS is causing zero problems at the moment.
AB: So far yes 🙂
ABG: Lots of upstream docs on NFS + k8s. Not alone in combining those technologies. Lots of cloud native replacements for NFS things that we should implement at some point.
DC: NFS + K8s is widely used. Ceph NFS for example. Maybe we can shift at some point.
DC: So, answering the question: for the buildpacks beta, should we integrate with NFS or not? Proposal: yes for now, but with an eye to moving logs / secrets off NFS as soon as possible.
TV: Yes. I wonder if we could make tool specific config for those who need NFS. Like for example, if tools need logs only, mount just logs, not all of NFS. Maybe an implementation detail. Seems we need some form of NFS for now.
BD: If we could easily figure out how to add to web service / job service to opt out of mounting $HOME, that would be useful data
TV: Happy to spend time on trying to figure out what that would look like. Maybe like a feature flag.
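TV's feature-flag idea for BD's opt-out could look like the argparse sketch below. The flag name `--no-filesystem` is an assumption for illustration, not a shipped option of `webservice` or the jobs framework:

```python
import argparse

# Hypothetical CLI flag letting a tool start without NFS ($HOME) mounts.
# The flag name and default are assumptions, not an existing interface.
def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="webservice")
    parser.add_argument(
        "--no-filesystem",
        dest="mount_nfs",
        action="store_false",
        default=True,  # NFS mounted by default, preserving current behavior
        help="start the pod without NFS ($HOME) mounts",
    )
    return parser.parse_args(argv)
```

Defaulting to mounting keeps existing tools unaffected while letting volunteers experiment, which is the "useful data" BD asks for.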
AB: Changing topic to Toolforge stability issues. In the last two switch operations we had a Toolforge outage. In the first outage, etcd went down and Calico decided to reconfigure itself. The second outage happened during server migration. And there was one more minor outage today, maybe from being low on CPU/RAM. We should again downscale the grid and upscale k8s; we did this a few months ago in Feb, downscaling 5 or 6 nodes on the grid and adding them to k8s. Can we create a new etcd flavor with local storage plus more CPU and RAM?
AB: Dedicated hardware for etcd are racked and almost ready
DC: When you say scale, is that changing the flavor of worker nodes, or just adding more?
ABG: Two flavors already; the new ones we are adding are larger. The size of the workers is correct, but the control nodes are 2 CPU / 4 GB RAM, which seems small. Etcd nodes are similar. Not reasonable for the size of the cluster.
NS: We just had a wave of migrations. Useful to be mindful to do this when that happens
TV: For workers, don’t need to update the old ones. But in the future have to move from docker to containerd and have to touch all of them. Would be a good time to update
BD: A long time ago Brooke and I would judge based on the load on the grid. When the grid got to 50% utilization or so we would downsize it and add the same compute to k8s
NS: Can we scale k8s to the size we should need now, or do we have enough buffer?
ABG: We don’t have a dashboard at the moment, it broke in migration. For ceph we have free storage. (https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1 should be refreshed)
TV: If we make the control nodes larger, theory is the errors would go away
ABG: I believe the control nodes have never been resized since the creation of the cluster.
AB: Live resizing of nodes works well, give it a try!
TV: They have special flavors.
AB: May not work with old flavors
NS: Speaking of resources, can we change the default quota? Concerned that some tools that could use more won’t ask for it.
TV: Quite possible we could have failed cron jobs or tools hitting quota that aren’t aware they are
AB: Find an example of this being problematic, and I would support raising the quota.
TV: I saw several examples when grepping earlier today
Action items:
- Ship buildpacks with NFS. Add a feature flag to allow opting out of NFS mounts, to permit experimentation with moving away from NFS.
- Refresh https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1
- Re-open ticket / discussion for k8s quotas
- Add more resources to toolforge k8s control nodes
- Add more worker nodes to toolforge k8s cluster and remove grid nodes