Wikimedia Cloud Services team/EnhancementProposals/Kubernetes Migration Plan and Timeline
Several past technical decisions made upgrading the old cluster in place difficult and time-consuming. We have therefore elected to deploy a fully modernized Kubernetes cluster in parallel with the old one and to migrate users in three phases. This should allow regular upgrades on a semi-annual schedule, so that we do not fall significantly behind upstream development, and a quick response to CVE announcements with little interruption of service. The new cluster will also be designed to be flexible enough to migrate and create services beyond the existing webservice deployments.
The original cluster was deployed when Kubernetes natively included very little multitenant capability. This, coupled with some particular technical debt that this plan does not yet fully address, required compiled-in custom admission controllers, some customized tooling, and hand-made Docker images that are deployed by script.
The new cluster has to address the development burden of compiled-in controllers, the tight coupling to production puppetization that meets requirements variously more and less stringent than our own, and the inflexibility in design that would block future service changes.
By deploying the new cluster in parallel, we can use kubectl's native mechanism for migrating users: the current-context field in the config file, which the command itself manages. The updated maintain-kubeusers service that runs on the new cluster prepares Kubernetes user accounts, initial quotas, policies, and so on. It can also merge a user's x509 certificate information into an existing configuration, so that a tool account can simply run kubectl config use-context to change to the new cluster, and even revert if there are problems. To ensure that it runs smoothly against every account, this tool will require an initial run by hand. This mechanism will be useful in phases 1 and 2. In phase 3 (full deprecation of the old cluster), the old kubectl command will be removed from bastions and the maintain-kubeusers tool will be run to set every account to the new cluster’s context. The old cluster's context is currently named “default”, so the new one will be named “toolforge”.
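The context switch described above can be sketched with standard kubectl config subcommands. This is a minimal illustration, assuming the kubeconfig already contains both contexts named in this plan (“default” for the old cluster, “toolforge” for the new one). The helper name use_cluster and the KUBECTL variable are hypothetical and exist only for this example.

```shell
# Sketch only: switch a tool account between the two clusters by context.
# Assumption: maintain-kubeusers has already merged both contexts into
# the account's kubeconfig. KUBECTL and use_cluster are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"

use_cluster() {
    # $1: context name, "default" (old cluster) or "toolforge" (new cluster)
    case "$1" in
        default|toolforge)
            "$KUBECTL" config use-context "$1"
            ;;
        *)
            echo "unknown context: $1" >&2
            return 1
            ;;
    esac
}

# Typical session (commented out so that sourcing this file changes nothing):
#   kubectl config get-contexts      # list the contexts in the kubeconfig
#   kubectl config current-context   # show which one is active
#   use_cluster toolforge            # move to the new cluster
#   use_cluster default              # revert if there are problems
```

Restricting the helper to the two known context names keeps a typo from silently selecting some other entry in the config file.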
Rough Steps in Order
1. Deploy the new maintain-kubeusers tool in a special mode to run by hand (with the --once argument and --gentle mode) and run it against all Toolforge tool accounts. The gentle mode does not touch the current-context field directly.
2. Deploy the new maintain-kubeusers tool in maintenance mode and gentle mode to handle new accounts.
3. Migrate users via kubectl (phases 1 and 2)
4. Migrate users via the maintain-kubeusers tool in a more aggressive mode, which is not yet coded (phase 3)
5. Decommission the old cluster
In phase 1, with credentials distributed by maintain-kubeusers in gentle mode, migrate 5 tools that are mostly maintained by Toolforge admins or WMCS to the new cluster using kubectl, switching them to the /usr/bin/kubectl binary.
In phase 2, communicate to the Toolforge community that the Kubernetes upgrade is now ready for open beta. Encourage users to adopt the new cluster early so that issues can be found and fixed. The only caveat is a mild inconvenience around the kubectl command after they switch: they should use /usr/bin/kubectl.
This phase should see the most user education and the most issues. Users will need to adapt to requesting more memory and CPU on the command line (as they do with Grid Engine), and we may run into scaling problems.
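Because the new cluster applies quotas and policies to each tool account, it can help to look at the limits in effect before asking for more memory or CPU. The following is a minimal sketch using standard kubectl subcommands. It assumes that the account's current context already points at the new cluster and that the account is allowed to read these objects in its own namespace. The function name show_limits and the KUBECTL variable are hypothetical.

```shell
# Sketch only: print the quota and default limits in the current namespace.
# Assumptions: the active context is the new cluster, and the account may
# read ResourceQuota and LimitRange objects. show_limits is an illustrative name.
KUBECTL="${KUBECTL:-/usr/bin/kubectl}"

show_limits() {
    # Namespace-wide totals (CPU, memory, object counts) and current usage.
    "$KUBECTL" describe resourcequota || return 1
    # Per-container defaults and maximums, if any are defined.
    "$KUBECTL" describe limitrange
}
```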
As confidence grows that the problems have been worked out, we should declare general release and set a deadline for manually moving tools over.
Example User Instructions for Phase 2
- Log into a Toolforge bastion (e.g. login.tools.wmflabs.org)
- Stop your web service and wait a minute or so to make sure it has really stopped:
webservice stop
- Switch your Kubernetes "context" to the new cluster:
kubectl config use-context toolforge
- Configure your shell to use a newer version of kubectl:
echo "alias kubectl=/usr/bin/kubectl" >> $HOME/.profile
- Launch your web service with the same command you used on the old cluster:
webservice --backend=kubernetes ...
- After a moment or so, check that the service launched successfully at your usual web location (e.g. https://tools.wmflabs.org/$mytool).
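The steps above can also be collected into one small helper for tool maintainers who move several tools. This is a sketch under stated assumptions, not an official script: it is meant to be run as the tool account on a bastion, and the names migrate_tool, check_tool, STOP_WAIT, PROFILE and ALIAS_LINE are hypothetical. The webservice arguments are passed through unchanged, since they differ per tool.

```shell
# Sketch only: the phase 2 steps wrapped in two functions.
# Assumptions: run as the tool account on a bastion; all names defined
# here are illustrative and not part of webservice or kubectl.
STOP_WAIT="${STOP_WAIT:-60}"            # seconds to wait after stopping
PROFILE="${PROFILE:-$HOME/.profile}"    # file that receives the alias
ALIAS_LINE='alias kubectl=/usr/bin/kubectl'

migrate_tool() {
    # "$@": the same arguments you passed to webservice on the old cluster
    webservice stop || return 1
    sleep "$STOP_WAIT"

    kubectl config use-context toolforge || return 1

    # Add the alias only once, so repeated runs do not grow the file.
    grep -qxF "$ALIAS_LINE" "$PROFILE" 2>/dev/null ||
        echo "$ALIAS_LINE" >> "$PROFILE"

    webservice --backend=kubernetes "$@"
}

check_tool() {
    # $1: tool name; prints the HTTP status code of the tool's front page
    curl -sS -o /dev/null -w '%{http_code}\n' "https://tools.wmflabs.org/$1"
}
```

Unlike the plain echo in the instructions, the grep guard keeps the alias from being appended again if the helper is run a second time.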
In phase 3, maintain-kubeusers should be run manually to switch all tools to the “toolforge” context. We should have a clear script prepared that lists all services still running on the old cluster; any that remain should be stopped and restarted to move them off it. Also remove /usr/local/bin/kubectl from all bastions to simplify command line use.
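A script for listing what is still running on the old cluster might look like the sketch below. It rests on several assumptions that should be checked before use: that it runs with admin credentials whose kubeconfig names the old cluster's context “default”, and that each tool's web service appears as a Deployment. The function name list_old_cluster_namespaces and the KUBECTL variable are hypothetical.

```shell
# Sketch only: list namespaces that still have Deployments on the old cluster.
# Assumptions: admin credentials, an old-cluster context named "default",
# and web services that run as Deployments. All names here are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

list_old_cluster_namespaces() {
    # One namespace per Deployment, then collapse duplicates.
    "$KUBECTL" --context=default get deployments --all-namespaces \
        --no-headers -o custom-columns=NAMESPACE:.metadata.namespace |
        sort -u
}
```

An empty result would suggest that nothing is left to move, subject to the assumptions above.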
After these steps, the old cluster should be shut down. This should also start a phase of cleaning up Puppet and webservice to remove cruft left over from the old cluster.
Timeline
- 2019-12-16: Finish all preparation for Phase 1 (done)
- 2019-12-31: Phase 1 completed and issues remediated (done)
- 2020-01-06: Phase 2.1 announced and begun (on 2020-01-09)
- 2020-01-27: Phase 2.2 begins
- 2020-02-10: Phase 3 started and quite likely completed. This portion should be somewhat flexible depending on how things go in Phase 2.