Wikimedia Cloud Services team/EnhancementProposals/Toolforge push to deploy

From Wikitech
For user documentation, see Help:Toolforge/Build Service. This page describes an early technical plan of what eventually became the build service.

Summary

Users of modern commercial PaaS offerings increasingly have access to hosted developer pipelines. These provide code hosting, build, and deployment services, all on demand via a commit hook to the hosted repository. The concept of "push-to-deploy" can be thought of as a subset of GitOps. A user does not require shell access to deploy their code. Instead, by pushing to a remote repository (with any secrets stored in some additional settings), the code can be automatically deployed.

The Toolforge community of developers would also benefit from similar integrations. The rise of developer-oriented hosted services has lowered the technical barrier to entry, as well as the monetary cost, for developing and maintaining a public service.

Goals

  • Lower the barrier of entry for Toolforge users who are familiar with other commercial PaaS offerings
  • Enable a modern "push-to-deploy" workflow, wherein pushed code undergoes a CI/CD process, lowering the requirements for tool developers
  • Allow for user-built/managed containers to run in a controlled fashion on Toolforge
  • Reduce runtime dependence on NFS (there are already network concerns with NFS and its use of private addresses)
  • Lower dependence on gridengine (eliminate gridengine webservers, leaving only the jobs)
  • Long-term vision? A single workflow for all of Toolforge (aka, one way to do things), customizable by users

Alternatives?

Can we avoid NIH and adopt an off-the-shelf solution?

[bstorm] Don't want to roll our own solution. Want to stay simple.

  • [lego] do we want the entire solution to be an upstream project? Or are we okay with using upstream building blocks and writing custom stuff to tie it all together? The latter seems like the direction we're going in
  • Openshift?
    • Originally not based on k8s; now a heavily customized k8s from Red Hat. No longer a viable solution
  • Could build CI portion of this and ignore git hosting

A chain that starts a buildpack and deploys; ignore how the code gets there. Users could construct their own git hook to start the process.


User Stories

  • People who need extra dependencies:
    • Need software that's already packaged in Debian
      • Various C/C++ libraries (in -dev packages) for Python/nodejs/etc. libraries to use
      • imagemagick/rsvg type CLI tools for processing
      • tesseract, which needs to be backported from newer Debian versions
    • Software that isn't packaged (or where the packaged version is incompatible) will need gcc, make, etc., which aren't needed at runtime
  • some gridengine use cases (spell them out?)
    • multi-runtime tools. There are tools today on the grid engine webservice backend which mix php, python, perl, ... into the same tool account.
  • make it possible to run a tool webservice without any NFS runtime dependency by including all the necessary application code inside the container's image
    • [lego] we'd need some solution for logging (which would be nice to get off of nfs anyways)
  • Different build and deploy images. Especially for compiled languages (C, Go, Rust, etc.), the build image will need compilers and -dev packages, but the deploy image can be much smaller, maybe nearly nothing for statically compiled stuff
  • People who are good at code and not so good at using ssh shells.
  • Live-hacking/debugging/testing tools without necessarily pushing stuff to git
  • Deploying changes for the {foo} tool to a {foo}-beta tool first, then the main one


What is this?

[bryan] Similar to Heroku. Sign up for an account, get a namespace. Each one of the deployments in Heroku gives you a magic git URL that is the push-to-deploy URL. It's an extra remote repo; it's not for code development, only for release. Striker could create repos inside GitLab for development, and Toolforge then provides the push-to-deploy git URL (add remote, git push).

[arturo] Has used GitLab with Debian, which puts effort into its CI/CD machinery. In Debian, you create an account on salsa, you get free repos, free collaboration, can join projects, etc. As soon as you commit to a repo, you have full access to CI/CD.

[bstorm] GitHub/GitLab is a different model, tightly linked to being admins (which we are not). This model would also move more of Cloud Services into production. How can we auth? Can the prod network contact cloud k8s? Where do we store images? GitLab can do everything if we are admins. As GitLab quasi-admins, could we build plugins and workflows inside of GitLab? We would heavily integrate with GitLab and production. Perhaps being able to add "toolforge" to any repo on GitLab is useful.

Workflow: clone repo, buildpack, deploy.

Concerns

We can't open up to custom images today, largely because of storage (images are not even on NFS; storage is easily filled, with no quotas). There is also the question of enforcing open source software. Lastly, webservice assumes the image you are running can connect to LDAP and figure out the user and NFS (buildpacks could have a base image with the pieces we need, so we control the base and users layer on top).

Existing Toolforge Setup

TODO: Picture representation of Toolforge Workflow

Today, Toolforge users store containers on NFS. These container images are built on top of base images provided by WMCS. They are missing some of the details that would be required for automatic deployment, including authentication. Deployment is manual, via curated Kubernetes commands. Behind the scenes, the information required to launch, including which container image to use, is transferred automagically via a git service.

Presuming we avoid the configuration-heavy user experience that some PaaS offerings require, this is fairly close to the desired "push-to-deploy" model.

Implementation

TODO: Picture representation of desired Toolforge Workflow


  • [?] A git hosting service that users push to, triggering a webhook.
    • Whether we need a separate git hosting service depends on whether we want separate, special push-to-deploy URLs and what production plans to do.
  • A webhook receiver that triggers a CI/CD pipeline (see the sketch after this list)
    • Question: What CI/CD system are we going to use? Argo?
  • Using buildpacks, create an image for that tool/project
    • We use file detection (or maybe service.template) to figure out what language runtimes are needed and then add those layers in.
    • The builder needs to have access to the previously built image for caching purposes
    • Question: how do we get language runtimes installed, without root? Maybe something like https://github.com/heroku/heroku-buildpack-apt ??
  • Deploy a new k8s deployment on the tool
    • We will need a new pod security policy that allows images to run under the new system user, but prevents mounting of any NFS besides dumps.
    • These images will not have NFS mounted (except perhaps dumps, which are read only), but we'll copy over replica.my.cnf.
      • Notably, this means that services will not have access to any filesystem-based persistent storage (still can access ToolsDB/redis/etc.)
    • Our current images assume you will mount a host's sssd directories. This is not actually necessary, except for NFS read-write.
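
A minimal sketch of what the webhook receiver could look like, assuming a small Flask app. The endpoint path, header name, secret handling, and start_build() helper are all placeholders for whatever git host and CI/CD system we end up choosing:

```python
# Sketch only: a tiny webhook receiver that kicks off a build.
# Assumptions: Flask is available; the X-Webhook-Token header, the per-tool
# shared secret, and start_build() are hypothetical placeholders.
import hmac
import os

from flask import Flask, abort, request

app = Flask(__name__)
SHARED_SECRET = os.environ["WEBHOOK_SECRET"]


def start_build(tool: str, repo_url: str, commit: str) -> None:
    """Hypothetical hook into whatever CI/CD system we pick (Argo or otherwise)."""
    raise NotImplementedError


@app.route("/hooks/<tool>", methods=["POST"])
def receive(tool: str):
    # Constant-time comparison of a shared-secret header (header name is an assumption).
    token = request.headers.get("X-Webhook-Token", "")
    if not hmac.compare_digest(token, SHARED_SECRET):
        abort(403)

    payload = request.get_json(force=True)
    # Field names follow common push-event payloads; the exact shape depends on
    # which git host we end up using.
    start_build(
        tool=tool,
        repo_url=payload["repository"]["clone_url"],
        commit=payload["after"],
    )
    return "", 204
```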

buildpacks architecture

Two new docker images: toolforge-buildpack-build and toolforge-buildpack-run. These are the base images for the build and run steps, respectively. For now, these are based on top of the toolforge-buster-sssd image for simplicity, but it would be nice to discard all the various editors and tools we don't need. These images set up the system user account, ID: TBD, so that all building and running happens as non-root. This will be safe because of the new PSP and because there's no tool NFS access.

Example of what a python37 buildpack workflow looks like:

  • python37:
    • Looks for type: python3.7 in service.template (detect logic sketched after this section). If that doesn't match, it'll try a different language runtime, or error out if it can't match any of them
    • Installs Python 3.7, currently from Debian packages
    • Installs pip from PyPA and then virtualenv from PyPI
    • Provides python (version 3.7), pip, virtualenv
  • pip (optional)
    • Looks for a requirements.txt file
    • Creates a virtualenv, installs dependencies
    • Requires python (any version) and virtualenv
  • uwsgi
    • Unconditionally used.
    • Installs uwsgi into a virtualenv
    • Sets launch process to uwsgi ... with roughly the same configuration as webservice-runner.
    • Requires python (any version) and virtualenv

The pip pack is independent of specific Python versions, so it could sit on top of a future python38 pack. We could also add a poetry pack if people want to use that instead of pip.
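
A rough sketch of the python37 detect step described above, written in Python for readability (real Cloud Native Buildpacks detect scripts are usually bin/detect executables, often shell). The service.template handling is an assumption based on the description, not a settled format:

```python
# Sketch only: detect logic for a hypothetical python37 pack.
# Exit code 0 means "this pack applies"; a non-zero code passes detection on
# to the next pack in the order.
import sys

import yaml  # assumption: PyYAML is available in the builder image


def detect(app_dir: str = ".") -> int:
    try:
        with open(f"{app_dir}/service.template") as f:
            config = yaml.safe_load(f) or {}
    except FileNotFoundError:
        return 100  # no service.template; let another pack try

    if config.get("type") == "python3.7":
        return 0  # python37 pack applies
    return 100


if __name__ == "__main__":
    sys.exit(detect())
```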

Definition of Success

  1. "Trivial" Support for new languages and toolchains. Adding support for a new language or toolchain today requires custom image building and testing by WMCS.
  2. Adoption or Usage metric?

Ideas

The road so far

[bstorm]

I've been kicking the tires on buildpacks, and that should move along just fine with a more flexible and appropriate image registry. That piece should be ok with enough work and documentation of our particular modifications and solutions.

It is immediately apparent that there is a solution that can instrument Kubernetes, has an internal docker registry, and can do all the git, CD and auth we'd need--Gitlab. Unfortunately, Gitlab CE, the open source edition, is seriously lacking when it comes to LDAP integration (no groups!) and wants to be more of the core of the setup than I think we'd want it to be in order to make it work. It is also so general purpose that it would be more difficult to limit what users do with it to keep things productive for the movement. It may be possible to make it work with some custom API clients or plugins, but I suspect we would end up spending more time making it work than we would get good use out of it. Gitlab is also an enormous project that would easily consume one tech's full attention to properly support once we start doing lots of customizing.

[legoktm]

I looked into buildpacks; there are some really nice things, but I'm not sure about some other "features". They have this idea that the buildpacks should be independent of the base layer (the "stack") and that, as long as there's a consistent ABI, you can swap out the stack, e.g. for a newer OS. This means that we can't use OS packages, though, so we'd need to have builds (or use someone else's) for every language runtime (see how Ruby is downloaded and installed in https://buildpacks.io/docs/buildpack-author-guide/create-buildpack/build-app/). This might work nicely for languages that have these kinds of builds available and easy to use (e.g. Rust via rustup), but I don't think there are that many. Also, I'm not sold on the consistent ABI idea, given how key libraries like libicu regularly change.

I think if we assume we're stuck on a specific base OS, we can use OS-level packages, which prevents us from needing to figure out how to build/provide language runtimes. buildpacks does have the concept of OS-provided packages: https://buildpacks.io/docs/concepts/components/stack/#mixins

The nicest part about buildpacks is that you can decide which layers to add by analyzing the source code. So we could look at `service.template` to figure out which language runtime we need. In the Python case, we could check whether requirements.txt, pyproject.toml, Pipfile, etc. exists and use the correct tool to install dependencies as needed (a rough sketch of that follows). Then we copy in the source and build the image.
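
A small sketch of the "pick the right install tool" idea from the paragraph above, assuming the manifest file names mentioned there; the commands are illustrative only, not a settled design:

```python
# Sketch only: choose the dependency-install command from whichever manifest
# file is present. File names and commands are illustrative assumptions.
import os
from typing import Optional


def pick_install_command(app_dir: str) -> Optional[str]:
    # Ordered: the first matching manifest decides which tool installs dependencies.
    candidates = [
        ("requirements.txt", "pip install -r requirements.txt"),
        ("pyproject.toml", "pip install ."),       # or a poetry pack, if people want that
        ("Pipfile", "pipenv install --deploy"),
    ]
    for manifest, command in candidates:
        if os.path.exists(os.path.join(app_dir, manifest)):
            return command
    return None  # nothing to install
```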

The next things to look at

[bstorm]

I am looking at doing some experimentation with Gitea as the git layer for this. It is more fully open source, integrates well with LDAP, has a large number of quite impressive features, and can authenticate with other tools nicely (even as an OAuth2 and maybe OIDC provider). The small resource footprint is also attractive. Notably, OpenStack has been adopting it at https://opendev.org/, so we would be moving in the same circles as well. It is also good that at least some of our team already has experience with it. Some experimentation should give us more info.

Harbor would seem to be a great possibility for docker image management. It's a CNCF incubator project that probably does the trick. It's a bit heavier than needed, but it's also multitenant, which brings up possibilities like splitting the repo so that users who over-provision somehow only hurt their own project, etc.

From there, it may be worth looking at Argo again. This is something that others in the org are already working on, so we may even benefit from cross-team collaboration or at least quizzing them. Since its claim to fame is purely putting things from git onto Kubernetes, maybe it can be made to work one way or another.


[phamhi]

I have personally used Gitea and Harbor. Both are great tools, and their Helm charts work out of the box.

Open Questions

  • [nskaggs] Does this work move us closer to deprecation of gridengine in any way, i.e. by enabling support of things only possible on gridengine today?
    • It should be a goal. This + a jobs service could replace everything
  • GitLab + https://docs.gitlab.com/ce/topics/autodevops/customize.html ?
  • Is this toolforge 2.0?
    • Originally we were thinking about this as Toolforge 2.0, but later pivoted away from transparently moving off gridengine onto k8s?
  • Gitlab Feature Request for automagical creation of git repositories for projects
    • if you push to a URL (that you have permission for) gitlab will automatically create it, which is both good and bad --lego
  • What about secrets management? Does gitlab have a solution for this?

Things that grid engine can do that k8s can't yet:

  • multiple runtime languages in the same tool (eg php + perl + python)
  • External libraries/tools like tesseract
  • "easy" cron jobs -- not likley to be fixed by push to deploy at all

Things nothing in Toolforge can do today:

  • custom deb packages per tool (apt-get install ...)

Next steps:

  • something that starts a build with a buildpack for at least one workflow (this is most important, and most flexible), using a CD system of some kind (Argo?) that hooks into Toolforge k8s
  • concurrently, someone needs to work on registry side (access issues?) harbor.io in ceph object stores
    • [bd808] registry not needed until we need to scale (let's not do this first)
    • [bstorm] want to trial out auth, so we should mess with this?

Notes

  • To avoid being blocked by deciding on a git hosting platform, for now we should only use platform-agnostic features, like webhooks (or simulated webhooks via CI). We can revisit having tighter integration with e.g. GitLab later on.
  • Start with no NFS (but secrets management?)
    • we will copy over replica.my.cnf but leave the rest as not implemented/TODO
    • log everything to stdout/stderr

Workflow:

  • triggered based on some workflow
  • run pack to create new image for project
    • access to the docker registry can be given by whitelisting IPs; it's controlled in Apache
  • have the CD solution (Argo?) talk to k8s and restart the webservice (not actually using `webservice`); see the sketch below
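
A minimal sketch of that last step, assuming the official kubernetes Python client. The deployment and container naming convention is hypothetical, and a real flow would also point the deployment at the freshly built image tag:

```python
# Sketch only: restart a tool's webservice deployment after a new image is built.
# Assumes the official `kubernetes` Python client; names below are hypothetical.
from datetime import datetime, timezone

from kubernetes import client, config


def restart_tool_deployment(tool: str, image: str) -> None:
    config.load_kube_config()  # or load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()

    # Point the deployment at the new image and bump a pod-template annotation,
    # which is effectively what `kubectl rollout restart` does.
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "toolforge.org/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                },
                "spec": {"containers": [{"name": tool, "image": image}]},
            }
        }
    }
    apps.patch_namespaced_deployment(
        name=f"{tool}-webservice",  # hypothetical naming convention
        namespace=f"tool-{tool}",   # Toolforge tool namespaces follow tool-<name>
        body=patch,
    )
```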