Wikimedia Cloud Services team/EnhancementProposals/Decision record T362872 Toolforge policy agent enforcement model

Origin task: phab:T362872

Date of the decision: 2024-04-24

No decision meeting was needed; agreement was reached in the task.

Decision taken

Option 1 was chosen: enforcement via validation.

Rationale

It was the preference of everyone involved in the decision request.

Problem

Regardless of which policy agent we finally choose for Toolforge (see {T362233}), we also need to decide how we want to enforce the different resource security policies. The available options differ in the semantics and behavior they give the platform.

Both Kyverno and OPA Gatekeeper can work in different modes:

  • enforcement via validation: reject resource definitions that don't meet the policies.
  • enforcement via mutation: mutate resource definitions so they conform with the policies.
  • no enforcement, only audit: all resources will be evaluated against the policies, and an audit record will be created.

Example of validation:

  • given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
  • if somebody tries to create a Pod resource with `allowPrivilegeEscalation: true`, reject it. An error message will be produced.

Example of mutation:

  • given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
  • every time a Pod resource is created, mutate it (modify it) to add `allowPrivilegeEscalation: false`. No error message will be produced.

Example of audit:

  • given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
  • if a Pod resource doesn't conform to the policy, emit an audit record (but otherwise do nothing else).
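
The three modes above can be sketched as a small simulation. This is only an illustration in plain Python, not how Kyverno or OPA Gatekeeper are configured: the `violations` and `admit` functions and the mode names are hypothetical, and the Pod is a plain dictionary. It assumes the field is checked where Kubernetes defines it, in each container's `securityContext`.

```python
import copy

# Hypothetical policy: every container must set allowPrivilegeEscalation to False.
REQUIRED_KEY = "allowPrivilegeEscalation"
REQUIRED_VALUE = False


def violations(pod):
    """Return the names of containers that do not satisfy the policy."""
    bad = []
    for container in pod["spec"]["containers"]:
        security = container.get("securityContext", {})
        if security.get(REQUIRED_KEY) is not REQUIRED_VALUE:
            bad.append(container["name"])
    return bad


def admit(pod, mode):
    """Simulate an admission decision; returns (allowed, pod, message)."""
    bad = violations(pod)
    if not bad:
        return True, pod, ""
    if mode == "validate":
        # Reject the resource and tell the caller why.
        return False, pod, f"rejected: containers {bad} violate policy"
    if mode == "mutate":
        # Accept a silently corrected copy; the caller sees no error.
        fixed = copy.deepcopy(pod)
        for container in fixed["spec"]["containers"]:
            container.setdefault("securityContext", {})[REQUIRED_KEY] = REQUIRED_VALUE
        return True, fixed, ""
    if mode == "audit":
        # Accept the resource unchanged, but record the violation.
        return True, pod, f"audit: containers {bad} violate policy"
    raise ValueError(f"unknown mode: {mode}")
```

With a Pod whose only container sets `allowPrivilegeEscalation: true`, `admit(pod, "validate")` refuses it with a message, `admit(pod, "mutate")` accepts a corrected copy without any message, and `admit(pod, "audit")` accepts it unchanged and returns an audit note.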

Constraints and risks

  • this affects both ourselves, through the different -api components we have, and tool developers who have direct access to the k8s API.
  • the semantics differ between the options and require a different level of commitment, especially from users who work with the k8s API directly.

Options

Option 1

Enforcement via validation.

This makes everyone explicitly aware of the different policies we have in Toolforge kubernetes, given that they have to manually adapt their code to conform to them.

Pros:

  • possibly the simplest, and perhaps the most classical behavior.
  • the semantics are explicit: if a policy violation happens, a visible error will be produced.

Cons:

  • may require code updates to conform to the policies.
  • given that policies can change, these code updates may be required on a continuous basis.

Option 2

Enforcement via mutation.

This doesn't make everyone explicitly aware of the different policies we have in Toolforge kubernetes, because mutation takes care of updating the resources to conform to the policies.

Pros:

  • transparent enforcement for everyone, no error messages to decode.
  • fewer code updates to track policy changes.

Cons:

  • people are less aware of the different policies we have in Toolforge kubernetes.
  • a piece of software arbitrarily updating resources sounds a bit scary.
  • it is not clear how mutation would work for policy changes and already present resources. For example, a given Pod was mutated to conform to the policy on date X, but the policy has since changed. What do we do with the already defined Pod?
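
The last concern can be made concrete with a small sketch. It is a plain Python simulation under stated assumptions, not the behavior of any specific policy agent: `mutate`, `conforms`, `policy_v1` and `policy_v2` are hypothetical names, and the second policy version (adding `runAsNonRoot: true`) is invented purely for illustration.

```python
import copy


def mutate(pod, policy):
    """Return a copy of pod with every policy field forced onto each container."""
    fixed = copy.deepcopy(pod)
    for container in fixed["spec"]["containers"]:
        container.setdefault("securityContext", {}).update(policy)
    return fixed


def conforms(pod, policy):
    """Check whether every container carries every field the policy requires."""
    for container in pod["spec"]["containers"]:
        security = container.get("securityContext", {})
        for key, value in policy.items():
            if security.get(key) != value:
                return False
    return True


# Hypothetical policy versions: v2 adds a requirement that v1 did not have.
policy_v1 = {"allowPrivilegeEscalation": False}
policy_v2 = {"allowPrivilegeEscalation": False, "runAsNonRoot": True}

# The Pod is mutated once, at creation time, under policy v1.
submitted = {"spec": {"containers": [{"name": "web"}]}}
stored = mutate(submitted, policy_v1)

still_ok_under_v1 = conforms(stored, policy_v1)
ok_under_v2 = conforms(stored, policy_v2)
```

Here `still_ok_under_v1` is `True` but `ok_under_v2` is `False`: the stored Pod was only mutated when it was created, so it silently falls out of conformance once the policy changes, and nobody received an error at any point.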

Option 3

Combination:

  • validation for optional policies
  • mutation for mandatory policies

Given that some resource attributes could be optional, we could introduce some kind of mixed approach.

Pros:

  • maybe the most flexible approach?

Cons:

  • perhaps the most confusing semantics, as some things happen automagically while others require explicit code changes.