Wikimedia Cloud Services team/EnhancementProposals/Decision record T362872 Toolforge policy agent enforcement model

Date of the decision: 2024-05-14

People in the decision meeting (alphabetical order):

Andrew
Arturo
David
Francesco
Raymond
Taavi

Decision taken

Option 2 was chosen: enforcement via mutation.

Rationale

Originally, everyone involved in the decision request preferred option 1 (validation) in the Phabricator discussion.

Then Arturo noticed that PSP (PodSecurityPolicy) works via mutation, so we held a decision meeting to discuss it.

At the end, the votes were:

option 1 validation: 1 vote
option 2 mutation: 3 votes
option 3 combination: 1 vote

Problem

Regardless of which policy agent we finally choose for Toolforge (see {T362233}), we also need to decide how we want to enforce the different resource security policies. The options differ in the semantics and behavior of the platform.

Both Kyverno and OPA Gatekeeper can work in different modes:

enforcement via validation: reject resource definitions that don't meet the policies.
enforcement via mutation: mutate resource definitions so they conform with the policies.
no enforcement, only audit: all resources will be evaluated against the policies, and an audit record will be created.

Example of validation:

given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
if somebody tries to create a Pod resource with `allowPrivilegeEscalation: true`, reject it. An error message will be produced.
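
The validation example above can be sketched in a few lines. This is only an illustration of the semantics, not how Kyverno or OPA Gatekeeper implement it: the function name `validate_pod`, the plain-dict Pod representation and the error text are all assumptions made for this sketch.

```python
from typing import Any


def validate_pod(pod: dict[str, Any]) -> tuple[bool, str]:
    """Return (allowed, message) for a Pod manifest given as a plain dict.

    Hypothetical helper: it only mimics the accept/reject semantics of
    enforcement via validation.
    """
    for container in pod.get("spec", {}).get("containers", []):
        security_context = container.get("securityContext", {})
        # The policy requires the field to be present and explicitly false.
        if security_context.get("allowPrivilegeEscalation") is not False:
            name = container.get("name", "<unnamed>")
            return False, f"container {name!r}: allowPrivilegeEscalation must be false"
    return True, ""


bad_pod = {
    "kind": "Pod",
    "spec": {
        "containers": [
            {"name": "web", "securityContext": {"allowPrivilegeEscalation": True}}
        ]
    },
}

allowed, message = validate_pod(bad_pod)
print(allowed, message)
```

The resource is never modified: the caller gets an error and has to fix the definition and submit it again.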

Example of mutation:

given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
every time a Pod resource is created, mutate it (modify it) to add `allowPrivilegeEscalation: false`. No error message will be produced.
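
A minimal sketch of the mutation example above, under stated assumptions: `mutate_pod` is a hypothetical name, Pods are plain dicts, and the sketch assumes an explicit `true` is overwritten as well as a missing field (the real behavior depends on how the policy is written).

```python
import copy
from typing import Any


def mutate_pod(pod: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the Pod manifest that conforms to the policy.

    Hypothetical helper: it only mimics the semantics of enforcement via
    mutation.
    """
    mutated = copy.deepcopy(pod)  # leave the submitted manifest untouched
    for container in mutated.get("spec", {}).get("containers", []):
        # Assumption: an explicit `true` is overwritten too, not only a missing field.
        security_context = container.setdefault("securityContext", {})
        security_context["allowPrivilegeEscalation"] = False
    return mutated


submitted = {
    "kind": "Pod",
    "spec": {
        "containers": [
            {"name": "web"},
            {"name": "sidecar", "securityContext": {"allowPrivilegeEscalation": True}},
        ]
    },
}

stored = mutate_pod(submitted)
for container in stored["spec"]["containers"]:
    print(container["name"], container["securityContext"])
```

The caller sees no error, and what ends up stored differs from what was submitted.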

Example of audit:

given a policy that requires every Pod resource to have `allowPrivilegeEscalation: false`
if a Pod resource doesn't conform to the policy, emit an audit record (but otherwise do nothing else).
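
The audit example above could look roughly like this. The record layout, the policy name `disallow-privilege-escalation` and the function `audit_pod` are made up for illustration; real policy agents have their own report formats.

```python
from typing import Any

POLICY_NAME = "disallow-privilege-escalation"  # hypothetical policy name


def audit_pod(pod: dict[str, Any]) -> list[dict[str, str]]:
    """Return one audit record per non-conforming container.

    Hypothetical helper: the Pod is neither rejected nor modified.
    """
    records = []
    pod_name = pod.get("metadata", {}).get("name", "<unnamed>")
    for container in pod.get("spec", {}).get("containers", []):
        security_context = container.get("securityContext", {})
        if security_context.get("allowPrivilegeEscalation") is not False:
            records.append(
                {
                    "policy": POLICY_NAME,
                    "pod": pod_name,
                    "container": container.get("name", "<unnamed>"),
                    "result": "fail",
                }
            )
    return records


pod = {
    "kind": "Pod",
    "metadata": {"name": "mytool"},
    "spec": {"containers": [{"name": "web"}]},
}

for record in audit_pod(pod):
    print(record)
```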

Constraints and risks

this affects both ourselves, in the different -api components we have, and tool developers that have direct access to the k8s API.
semantics are different, and require a different level of commitment, especially for direct users of the k8s API.

Options

Option 1

Enforcement via validation.

This makes everyone explicitly aware of the different policies we have in Toolforge kubernetes, given they have to manually adapt their code to conform to them.

Pros:

possibly the simplest, and perhaps the most classical behavior.
the semantics are explicit: if a policy violation happens, a visible error will be produced.

Cons:

may require code updates to conform to the policies.
given policies can change, these code updates may be required on a continuous basis.

Option 2

Enforcement via mutation.

This doesn't make everyone explicitly aware of the different policies we have in Toolforge kubernetes, because mutation takes care of updating the resources to conform to the policies.

Pros:

transparent enforcement for everyone, no error messages to decode
fewer code updates to track policy changes

Cons:

people are less aware of the different policies we have in Toolforge kubernetes
a piece of software arbitrarily updating resources sounds a bit scary.
it is not clear how mutation would work for policy changes and already present resources. For example, a given Pod was mutated to conform to the policy on date X, but the policy has now changed. What do we do with the already defined Pod?
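
The last concern can be made concrete with a small sketch. It does not answer the question, it only shows how the gap appears: admission-time mutation runs when a resource is submitted, so a Pod admitted under an older policy keeps the older shape. The two policy versions and both function names are hypothetical; `runAsNonRoot` is used here only as an example of a requirement added later.

```python
import copy
from typing import Any

# Hypothetical policy versions: securityContext field -> required value.
POLICY_ON_DATE_X = {"allowPrivilegeEscalation": False}
POLICY_NOW = {"allowPrivilegeEscalation": False, "runAsNonRoot": True}


def mutate_at_admission(pod: dict[str, Any], policy: dict[str, Any]) -> dict[str, Any]:
    """Apply the policy in force at admission time (runs once, on submission)."""
    mutated = copy.deepcopy(pod)
    for container in mutated.get("spec", {}).get("containers", []):
        container.setdefault("securityContext", {}).update(policy)
    return mutated


def non_conforming_containers(pod: dict[str, Any], policy: dict[str, Any]) -> list[str]:
    """List containers of an already stored Pod that miss any policy field."""
    names = []
    for container in pod.get("spec", {}).get("containers", []):
        security_context = container.get("securityContext", {})
        if any(security_context.get(field) != value for field, value in policy.items()):
            names.append(container.get("name", "<unnamed>"))
    return names


# The Pod was created (and mutated) on date X.
existing_pod = mutate_at_admission(
    {"kind": "Pod", "spec": {"containers": [{"name": "web"}]}},
    POLICY_ON_DATE_X,
)

print(non_conforming_containers(existing_pod, POLICY_ON_DATE_X))
print(non_conforming_containers(existing_pod, POLICY_NOW))
```

Detecting such Pods is the easy part. Whether to leave them alone, report them, or recreate them is the open question raised above.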

Option 3

Combination:

validation for optional policies
mutation for mandatory policies

Given that some resource attributes could be optional, we could introduce some kind of mixed approach.
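
One possible reading of the combined approach, as a sketch only: mandatory attributes are always mutated in, while optional attributes are left alone when absent and validated when the user sets them. The split between `MANDATORY` and `OPTIONAL_ALLOWED`, the use of `privileged` as the optional attribute, and the function name `admit_pod` are assumptions for illustration.

```python
import copy
from typing import Any, Optional

# Hypothetical split of policies for this sketch.
MANDATORY = {"allowPrivilegeEscalation": False}  # always mutated in
OPTIONAL_ALLOWED = {"privileged": {False}}       # validated only if the user sets it


def admit_pod(pod: dict[str, Any]) -> tuple[bool, str, Optional[dict[str, Any]]]:
    """Return (allowed, message, resulting_pod); resulting_pod is None on rejection."""
    mutated = copy.deepcopy(pod)
    for container in mutated.get("spec", {}).get("containers", []):
        security_context = container.setdefault("securityContext", {})
        # Validation: an optional attribute set to a disallowed value is an error.
        for field, allowed_values in OPTIONAL_ALLOWED.items():
            if field in security_context and security_context[field] not in allowed_values:
                name = container.get("name", "<unnamed>")
                return False, f"container {name!r}: {field} has a disallowed value", None
        # Mutation: mandatory attributes are set silently.
        security_context.update(MANDATORY)
    return True, "", mutated


print(admit_pod({"spec": {"containers": [{"name": "web"}]}}))
print(admit_pod({"spec": {"containers": [{"name": "web", "securityContext": {"privileged": True}}]}}))
```

The first call succeeds silently with a modified Pod, the second fails with an error: both behaviors coexist, so users have to know which policy is enforced in which way.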

Pros:

maybe the most flexible approach?

Cons:

perhaps the most confusing semantics? Some things happen automagically, while others require explicit code changes.