Portal:Toolforge/Admin/Toolforge Kubernetes RBAC and PSP

This is a proposal for the design of a Role-based Access Control (RBAC) and Pod Security Policy (PSP) system that will replace two of the four custom admission controllers currently in use in our Toolforge Kubernetes cluster, in order to unblock the upgrade cycle.

This design is being tested in minikube and partially in the live cluster in the toolsbeta project.

Kubernetes RBAC Role-bindings

Both PSPs and Roles are assigned through bindings, at either the namespace level (rolebinding) or the cluster level (clusterrolebinding). A role binding grants a user, serviceaccount or similar system object the permissions of a role, which pairs API objects with one or more verbs. These verbs do not make sense for every API object, and the documentation can be sparse outside of code-based, generated docs. In general, Toolforge user accounts are only permitted to act within their own namespace, so their permissions will usually be applied via a rolebinding scoped to that namespace.
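
As a rough illustration of a namespaced binding, the sketch below grants a role to one user inside a single namespace. The namespace `tool-example`, the user `example` and the binding name are hypothetical, and the built-in `view` ClusterRole only stands in for whatever role is actually granted.

```yaml
# Namespaced binding: the subject gets the role's permissions
# only inside the tool-example namespace (hypothetical name).
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-view          # hypothetical binding name
  namespace: tool-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                  # built-in read-only ClusterRole, used as a stand-in
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: example               # hypothetical user name
```

A ClusterRoleBinding has the same shape but no `metadata.namespace`, and it grants the role in every namespace.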

Pod Security Policies

Full documentation on PSPs is available here: https://kubernetes.io/docs/concepts/policy/pod-security-policy/

PSPs are a whitelisting system. This means that a pod is admitted if any policy that the acting object's rolebindings allow would admit it, so the effective restriction is that of the most permissive policy available to that object. The (cluster)rolebinding verb here is, literally, "use".

PSPs are defined at the cluster scope, but they can be "use"d in a namespaced fashion, which helps us here.
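
A minimal sketch of what granting the "use" verb might look like is shown below. The names `example-psp` and `example-psp-user` are hypothetical and do not come from this proposal.

```yaml
# ClusterRole that permits "use" of one named PSP (hypothetical names).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-psp-user
rules:
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  resourceNames:
  - example-psp               # only this policy may be used
  verbs:
  - use
```

Referencing this ClusterRole from a RoleBinding, rather than a ClusterRoleBinding, limits the grant to pods in the binding's namespace, even though the PSP itself is a cluster-scoped object.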

PSP Proposal Diagram

The privileged policy

In the proposed PSP design, service accounts (automations) in the kube-system namespace can do essentially anything, so that the cluster can actually function and its controllers work. This "do anything" policy is named "privileged" and is defined as follows (in YAML).

YAML

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    # See https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
    # See also https://docs.docker.com/engine/security/seccomp/
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  name: privileged
spec:
  allowedCapabilities:
  - '*'
  allowPrivilegeEscalation: true
  fsGroup:
    rule: 'RunAsAny'
  hostIPC: true
  hostNetwork: true
  hostPID: true
  hostPorts:
  - min: 0
    max: 65535
  privileged: true
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  volumes:
  - '*'

Explanation

This policy should also be applied to other cluster-wide controllers, like the ingress controller and the registry-checking admission controller, since they have to run in privileged mode.

This policy is roughly the same as turning Pod Security Policies off for anything that can use it.
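
One possible way to hand this policy to every service account in kube-system is sketched below. The ClusterRole and RoleBinding names are hypothetical; `system:serviceaccounts:kube-system` is the standard Kubernetes group that contains all service accounts of that namespace.

```yaml
# Hypothetical ClusterRole allowing use of the "privileged" PSP.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: privileged-psp
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['privileged']
  verbs: ['use']
---
# Bound only inside kube-system, to all service accounts of that namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-system-privileged-psp
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: privileged-psp
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:kube-system
```

Controllers that live in other namespaces, such as an ingress controller, would need a similar RoleBinding in their own namespace.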

System default policy

This policy will not be applied to anything initially, but it is there to be used by services maintained by Toolforge administrators for the good of the system, not for tools themselves. This prevents a service from doing anything in a privileged scope or as root, but it does not specify any particular userid to run as. If we launch jobs or services that don't need to make changes inside Kubernetes itself, this would be the policy to apply. The current proposal for it is as follows:

YAML

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName:  'runtime/default'
  name: default
spec:
  allowedCapabilities: []  # default set of capabilities are implicitly allowed
  allowPrivilegeEscalation: false
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      # Forbid adding the root group.
      - min: 1
        max: 65535
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      # Forbid adding the root group.
      - min: 1
        max: 65535
  volumes:
  - 'configMap'
  - 'downwardAPI'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  # Restrict host paths by default (this only applies to 'hostPath' volumes,
  # which are not in the volumes list above)
  allowedHostPaths:
  - pathPrefix: '/var/lib/sss/pipes'
    readOnly: false
  - pathPrefix: '/data/project'
    readOnly: false
  - pathPrefix: '/public/dumps'
    readOnly: false
  - pathPrefix: '/public/scratch'
    readOnly: false
  - pathPrefix: '/etc/wmcs-project'
    readOnly: true
  - pathPrefix: '/etc/ldap.yaml'
    readOnly: true
  - pathPrefix: '/etc/novaobserver.yaml'
    readOnly: true
  - pathPrefix: '/etc/ldap.conf'
    readOnly: true

Explanation

This is similar to what Toolforge users will have, except that it does not specify a user ID (just not root) and prevents host mounts other than what users can see. It is meant to keep well-behaved services that need no special privileges well-behaved.
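
To make the restrictions concrete, here is a sketch of a pod that the "default" policy would be expected to admit. The pod name, namespace, image and UID are all placeholders chosen for illustration.

```yaml
# Hypothetical admin-maintained service pod.
apiVersion: v1
kind: Pod
metadata:
  name: example-service
  namespace: example-admin-namespace
spec:
  securityContext:
    runAsUser: 1000           # any non-root UID satisfies MustRunAsNonRoot
    fsGroup: 1000             # must fall within 1-65535
    supplementalGroups:
    - 1000                    # must fall within 1-65535
  containers:
  - name: main
    image: example.invalid/service:latest   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir: {}              # one of the volume types the policy lists
```

Setting `privileged: true`, `hostNetwork: true` or `runAsUser: 0` in the same pod would instead cause it to be rejected under this policy.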

Toolforge user policies

Toolforge user accounts, defined by their x509 certificates, each require an automatically-generated PSP in order to restrict their actions to the user id and group id of their accounts. This is defined inside the maintain_kubeusers.py script using API objects, but translated into YAML it looks like:

YAML

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
  name: tool-{$username}-psp
spec:
  allowedCapabilities: []
  allowPrivilegeEscalation: false
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      # May only act as the tool group
      - min: $user.id
        max: $user.id
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'MustRunAs'
    ranges:
      # May only act as the tool user
      - min: $user.id
        max: $user.id
  seLinux:
    rule: 'RunAsAny'
  runAsGroup:
    rule: 'MustRunAs'
    ranges:
      # May only act as the tool group
      - min: $user.id
        max: $user.id
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      # Forbid adding the root group.
      - min: 1
        max: 65535
  volumes:
  - 'configMap'
  - 'downwardAPI'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'hostPath'
  - 'persistentVolumeClaim'
  # Restrict host paths
  allowedHostPaths:
  - pathPrefix: '/var/lib/sss/pipes'
    readOnly: false
  - pathPrefix: '/data/project'
    readOnly: false
  - pathPrefix: '/public/scratch'
    readOnly: false
  - pathPrefix: '/public/dumps'
    readOnly: true
  - pathPrefix: '/etc/wmcs-project'
    readOnly: true
  - pathPrefix: '/etc/ldap.yaml'
    readOnly: true
  - pathPrefix: '/etc/novaobserver.yaml'
    readOnly: true
  - pathPrefix: '/etc/ldap.conf'
    readOnly: true

Explanation

This is applied with a rolebinding, which means that the only place a Toolforge user can launch a pod is in their own namespace. They can also only launch a service whose security context includes their user and group ID. They can apply supplemental groups other than the root group, but this is not likely to be used often. The host paths are the ones currently allowed. Persistent volumes are not currently in the design, but they are included to "future proof" these policies. PSPs are defined at the cluster level, but each Toolforge user will have their own because of the UID requirement, which makes large changes annoying at the very least.
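
As an illustration, a pod that fits such a per-tool policy might look like the sketch below. The tool name `example`, the namespace `tool-example`, the ID 12345 and the image are hypothetical values, not taken from a real account.

```yaml
# Pod for a hypothetical tool "example" whose UID and GID are both 12345.
apiVersion: v1
kind: Pod
metadata:
  name: example-web
  namespace: tool-example
spec:
  securityContext:
    runAsUser: 12345          # must equal the tool's user id
    runAsGroup: 12345         # must equal the tool's group id
    fsGroup: 12345            # must equal the tool's group id
  containers:
  - name: web
    image: example.invalid/tool-image:latest   # placeholder image
    volumeMounts:
    - name: home
      mountPath: /data/project
  volumes:
  - name: home
    hostPath:
      path: /data/project     # one of the allowedHostPaths prefixes
```

A hostPath outside the listed prefixes, or any other value for `runAsUser`, should be refused at admission time.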

Roles

Root on the control plane can use the "cluster-admin" role by default, and not much else should be using it. Toolforge services should get special roles that offer only the minimum required capabilities. Toolforge users can all use the same role defined at the cluster level (a "ClusterRole") with a namespaced role binding.

Toolforge user roles

The Toolforge users all share one cluster role that they can only use within their namespaces.

YAML

# RBAC minimum perms for toolforge users:
# verbs for R/O
# ["get", "list", "watch"]
# verbs for R/W (there are some specific quirks like deletecollection)
# ["get", "list", "watch", "create", "update", "patch", "delete"]
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tools-user
rules:
  - apiGroups:
    - ""
    resources:
    - bindings
    - events
    - limitranges
    - namespaces
    - namespaces/status
    - persistentvolumeclaims
    - pods/log
    - pods/status
    - replicationcontrollers/status
    - resourcequotas
    - resourcequotas/status
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - configmaps
    - endpoints
    - pods
    - pods/attach
    - pods/exec
    - pods/portforward
    - pods/proxy
    - replicationcontrollers
    - replicationcontrollers/scale
    - secrets
    - services
    - services/proxy
    verbs:
    - get
    - list
    - watch
    - create
    - delete
    - deletecollection
    - patch
    - update
  - apiGroups:
    - apps
    resources:
    - controllerrevisions
    - daemonsets
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - apps
    resources:
    - deployments
    - deployments/rollback
    - deployments/scale
    - replicasets
    - replicasets/scale
    - statefulsets
    - statefulsets/scale
    verbs:
    - get
    - list
    - watch
    - create
    - delete
    - deletecollection
    - patch
    - update
  - apiGroups:
    - autoscaling
    resources:
    - horizontalpodautoscalers
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - batch
    resources:
    - cronjobs
    - jobs
    verbs:
    - get
    - list
    - watch
    - create
    - delete
    - deletecollection
    - patch
    - update
  - apiGroups:
    - extensions
    resources:
    - daemonsets
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - extensions
    resources:
    - deployments
    - deployments/rollback
    - deployments/scale
    - ingresses
    - networkpolicies
    - replicasets
    - replicasets/scale
    - replicationcontrollers/scale
    verbs:
    - get
    - list
    - watch
    - create
    - delete
    - deletecollection
    - patch
    - update
  - apiGroups:
    - networking.k8s.io
    resources:
    - ingresses
    - networkpolicies
    verbs:
    - get
    - list
    - watch
    - create
    - delete
    - deletecollection
    - patch
    - update
  - apiGroups:
    - policy
    resources:
    - poddisruptionbudgets
    verbs:
    - get
    - list
    - watch

Explanation

The easiest way to visualize all that is as a table.

RBAC Permissions Sorted By API and Resource

| API | Resource | Verbs |
|---|---|---|
| CoreV1 (apiGroup: "") | bindings | get,list,watch |
| CoreV1 (apiGroup: "") | configmaps | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | endpoints | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | events | get,list,watch |
| CoreV1 (apiGroup: "") | limitranges | get,list,watch |
| CoreV1 (apiGroup: "") | namespaces | get,list,watch |
| CoreV1 (apiGroup: "") | namespaces/status | get,list,watch |
| CoreV1 (apiGroup: "") | persistentvolumeclaims | get,list,watch |
| CoreV1 (apiGroup: "") | pods | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | pods/attach | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | pods/exec | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | pods/log | get,list,watch |
| CoreV1 (apiGroup: "") | pods/portforward | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | pods/proxy | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | pods/status | get,list,watch |
| CoreV1 (apiGroup: "") | replicationcontrollers | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | replicationcontrollers/scale | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | replicationcontrollers/status | get,list,watch |
| CoreV1 (apiGroup: "") | resourcequotas | get,list,watch |
| CoreV1 (apiGroup: "") | resourcequotas/status | get,list,watch |
| CoreV1 (apiGroup: "") | secrets | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | services | get,list,watch,create,delete,deletecollection,patch,update |
| CoreV1 (apiGroup: "") | services/proxy | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | daemonsets | get,list,watch |
| ExtensionsV1beta1 (apiGroup: extensions) | deployments | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | deployments/rollback | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | deployments/scale | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | ingresses | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | networkpolicies | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | replicasets | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | replicasets/scale | get,list,watch,create,delete,deletecollection,patch,update |
| ExtensionsV1beta1 (apiGroup: extensions) | replicationcontrollers/scale | get,list,watch,create,delete,deletecollection,patch,update |
| NetworkingV1 (apiGroup: networking.k8s.io) | ingresses | get,list,watch,create,delete,deletecollection,patch,update |
| NetworkingV1 (apiGroup: networking.k8s.io) | networkpolicies | get,list,watch,create,delete,deletecollection,patch,update |
| PolicyV1beta1 (apiGroup: policy) | poddisruptionbudgets | get,list,watch |
| AppsV1 (apiGroup: apps) | controllerrevisions | get,list,watch |
| AppsV1 (apiGroup: apps) | daemonsets | get,list,watch |
| AppsV1 (apiGroup: apps) | deployments | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | deployments/rollback | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | deployments/scale | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | replicasets | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | replicasets/scale | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | statefulsets | get,list,watch,create,delete,deletecollection,patch,update |
| AppsV1 (apiGroup: apps) | statefulsets/scale | get,list,watch,create,delete,deletecollection,patch,update |
| BatchV1Api (apiGroup: batch) | cronjobs | get,list,watch,create,delete,deletecollection,patch,update |
| BatchV1Api (apiGroup: batch) | jobs | get,list,watch,create,delete,deletecollection,patch,update |
| AutoscalingV1Api (apiGroup: autoscaling) | horizontalpodautoscalers | get,list,watch |

There is so much apparent repetition because, across Kubernetes versions, the same resources appear under multiple APIs as features graduate from alpha/beta/extensions into the core APIs or the Apps API. In later versions (1.16, for instance) many of the resources under extensions are only found under apps.
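
The same migration shows up in user manifests. The sketch below is a Deployment written against the apps API; the names, labels and image are hypothetical.

```yaml
# Older manifests may declare a Deployment under the extensions group:
#   apiVersion: extensions/v1beta1
# From Kubernetes 1.16 on, Deployments are served only from apps/v1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: tool-example
spec:
  replicas: 1
  selector:                   # required in apps/v1
    matchLabels:
      name: example
  template:
    metadata:
      labels:
        name: example         # must match the selector above
    spec:
      containers:
      - name: main
        image: example.invalid/tool-image:latest   # placeholder image
```

Granting the verbs under both `extensions` and `apps` lets the role keep working for manifests of either style while the cluster is upgraded.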

Most of this is likely not controversial, but there are some things to consider. Users can do nearly all of this in the current Toolforge. What is new is access to ingresses and networkpolicies. Users can launch ingresses so that they are able to launch services that are accessible from the outside, and networkpolicies are, I think, required for ingresses to work properly. That last part about networkpolicies may be worth testing first. Each namespace should have quotas applied, so scaling is not something I fear. "poddisruptionbudgets" are an HA feature that I don't think we should restrict either (see https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Another consideration is that we may want to restrict deletecollection in some cases: particularly for configmaps, where deleting all configmaps in a namespace will recycle the user's x509 certs, and for secrets, where users might inadvertently revoke their own service account credentials (rendering Deployments non-functional).
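
If deletecollection were restricted for configmaps and secrets, one way to express it would be to split those two resources into their own rule. This is only a sketch of that option, not a decision made in this proposal.

```yaml
# Fragment of a ClusterRole "rules" list (not a complete manifest).
# configmaps and secrets keep read-write access, minus deletecollection.
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - patch
  - update
```

Because RBAC rules are purely additive, configmaps and secrets would also have to be removed from the broader core read-write rule for this to have any effect.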

One important note: for this role and the PSP for Toolforge users to work correctly, both must be applied to the Toolforge user and to the $namespace:default service account, which is what a replicationcontroller runs as (and is therefore the thing launching pods in a Deployment object). This last piece hasn't been included in maintain_kubeusers.py yet, but it will be before launch.
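
A binding that covers both subjects could look roughly like the sketch below. The binding name, the user `example` and the namespace `tool-example` are hypothetical; only the `tools-user` ClusterRole name comes from the role shown earlier.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-tool-binding  # hypothetical binding name
  namespace: tool-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tools-user
subjects:
# The Toolforge user, identified by their x509 certificate.
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: example
# The namespace's default service account, which launches pods
# on behalf of Deployments.
- kind: ServiceAccount
  name: default
  namespace: tool-example
```

A `roleRef` can point at only one role, so the grant to "use" the tool's PSP would need either a second binding with the same two subjects or a "use" rule in a role that this binding references.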