Portal:Toolforge/Admin/Kubernetes/RBAC and PSP
This is a proposed design for a Role-Based Access Control (RBAC) and Pod Security Policy (PSP) system that will replace two of the four custom admission controllers currently in use in our Toolforge Kubernetes cluster, in order to unblock the upgrade cycle.
This design is live in the toolsbeta and tools 2020 Kubernetes clusters.
Kubernetes RBAC Role-bindings
Both PSPs and Roles are assigned through bindings, either at the namespace level (a RoleBinding) or at the cluster level (a ClusterRoleBinding). A binding grants a user, service account, or similar system object a role, which in turn allows one or more verbs on particular API objects. These verbs do not make sense for every API object, and the documentation can be sparse outside of code-generated reference docs. In general, Toolforge user accounts are only permitted to act within their own namespace, so their roles and policies are usually applied via a RoleBinding scoped to that namespace.
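As a concrete illustration, a binding that applies a cluster-level role to a tool user only within that tool's namespace would look roughly like the following sketch; the user, namespace, and binding names here are placeholders, and the tools-user ClusterRole is defined later on this page:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tool-example-binding      # placeholder name
  namespace: tool-example         # the tool's own namespace
subjects:
- kind: User
  name: tool-example              # the tool's identity, from its x509 certificate
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: tools-user                # the shared ClusterRole described below
  apiGroup: rbac.authorization.k8s.io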
Pod Security Policies
Full documentation on PSPs is available here: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
PSPs are a whitelisting system. This means that, at any given time, the object trying to take an action will use the most permissive policy its rolebindings allow. The (cluster)rolebinding verb here is, literally, "use".
PSPs are defined at the cluster scope, but they can be "use"d in a namespaced fashion, which helps us here.
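As a sketch, the rule that whitelists a single PSP within a namespace looks roughly like this (all names here are illustrative); a RoleBinding then attaches the rule to a user or service account:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psp-use-example        # illustrative name
  namespace: tool-example      # the namespace the PSP may be "use"d in
rules:
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  resourceNames:
  - some-psp                   # the specific policy being whitelisted
  verbs:
  - use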

The privileged policy
In the proposed PSP design, service accounts (automations) in the kube-system namespace can basically do anything, so that the cluster can actually function and its controllers can do their work. This "do anything" policy is named "privileged" and is as follows (in YAML):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    # See https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp
    # See also https://docs.docker.com/engine/security/seccomp/
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  name: privileged
spec:
  allowedCapabilities:
  - '*'
  allowPrivilegeEscalation: true
  fsGroup:
    rule: 'RunAsAny'
  hostIPC: true
  hostNetwork: true
  hostPID: true
  hostPorts:
  - min: 0
    max: 65535
  privileged: true
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  volumes:
  - '*'
Explanation
This policy should also be applied to other cluster-level controllers, such as the ingress controller and the registry-checking admission controller, since they have to run in privileged mode.
This policy is roughly the same as turning Pod Security Policies off for anything that can use it.
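A minimal sketch of how a policy like this could be granted to everything in kube-system (the exact objects in the cluster may differ):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: privileged-psp-use         # illustrative name
rules:
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  resourceNames:
  - privileged
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-system-psp            # illustrative name
  namespace: kube-system           # only grants the PSP within kube-system
subjects:
- kind: Group
  name: system:serviceaccounts:kube-system
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: privileged-psp-use
  apiGroup: rbac.authorization.k8s.io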
System default policy
This policy will not be applied to anything initially; it is there to be used by services maintained by Toolforge administrators for the good of the system, not by tools themselves. It prevents a service from running in any privileged context or as root, but it does not require any particular user ID. If we launch jobs or services that don't need to make changes inside Kubernetes itself, this is the policy to apply. The current proposal for it is as follows:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
  name: default
spec:
  allowedCapabilities: []  # default set of capabilities are implicitly allowed
  allowPrivilegeEscalation: false
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  volumes:
  - 'configMap'
  - 'downwardAPI'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  # Restrict host paths by default
  allowedHostPaths:
  - pathPrefix: '/var/lib/sss/pipes'
    readOnly: false
  - pathPrefix: '/data/project'
    readOnly: false
  - pathPrefix: '/public/dumps'
    readOnly: false
  - pathPrefix: '/public/scratch'
    readOnly: false
  - pathPrefix: '/etc/wmcs-project'
    readOnly: true
  - pathPrefix: '/etc/ldap.yaml'
    readOnly: true
  - pathPrefix: '/etc/novaobserver.yaml'
    readOnly: true
  - pathPrefix: '/etc/ldap.conf'
    readOnly: true
Explanation
This is similar to what Toolforge users will have, except that it does not require a specific user ID (only that it is not root) and it limits host mounts to the same paths users can already see. It is meant to keep well-behaved services that need no special privileges well-behaved.
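If and when we do apply it, the binding would follow the same "use" pattern as above; for a hypothetical system service it might look like this (all names are placeholders):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-psp-use            # placeholder name
  namespace: example-system-svc    # hypothetical namespace for a system service
subjects:
- kind: ServiceAccount
  name: example-svc                # hypothetical service account
  namespace: example-system-svc
roleRef:
  kind: Role                       # a Role granting "use" of the "default" PSP,
  name: default-psp-use            # mirroring the earlier "use" sketch
  apiGroup: rbac.authorization.k8s.io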
Toolforge user policies
Toolforge user accounts, identified by their x509 certificates, each require an automatically generated PSP in order to restrict their actions to the user ID and group ID of their tool account. This is defined inside the maintain_kubeusers.py script using API objects, but translated into YAML it looks like:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
  name: tool-{$username}-psp
spec:
  requiredDropCapabilities:
  - ALL
  allowPrivilegeEscalation: false
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    # May only act as the tool group
    - min: $user.id
      max: $user.id
  hostIPC: false
  hostNetwork: false
  hostPID: false
  privileged: false
  readOnlyRootFilesystem: false
  runAsUser:
    rule: 'MustRunAs'
    ranges:
    # May only act as the tool user
    - min: $user.id
      max: $user.id
  seLinux:
    rule: 'RunAsAny'
  runAsGroup:
    rule: 'MustRunAs'
    ranges:
    # May only act as the tool group
    - min: $user.id
      max: $user.id
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  volumes:
  - 'configMap'
  - 'downwardAPI'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'hostPath'
  - 'persistentVolumeClaim'
  # Restrict host paths
  allowedHostPaths:
  - pathPrefix: '/var/lib/sss/pipes'
    readOnly: false
  - pathPrefix: '/data/project'
    readOnly: false
  - pathPrefix: '/data/scratch'
    readOnly: false
  - pathPrefix: '/public/dumps'
    readOnly: true
  - pathPrefix: '/etc/wmcs-project'
    readOnly: true
  - pathPrefix: '/etc/ldap.yaml'
    readOnly: true
  - pathPrefix: '/etc/novaobserver.yaml'
    readOnly: true
  - pathPrefix: '/etc/ldap.conf'
    readOnly: true
Explanation
This is applied with a RoleBinding, which means that the only place a Toolforge user can launch a pod is in their own namespace. They can also only launch a service whose security context includes their user and group ID. They can apply supplemental groups other than the root group, but this is not likely to be used very often. The host paths are the ones currently allowed. Persistent volumes are not currently in the design, but they are included to "future proof" these policies. PSPs are defined at the cluster level, but each Toolforge user will have their own because of the UID requirement, which makes large changes annoying at the least.
Roles
Root on the control plane can use the "cluster-admin" role by default; not much else should be using that. Special roles should be defined for Toolforge services, offering only the minimum required capabilities. Toolforge users can all use the same role defined at the cluster level (a "ClusterRole") with a namespaced role binding.
Toolforge user roles
All Toolforge users share a single ClusterRole, which they can use only within their own namespaces:
# RBAC minimum perms for toolforge users:
# verbs for R/O
#   ["get", "list", "watch"]
# verbs for R/W (there are some specific quirks like deletecollection)
#   ["get", "list", "watch", "create", "update", "patch", "delete"]
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tools-user
rules:
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces
  - namespaces/status
  - persistentvolumeclaims
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - pods
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  - replicationcontrollers
  - replicationcontrollers/scale
  - secrets
  - services
  - services/proxy
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  - daemonsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - deployments/rollback
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - extensions
  resources:
  - daemonsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - deployments
  - deployments/rollback
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
Explanation
The easiest way to visualize all that is as a table.
API | Resource | Verbs |
---|---|---|
CoreV1 (apiGroup: "") | bindings | get,list,watch |
CoreV1 (apiGroup: "") | configmaps | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | endpoints | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | events | get,list,watch |
CoreV1 (apiGroup: "") | limitranges | get,list,watch |
CoreV1 (apiGroup: "") | namespaces | get,list,watch |
CoreV1 (apiGroup: "") | namespaces/status | get,list,watch |
CoreV1 (apiGroup: "") | persistentvolumeclaims | get,list,watch |
CoreV1 (apiGroup: "") | pods | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | pods/attach | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | pods/exec | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | pods/log | get,list,watch |
CoreV1 (apiGroup: "") | pods/portforward | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | pods/proxy | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | pods/status | get,list,watch |
CoreV1 (apiGroup: "") | replicationcontrollers | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | replicationcontrollers/scale | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | replicationcontrollers/status | get,list,watch |
CoreV1 (apiGroup: "") | resourcequotas | get,list,watch |
CoreV1 (apiGroup: "") | resourcequotas/status | get,list,watch |
CoreV1 (apiGroup: "") | secrets | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | services | get,list,watch,create,delete,deletecollection,patch,update |
CoreV1 (apiGroup: "") | services/proxy | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | daemonsets | get,list,watch |
ExtensionsV1beta1 (apiGroup: extensions) | deployments | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | deployments/rollback | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | deployments/scale | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | ingresses | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | networkpolicies | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | replicasets | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | replicasets/scale | get,list,watch,create,delete,deletecollection,patch,update |
ExtensionsV1beta1 (apiGroup: extensions) | replicationcontrollers/scale | get,list,watch,create,delete,deletecollection,patch,update |
NetworkingV1 (apiGroup: networking.k8s.io) | ingresses | get,list,watch,create,delete,deletecollection,patch,update |
NetworkingV1 (apiGroup: networking.k8s.io) | networkpolicies | get,list,watch,create,delete,deletecollection,patch,update |
PolicyV1beta1 (apiGroup: policy) | poddisruptionbudgets | get,list,watch |
AppsV1 (apiGroup: apps) | controllerrevisions | get,list,watch |
AppsV1 (apiGroup: apps) | daemonsets | get,list,watch |
AppsV1 (apiGroup: apps) | deployments | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | deployments/rollback | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | deployments/scale | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | replicasets | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | replicasets/scale | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | statefulsets | get,list,watch,create,delete,deletecollection,patch,update |
AppsV1 (apiGroup: apps) | statefulsets/scale | get,list,watch,create,delete,deletecollection,patch,update |
BatchV1Api (apiGroup: batch) | cronjobs | get,list,watch,create,delete,deletecollection,patch,update |
BatchV1Api (apiGroup: batch) | jobs | get,list,watch,create,delete,deletecollection,patch,update |
AutoscalingV1Api (apiGroup: autoscaling) | horizontalpodautoscalers | get,list,watch |
The reason there is so much apparent repetition is that, in various versions of Kubernetes, the same resources appear under multiple API groups as features graduate from alpha/beta/extensions into the core or apps APIs. In later versions (1.16, for instance), many of the resources under extensions are only found under apps.
Most of this is likely not controversial, but there are some things to consider. Users can do nearly all of this in the current Toolforge. What is new are ingresses and networkpolicies. The reason users can launch ingresses is so they can run services that are accessible to the outside, and networkpolicies are, I think, required for ingresses to work properly; that last part about networkpolicies may be worth testing first. Each namespace should have quotas applied, so scaling is not something I fear. "poddisruptionbudgets" are an HA feature that I don't think we should restrict either (see https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Another consideration is that we may want to restrict deletecollection in some cases: particularly for configmaps, where deleting all configmaps in a namespace would recycle the tool's x509 certs, and for secrets, where a user might inadvertently revoke their own service account credentials (rendering Deployments non-functional).
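If we decide to restrict that, the change would be to move configmaps and secrets out of the broad read/write rule into their own rule that omits the deletecollection verb, roughly like this fragment of the rules list (a sketch, not part of the current proposal):
# Hypothetical rule: read/write on configmaps and secrets without
# deletecollection; they would be dropped from the broader rule above.
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - patch
  - update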
One important note: for this role and the PSP for Toolforge users to work right, they must be bound to both the Toolforge user and the $namespace:default service account, which is what a replicationcontroller runs as (and therefore the thing that launches pods for a Deployment object). This last piece hasn't been included in maintain_kubeusers.py yet, but it will be before launch.
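A sketch of what one such binding could look like for a single tool (here, for the PSP "use" role); the actual objects are generated by maintain_kubeusers.py and the names below are placeholders:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tool-example-psp-binding   # placeholder name
  namespace: tool-example
subjects:
- kind: User                       # the tool's x509 identity
  name: tool-example
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount             # what Deployment-managed pods run as
  name: default
  namespace: tool-example
roleRef:
  kind: Role                       # a role granting "use" of this tool's PSP,
  name: tool-example-psp-use       # as in the earlier "use" sketch (placeholder name)
  apiGroup: rbac.authorization.k8s.io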
Observer role
See also
Some other interesting information related to this topic: