Non-root Containers And Devices
The user/group ID related security settings in a Pod's securityContext trigger a problem when users want to
deploy containers that use accelerator devices (via Kubernetes Device Plugins) on Linux. In this blog
post I describe the problem and the work done so far to address it. It's not meant to be a long story about getting the kubernetes/kubernetes (k/k) issue fixed.
Instead, this post aims to raise awareness of the issue and to highlight important device use-cases. This is needed as Kubernetes works on new related features such as support for user namespaces.
Why non-root containers can't use devices and why it matters
One of the key security principles for running containers in Kubernetes is the
principle of least privilege. The Pod/container securityContext specifies the config
options to set, e.g., Linux capabilities, MAC policies, and user/group ID values to achieve this.
Furthermore, cluster admins can use tools like PodSecurityPolicy (deprecated) or
Pod Security Admission (alpha) to enforce the desired security settings for pods that are being deployed in
the cluster. These settings could, for instance, require that containers run with runAsNonRoot set or
that they are forbidden from running with root's group ID in runAsGroup or supplementalGroups.
In Kubernetes, the kubelet builds the list of Device resources to be made available to a container
(based on inputs from the Device Plugins) and the list is included in the CreateContainer CRI message
sent to the CRI container runtime. Each Device contains little information: host/container device
paths and the desired devices cgroups permissions.
The OCI Runtime Spec for Linux Container Configuration expects more detailed information about each device in addition to the devices cgroup fields:
{
        "type": "<string>",
        "path": "<string>",
        "major": <int64>,
        "minor": <int64>,
        "fileMode": <uint32>,
        "uid": <uint32>,
        "gid": <uint32>
},
The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information
from the host for each Device. By default, the runtimes copy the host device's user and group IDs:
- uid(uint32, OPTIONAL) - id of device owner in the container namespace.
- gid(uint32, OPTIONAL) - id of device group in the container namespace.
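As a rough illustration of where these values come from, the sketch below reads a host device node with stat(2) and fills in the OCI device fields listed above. It is a minimal Python sketch, not the code that containerd or CRI-O actually use; the function name `oci_device_from_host` is hypothetical.

```python
import os
import stat


def oci_device_from_host(host_path, container_path=None):
    """Build an OCI-style device entry from a host device node (illustrative only)."""
    st = os.stat(host_path)
    if stat.S_ISCHR(st.st_mode):
        dev_type = "c"
    elif stat.S_ISBLK(st.st_mode):
        dev_type = "b"
    else:
        raise ValueError(f"{host_path} is not a character or block device")
    return {
        "type": dev_type,
        "path": container_path or host_path,
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
        # Permission bits only, e.g. 0o660 for crw-rw----.
        "fileMode": stat.S_IMODE(st.st_mode),
        # Default behavior described above: copy the host owner and group.
        "uid": st.st_uid,
        "gid": st.st_gid,
    }
```

For example, `oci_device_from_host("/dev/dri/renderD128")` on a GPU node would return the host's render group ID in `gid`, whatever that ID happens to be on that node.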
Similarly, the runtimes prepare other mandatory config.json sections based on the CRI fields,
including the ones defined in securityContext: runAsUser/runAsGroup, which become part of the POSIX
platforms user structure via:
- uid(int, REQUIRED) specifies the user ID in the container namespace.
- gid(int, REQUIRED) specifies the group ID in the container namespace.
- additionalGids(array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.
However, the resulting config.json triggers a problem when trying to run containers with
both devices added and a non-root uid/gid set via runAsUser/runAsGroup: the container user process
has no permission to use the device, even when the device's group ID (gid, copied from the host) permits access by a
non-root group. This is because the container user does not belong to that host group (e.g., via additionalGids).
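The failure can be reasoned about with a simplified model of the POSIX owner/group/other permission check. The Python sketch below is only an approximation under stated assumptions: it ignores capabilities, ACLs, and MAC policies, and the function name `can_open_rw` is hypothetical.

```python
def can_open_rw(dev_uid, dev_gid, file_mode, proc_uid, proc_gid, additional_gids=()):
    """Simplified POSIX check for read/write access to a device node.

    Ignores capabilities, ACLs, and MAC policies; uid 0 is treated as always allowed.
    """
    if proc_uid == 0:
        return True
    if proc_uid == dev_uid:
        bits = (file_mode >> 6) & 0o7   # owner bits
    elif dev_gid == proc_gid or dev_gid in additional_gids:
        bits = (file_mode >> 3) & 0o7   # group bits
    else:
        bits = file_mode & 0o7          # "other" bits
    return (bits & 0o6) == 0o6


# Device owned by root with group gid 39 and mode crw-rw---- (0o660),
# container process running as uid 1000 / gid 2000.
denied = can_open_rw(0, 39, 0o660, 1000, 2000)
allowed = can_open_rw(0, 39, 0o660, 1000, 2000, additional_gids=[39])
```

In this model `denied` is False and `allowed` is True: the process only gains access once the device's group appears among its group IDs.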
Running applications that use devices as a non-root user is normal and is expected to work, so that the security principles can be met. Therefore, several alternatives were considered to fill the gap with what PodSec/CRI/OCI support today.
What was done to solve the issue?
You might have noticed from the problem definition that it would at least be possible to work around
the problem by manually adding the device gid(s) to supplementalGroups, or in
the case of just one device, by setting runAsGroup to the device's group id. However, this is problematic because the device gid(s) may have
different values depending on the distro/version of the nodes in the cluster. For example, with GPUs the following commands return different gids on different distros and versions:
Fedora 33:
$ ls -l /dev/dri/
total 0
drwxr-xr-x. 2 root root         80 19.10. 10:21 by-path
crw-rw----+ 1 root video  226,   0 19.10. 10:42 card0
crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
$ grep -e video -e render /etc/group
video:x:39:
render:x:997:
Ubuntu 20.04:
$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root         80 19.10. 17:36 by-path
crw-rw---- 1 root video  226,   0 19.10. 17:36 card0
crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
$ grep -e video -e render /etc/group
video:x:44:
render:x:133:
Which number to choose in your securityContext? Also, what if the runAsGroup/runAsUser values cannot be hard-coded because
they are automatically assigned during pod admission time via external security policies?
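The portability problem can be made concrete with a small Python sketch that parses /etc/group-style text and compares the gids from the two listings above. The helper name `gids_by_name` is hypothetical; the gid values are copied from the Fedora 33 and Ubuntu 20.04 output shown earlier.

```python
def gids_by_name(group_file_text, names):
    """Return {group name: gid} for the requested names from /etc/group-style text."""
    result = {}
    for line in group_file_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 3 and fields[0] in names:
            result[fields[0]] = int(fields[2])
    return result


# Values copied from the listings above.
fedora_33 = "video:x:39:\nrender:x:997:\n"
ubuntu_20_04 = "video:x:44:\nrender:x:133:\n"

fedora_gids = gids_by_name(fedora_33, {"video", "render"})
ubuntu_gids = gids_by_name(ubuntu_20_04, {"video", "render"})
```

A supplementalGroups value that matches the render group on one of these nodes (997) would not match it on the other (133), so a single hard-coded number cannot work across both.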
Unlike volumes with fsGroup, the devices have no official notion of deviceGroup/deviceUser that the CRI runtimes (or kubelet)
would be able to use. We considered using container annotations set by the device plugins (e.g., io.kubernetes.cri.hostDeviceSupplementalGroup/) to get custom OCI config.json uid/gid values.
This would have required changes to all existing device plugins, which was not ideal.
Instead, a solution that is seamless to end-users without getting the device plugin vendors involved was preferred. The selected approach was
to re-use runAsUser and runAsGroup values in config.json for devices:
{
        "type": "c",
        "path": "/dev/foo",
        "major": 123,
        "minor": 4,
        "fileMode": 438,
        "uid": <runAsUser>,
        "gid": <runAsGroup>
},
With the runc OCI runtime (in non-rootless mode), the device is created (mknod(2)) in
the container namespace and the ownership is changed to runAsUser/runAsGroup using chown(2).
Note:
Rootless mode with devices is not supported. Only runAsUser/runAsGroup
are taken into account; for example, the USER setting in the container is currently ignored.
While it is likely that "faulty" deployments (i.e., non-root securityContext + devices) do not exist, to be absolutely sure that no
deployments break, an opt-in config entry was added to both containerd and CRI-O to enable the new behavior. The following:
device_ownership_from_security_context (bool)
defaults to false and must be enabled to use the feature.
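The selected approach can be sketched in a few lines of Python. This is an illustration of the idea, not the containerd or CRI-O implementation; the function name `apply_device_ownership` is hypothetical, and keeping the host-derived value when runAsUser or runAsGroup is unset is an assumption made for this sketch.

```python
def apply_device_ownership(device, run_as_user=None, run_as_group=None,
                           device_ownership_from_security_context=False):
    """Return a copy of an OCI device entry with uid/gid taken from the security context.

    Illustrative only: when the opt-in flag is off, or a value is not set,
    the host-derived uid/gid is kept (an assumption of this sketch).
    """
    updated = dict(device)
    if not device_ownership_from_security_context:
        return updated
    if run_as_user is not None:
        updated["uid"] = run_as_user
    if run_as_group is not None:
        updated["gid"] = run_as_group
    return updated


# Host-derived entry: owned by root with a host group gid of 39; fileMode 438 is 0o666.
host_device = {"type": "c", "path": "/dev/foo", "major": 123, "minor": 4,
               "fileMode": 438, "uid": 0, "gid": 39}

default_result = apply_device_ownership(host_device, 1000, 2000)
opted_in_result = apply_device_ownership(
    host_device, 1000, 2000, device_ownership_from_security_context=True)
```

With the flag left at its default, the entry is unchanged; with the flag enabled, `uid` and `gid` become 1000 and 2000 while the other fields stay as they were.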
See non-root containers using devices after the fix
To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application using hardware accelerators, Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true
or CRI-O with:
[crio.runtime]
device_ownership_from_security_context = true
and the Guaranteed QoS Class Pod that runs DPDK's crypto-perf test utility with this YAML:
...
metadata:
  name: qat-dpdk
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
  - name: crypto-perf
    image: intel/crypto-perf:devel
    ...
    resources:
      requests:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
      limits:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
  ...
To verify the results, check the user and group ID that the container runs as:
$ kubectl exec -it qat-dpdk -c crypto-perf -- id
They are set to non-zero values as expected:
uid=1000 gid=2000 groups=2000,3000
Next, check that the device nodes (qat.intel.com/generic exposes /dev/vfio/ devices) are accessible to runAsUser/runAsGroup:
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
total 0
drwxr-xr-x 2 root root      140 Sep  7 10:55 .
drwxr-xr-x 7 root root      380 Sep  7 10:55 ..
crw------- 1 1000 2000 241,   0 Sep  7 10:55 58
crw------- 1 1000 2000 241,   2 Sep  7 10:55 60
crw------- 1 1000 2000 241,  10 Sep  7 10:55 68
crw------- 1 1000 2000 241,  11 Sep  7 10:55 69
crw-rw-rw- 1 1000 2000  10, 196 Sep  7 10:55 vfio
Finally, check that the non-root container is also allowed to create HugePages:
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/
fsGroup gives a runAsUser writable HugePages emptyDir mountpoint:
total 0
drwxrwsr-x 2 root 3000   0 Sep  7 10:55 .
drwxr-xr-x 7 root root 380 Sep  7 10:55 ..
Help us test it and provide feedback!
The functionality described here is expected to help with cluster security and the configurability of device permissions. Allowing
non-root containers to use devices requires cluster admins to opt in to the functionality by setting
device_ownership_from_security_context = true. To help make it a default setting, please test it and provide your feedback (via SIG-Node meetings or issues)!
The flag is available in CRI-O v1.22 release and queued for containerd v1.6.
More work is needed to get it properly supported. It is known to work with runc, but it also needs to be made to work
with other OCI runtimes, where applicable. For instance, Kata Containers supports device passthrough, which lets it make devices
available to containers in VM sandboxes too.
Moreover, an additional challenge comes with supporting user names and devices. This problem is still open and requires more brainstorming.
Finally, it needs to be understood whether runAsUser/runAsGroup are enough or whether device-specific settings similar to fsGroup are needed in PodSpec/CRI v2.
Thanks
My thanks go to Mike Brown (IBM, containerd), Peter Hunt (Red Hat, CRI-O), and Alexander Kanevskiy (Intel) for providing all the feedback and good conversations.