However, for the Kubernetes scheduler to make good decisions about where to run pods, it needs to know which node has which volume, and when that changes.
So how does Kubernetes learn about devices? We tried a few different approaches and hit a few challenges I thought might be interesting to talk about.
The first thought was to check whether Kubernetes picked up device changes automatically. There was nothing in the documentation that suggested such a facility existed and watching the kubelet log at a higher log level confirmed that kubelet was completely indifferent to disk attach and detach events.
With the easy option ruled out it was time to look into how to detect disk changes and transmit them to Kubernetes.
The solution would require a DaemonSet with read-only access to the host path. That would create a pod on every node, waiting for disks to attach and detach. Block Storage Volumes can be easily identified when their volume ID appears in the /dev/disk/by-id directory. The changes should show up via inotify events. One quick container wrapped around incrond later and the concept was proved: the disk change events showed up within a pod on Kubernetes.
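To make the detection step concrete, here is a rough sketch in Python of what could happen each time an event fires: rescan the directory and compare it with the previous snapshot. This is not the code we ran (the proof of concept used incrond), and it compares snapshots rather than handling inotify events itself. The `vol-` ID pattern, the example entry names and the function names are all assumptions made for illustration.

```python
import os
import re

# Assumption: volume IDs look like "vol-" followed by lowercase letters/digits.
VOLUME_ID = re.compile(r"vol-[a-z0-9]+")


def volume_ids(entries):
    """Return the set of volume IDs found in a list of directory entry names."""
    found = set()
    for name in entries:
        match = VOLUME_ID.search(name)
        if match:
            # Partition entries such as "...-part1" collapse onto the same ID.
            found.add(match.group(0))
    return found


def diff_volumes(before, after):
    """Return (attached, detached) volume IDs between two snapshots."""
    return after - before, before - after


def scan(path="/dev/disk/by-id"):
    """Snapshot the volume IDs currently visible under path."""
    try:
        return volume_ids(os.listdir(path))
    except FileNotFoundError:
        # No such directory means no volumes are visible.
        return set()
```

Calling `scan()` on each event and passing the old and new results to `diff_volumes` gives the two sets that need reporting: volumes that have appeared and volumes that have gone.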
Now to get that information to the Kubernetes API server.
Leaping for the Go compiler is always a last resort. The greatest benefits of open source are to be gained by finding something that does roughly what you need and making small incremental changes to it. Almost certainly somebody has come across the same issue before. Time spent looking around is time well invested.
The first thought was whether our Kubernetes Cloud Controller could be adjusted to report the disk changes to the API server. However, the InstanceMetadata interface is very narrow, expecting only the node IP addresses, ID and location. There’s no space for any free-form tags.
In addition, the cloud controller is a polling design that updates nodes every five minutes by default. Polling when there are events available would be a very unsatisfactory solution.
Fortunately, Kubernetes provides a device plugin framework that can advertise system resources to the kubelet component. Implementing this interface allows the volumes to be specified as a resource restriction. Kubernetes will then schedule the pod on the node with that volume attached, or will otherwise wait for the volume to be attached somewhere:
- name: with-volume-requirement
Installing the volume plugin is as simple as running a DaemonSet. The manifest can be applied directly from our repo:
$ kubectl apply -f https://raw.githubusercontent.com/brightbox/brightbox-volume-device-plugin/main/daemonset.yaml
Unfortunately, the plugin API within Kubernetes doesn’t have a mechanism to remove a volume entry once it has been created! So instead, when a volume is detached from a node we set the capacity for it to zero.
The effect is the same, but it does look a bit messy if volumes are moved around nodes a lot.
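The bookkeeping behind that workaround can be sketched in a few lines of Python. This is a simplified model, not the plugin's actual code: the dictionary, the function names and the capacity of 1 for an attached volume are all assumptions made for illustration.

```python
# Hypothetical bookkeeping: volume ID -> capacity advertised for this node.
# Assumption: an attached volume is advertised with a capacity of 1.


def record_attach(capacities, volume_id):
    """Return a copy of capacities with the volume advertised as available."""
    updated = dict(capacities)
    updated[volume_id] = 1
    return updated


def record_detach(capacities, volume_id):
    """Return a copy with the volume zeroed rather than deleted.

    The entry stays behind because an advertised resource can't be withdrawn.
    """
    updated = dict(capacities)
    updated[volume_id] = 0
    return updated


def schedulable(capacities, volume_id):
    """A pod requesting the volume only fits where the capacity is non-zero."""
    return capacities.get(volume_id, 0) > 0
```

A volume that has passed through a node leaves a zero entry behind, which is where the mess comes from. A zero entry and a missing entry behave the same way as far as scheduling goes.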
In addition, Kubernetes sees custom resources as ‘compressible’, which means that if the volume is removed while a pod is using it, then kubelet won’t evict that pod in the way it would if the node ran short of memory or temporary disk space.
This is less of an issue than it sounds, as usually the volume is moved only when there is planned maintenance or an unexpected failure. But you can also set up a liveness probe for your pod to check that storage is available, as Kubernetes will kill a pod when one of its liveness probes fails. Then the scheduler will be invoked to allocate the new pod to the right node.
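As a rough illustration, an exec-style liveness probe only needs a command that exits with zero when things are healthy and non-zero when they are not. The Python sketch below checks that a mount path can still be listed. The `/data` default and the function names are hypothetical, and what counts as "available" will depend on how your pod actually consumes the volume.

```python
import os


def storage_available(path):
    """Return True if path is a directory whose contents can be listed."""
    try:
        os.listdir(path)
    except OSError:
        # Covers a missing path, a non-directory, and I/O or permission errors.
        return False
    return True


def main(argv):
    """Return the probe's exit code: 0 if healthy, 1 if storage is missing."""
    # Hypothetical default mount path; pass the real one as the first argument.
    mount_path = argv[1] if len(argv) > 1 else "/data"
    return 0 if storage_available(mount_path) else 1
```

To use it as a probe command, finish the script with `raise SystemExit(main(sys.argv))` after importing `sys`, so the return value becomes the process exit code.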
Hopefully this short journal gives a flavour of how feature enhancements happen at Brightbox. Generally it starts with an issue sufficiently irritating to require attention, followed by a search to see if the appropriate backscratcher already exists. If not then the closest match will be modified to fit.
We prefer to build from itch rather than from scratch.