
Kubernetes and Block Storage Volumes

Back in June we announced Block Storage Volumes which (amongst other things) can make running Kubernetes easier by allowing storage to be moved between servers in a cluster.

However, to allow the Kubernetes scheduler to make good decisions about where to run pods, it needs to know which node has which volume attached, and to be told when that changes.

So how does Kubernetes learn about devices? We tried a few different approaches and hit a few challenges I thought might be interesting to talk about.

First try: the Easy Option(tm)

The first thought was to check whether Kubernetes picked up device changes automatically. There was nothing in the documentation to suggest such a facility existed, and watching the kubelet log at a higher log level confirmed that kubelet was completely indifferent to disk attach and detach events.

Can we pick up inotify events in Kubernetes pods?

With the easy option ruled out it was time to look into how to detect disk changes and transmit them to Kubernetes.

The solution would require a DaemonSet with read-only access to the /dev/disk host path. That would create a pod on every node, waiting for disks to attach and detach. Block Storage Volumes are easy to identify because their volume ID appears in the /dev/disk/by-id directory, and the changes should show up via inotify events. One quick container wrapped around incrond later and the concept was proved: the disk change events showed up within a pod on Kubernetes.
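As a rough illustration, a proof-of-concept DaemonSet along these lines only needs a read-only hostPath mount of /dev/disk and something inside the container watching it with inotify. The image and watch command below are illustrative, not the actual proof-of-concept container:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disk-watcher
spec:
  selector:
    matchLabels:
      app: disk-watcher
  template:
    metadata:
      labels:
        app: disk-watcher
    spec:
      containers:
      - name: disk-watcher
        # illustrative image: anything with inotify-tools available
        image: alpine:3
        command:
        - sh
        - -c
        - apk add --no-cache inotify-tools &&
          inotifywait -m -e create -e delete /dev/disk/by-id
        volumeMounts:
        - name: disk
          mountPath: /dev/disk
          readOnly: true
      volumes:
      - name: disk
        hostPath:
          path: /dev/disk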

Now to get that information to the Kubernetes API server.

Enter Volume Devices

Leaping for the Go compiler is always a last resort. The greatest benefits of open source are to be gained by finding something that does roughly what you need and making small incremental changes to it. Almost certainly somebody has come across the same issue before. Time spent looking around is time well invested.

The first thought was whether our Kubernetes Cloud Controller could be adjusted to report the disk changes to the API server. However, the InstanceMetadata interface is very narrow, expecting only the node IP addresses, ID and location. There's no space for any free-form tags.

In addition, the cloud controller is a polling design that updates nodes every five minutes by default. Polling when there are events available would be a very unsatisfactory solution.

Kubernetes Device Plugin Framework

Fortunately Kubernetes provides a device plugin framework that can advertise system resources to the kubelet component. Implementing this interface allows the volumes to be specified as a resource restriction. Kubernetes will then schedule the pod on the node with that volume attached, or will otherwise wait for the volume to be attached somewhere:

apiVersion: v1
kind: Pod
metadata:
  name: with-volume-requirement
spec:
  containers:
  - name: with-volume-requirement
    image: k8s.gcr.io/pause:2.0
    resources:
      limits:
        volumes.brightbox.com/vol-qsk4v: 1

Installing the volume plugin is as simple as running a DaemonSet - the manifest can be applied directly from our repo:

$ kubectl apply -f https://raw.githubusercontent.com/brightbox/brightbox-volume-device-plugin/main/daemonset.yaml
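Once the plugin pods are running, the advertised volumes should show up in each node's capacity. The output below is abridged and illustrative, reusing the volume ID from the earlier example:

$ kubectl describe node <node-name>
...
Capacity:
  cpu:                              2
  memory:                           4030804Ki
  volumes.brightbox.com/vol-qsk4v:  1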

Caveats

Unfortunately, the plugin API within Kubernetes doesn't have a mechanism to remove a volume entry once it has been created! So instead, when a volume is detached from a node, we set its capacity to zero.

The effect is the same, but it does look a bit messy if volumes are moved around nodes a lot.
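For example, after vol-qsk4v has been detached, the old node keeps a lingering entry (again, illustrative output):

Capacity:
  volumes.brightbox.com/vol-qsk4v:  0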

In addition, Kubernetes sees custom resources as ‘compressible’, which means that if the volume is removed while a pod is using it, then kubelet won’t evict that pod in the way it would if the node ran short of memory or temporary disk space.

This is less of an issue than it sounds, as usually the volume is moved only when there is planned maintenance or an unexpected failure. But you can also set up a liveness probe for your pod to check that the storage is available, as Kubernetes will kill a pod when one of its liveness probes fails. The scheduler will then be invoked to allocate the new pod to the right node.
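A minimal sketch of such a probe, assuming the volume is mounted at /data inside the container (the path and timings here are illustrative), would be an exec probe added to the container spec from the earlier example:

    livenessProbe:
      exec:
        command:
        - sh
        - -c
        # fails, and restarts the container, if the volume is gone or read-only
        - touch /data/.storage-check
      periodSeconds: 30
      failureThreshold: 2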

Epilogue

Hopefully this short journal gives a flavour of how feature enhancements happen at Brightbox. Generally it starts with an issue sufficiently irritating to require attention, followed by a search to see if the appropriate backscratcher already exists. If not, then the closest match will be modified to fit.

We prefer to build from itch rather than from scratch.

Give it a go

We’ve got a guide to installing Kubernetes on Brightbox, and you can sign up in 2 minutes to get a £50 credit so you can experiment with Kubernetes for free!
