Siim Tiilen, a Quality Engineer in our DevOps team, explains practically how Veriff shares GPUs between pods and how it's helped us reduce our overall infrastructure cost.
Siim Tiilen, May 24th, 2021
Due to the ever-increasing importance of AI in our stack, we use graphics processing units (GPUs) extensively in Kubernetes to run various machine learning (ML) workloads. In this blog post, I’ll describe how we have been sharing GPUs between pods for the last 2 years to massively reduce our infrastructure cost.
When using NVIDIA GPUs, you need the NVIDIA device plugin for Kubernetes, which declares a new custom resource, nvidia.com/gpu, that you can use to assign GPUs to Kubernetes pods. The issue with this approach is that a GPU cannot be split between multiple applications (pods), and GPUs are a very expensive resource.
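The original manifest is not reproduced here, so the following is only a minimal sketch of a pod that requests one whole GPU through this resource. The pod name gpu-example matches the one used later in this post; the image is a hypothetical placeholder for your own GPU application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: gpu-example
      # Hypothetical placeholder: replace with your own GPU-using image.
      image: example.com/your-gpu-app:latest
      resources:
        limits:
          # Extended resources take whole integers only, so the pod
          # gets one entire GPU, however little of it the app uses.
          nvidia.com/gpu: 1
```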
So what happens when you deploy an application that requests a GPU this way? Kubernetes uses the nvidia.com/gpu resource to schedule the pod onto a node where a GPU is available.
If you connect (ssh) to the node where this pod is running and run the nvidia-smi command there, it lists the node's GPUs, their memory usage, and the processes using them.
In our case, the output shows that the node has 1 NVIDIA Tesla T4 GPU with 15109MiB of memory, of which 104MiB is used by a single process (our deployed pod).
Internally, each pod is given one very important environment variable, NVIDIA_VISIBLE_DEVICES, which the application uses to know which GPUs to use. To see it, open a shell in the pod:
kubectl exec -it pod/gpu-example -- bash
# and check
echo $NVIDIA_VISIBLE_DEVICES
In the example above, the application is using the GPU with the ID GPU-93955ff6-1bbe-3f6d-8d58-a2104edb62db. On a node with multiple GPUs, all pods can actually access all GPUs, but each pod uses this variable to know which GPU it should access.
This variable also has one "magical" value: all. Using this you can override GPU allocation and let your application know that it can use any (all) GPUs present in your node.
We know that our example application uses 104MiB and the GPU has a total of 15109MiB, so in theory we can fit it 145 times into the same GPU (15109 / 104 ≈ 145.3).
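For the next example you need an AWS EKS cluster with 1 GPU instance (g4dn.xlarge); with some modifications, it should also work in any Kubernetes cluster that has NVIDIA GPU nodes available. The idea is to deploy 5 replicas of our application that rely on NVIDIA_VISIBLE_DEVICES=all instead of each requesting a whole nvidia.com/gpu.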
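A deployment along these lines would match that setup. This is a sketch, not necessarily the manifest from the original post: the image is a hypothetical placeholder, and it assumes the NVIDIA container runtime is the default runtime on the GPU node, so that setting the variable is enough to expose the GPU to the container.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
spec:
  replicas: 5
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      # No nvidia.com/gpu request here, so the scheduler does not know
      # these pods need a GPU. Pin them to the GPU instance type instead.
      nodeSelector:
        node.kubernetes.io/instance-type: g4dn.xlarge
      containers:
        - name: gpu-example
          # Hypothetical placeholder: replace with your own GPU-using image.
          image: example.com/your-gpu-app:latest
          env:
            # Overrides GPU allocation: the container may use any GPU on the node.
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
```

Because nothing is requested from the device plugin, Kubernetes no longer limits how many of these pods land on one GPU.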
If we check the pods after deploying, we can see that all 5 of them are assigned to the same node.
And when we run nvidia-smi on this node, we can see 5 GPU processes running from those pods.
Now that we know multiple GPU-using pods can run on the same node, we need to make sure they are spread across nodes based on each node's capacity to handle the pods’ requirements. For this we can use custom resources (extended resources, in Kubernetes terms) to advertise how much GPU memory each GPU node has available.
We will use a DaemonSet to add a custom GPU memory resource (we named it veriff.com/gpu-memory) to all nodes with an NVIDIA GPU attached. A DaemonSet is a special type of Kubernetes workload that runs 1 pod on some (or all) cluster nodes; it is quite often used to deploy things like log collectors and monitoring tooling. Our DaemonSet will have nodeAffinity to make sure it only runs on g4dn.xlarge nodes.
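One way such a DaemonSet could look is sketched below. It is not necessarily the manifest from the original post. It uses the approach from the Kubernetes documentation for advertising extended resources: a JSON patch against the node's status, where ~1 encodes the / in the resource name. The gpu-memory-advertiser names, the kube-system namespace, the curlimages/curl image, and the hard-coded value of 15109 (MiB, the T4 figure reported by nvidia-smi above) are all assumptions made for illustration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-memory-advertiser        # hypothetical name
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-memory-advertiser
rules:
  # Only what the script needs: patching the status of nodes.
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-memory-advertiser
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-memory-advertiser
subjects:
  - kind: ServiceAccount
    name: gpu-memory-advertiser
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-memory-advertiser
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-memory-advertiser
  template:
    metadata:
      labels:
        app: gpu-memory-advertiser
    spec:
      serviceAccountName: gpu-memory-advertiser
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Run only on g4dn.xlarge nodes.
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["g4dn.xlarge"]
      containers:
        - name: advertise
          image: curlimages/curl     # assumption: any image with sh and curl works
          env:
            # The node this pod runs on, injected through the downward API.
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          command: ["/bin/sh", "-c"]
          args:
            - |
              SA=/var/run/secrets/kubernetes.io/serviceaccount
              # Re-apply the patch periodically so the capacity stays advertised.
              while true; do
                curl -sS -o /dev/null --cacert "$SA/ca.crt" \
                  -H "Authorization: Bearer $(cat "$SA/token")" \
                  -H "Content-Type: application/json-patch+json" \
                  -X PATCH \
                  --data '[{"op":"add","path":"/status/capacity/veriff.com~1gpu-memory","value":"15109"}]' \
                  "https://kubernetes.default.svc/api/v1/nodes/${NODE_NAME}/status"
                sleep 300
              done
```

Hard-coding the capacity is only reasonable here because the nodeAffinity restricts the DaemonSet to a single instance type with a known GPU.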
If you check your node using “kubectl describe node/NAME”, you can see that it has the veriff.com/gpu-memory resource available.
Now that we know our GPU memory script works, let's scale the cluster up so that we have multiple GPU nodes available.
In order to test it out, we will modify our deployment to use this new resource.
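The modified deployment could look roughly like this. It is again a sketch: the image is a placeholder, and the request of 5000 is a hypothetical value chosen so that only 3 pods fit on a node advertising 15109, which forces 5 replicas onto at least 2 nodes. It assumes the advertised resource is counted in MiB.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
spec:
  replicas: 5
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      containers:
        - name: gpu-example
          # Hypothetical placeholder: replace with your own GPU-using image.
          image: example.com/your-gpu-app:latest
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          resources:
            limits:
              # Extended resources must be whole integers. Only nodes that
              # advertise veriff.com/gpu-memory can run this pod, so no
              # nodeSelector is needed.
              veriff.com/gpu-memory: 5000   # hypothetical amount
```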
After that, we can see that the pods are split between nodes.
With regular resources like CPU and memory, Kubernetes knows how much pods (containers) are actually using, and if one tries to use more than its limit, Kubernetes restricts it. With our new custom resource, however, there is no safeguard to stop some “evil” pod from taking more GPU memory than it declared, so you need to be very careful when setting these resources. When pods try to use more GPU memory than a node has available, it usually results in some very ugly crashes.
At Veriff, we resolved this issue with extensive monitoring and alerting on the usage of our new GPU custom resources. We have also developed in-house tooling to measure ML applications' GPU usage under load.
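The post does not say which monitoring stack Veriff uses. As one possible illustration only, here is a Prometheus alerting rule, assuming GPU metrics are scraped from NVIDIA's dcgm-exporter (which exposes DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE, the used and free framebuffer memory). The rule name, threshold, and duration are hypothetical.

```yaml
groups:
  - name: gpu-memory
    rules:
      - alert: GpuMemoryNearlyFull   # hypothetical rule name
        # Fraction of GPU memory in use, per GPU.
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory usage above 90% on {{ $labels.instance }}"
```

An alert like this catches a GPU filling up, but it does not tell you which pod exceeded its declared veriff.com/gpu-memory; that needs per-process measurement.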
All code examples from this post can be found in https://github.com/Veriff/gpu-sharing-examples.