Running RDMA in Containers on Kubernetes and Benchmarking Performance
I got RDMA working in containers on Kubernetes using SR-IOV, so I’m documenting the process including the issues I ran into. I also benchmarked the performance of RDMA vs TCP/IP.
What We’ll Do
- Create SR-IOV VFs (Virtual Functions)
- Assign VFs to containers on Kubernetes
- Add RDMA devices to containers on Kubernetes
- Benchmark RDMA vs TCP/IP performance
Environment
A Kubernetes cluster is built on two Ubuntu machines (AMD EPYC 7282 16-Core Processor) directly connected via Mellanox Technologies MT27800 Family [ConnectX-5] (a RoCEv2-capable 100 Gbps Ethernet NIC). Kubernetes is configured to use this NIC for inter-node communication. The container runtime is containerd.
SR-IOV
We’ll create one VF from the PF (Physical Function). Follow the official documentation1.
First, check the BIOS settings to ensure SR-IOV and IOMMU are enabled. Then add the following kernel parameters in /etc/default/grub, apply them with update-grub, and reboot:
$ cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt pci=realloc"
...
Modify the NIC parameters:
$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:41:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q
Device #1:
----------
Device type: ConnectX5
Name: MCX515A-CCA_Ax_Bx
Description: ConnectX-5 EN network interface card; 100GbE single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device: /dev/mst/mt4119_pciconf0
Configurations: Next Boot
...
NUM_OF_VFS 0
SRIOV_EN False(0)
...
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=1
...
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q
Device #1:
----------
Device type: ConnectX5
Name: MCX515A-CCA_Ax_Bx
Description: ConnectX-5 EN network interface card; 100GbE single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device: /dev/mst/mt4119_pciconf0
Configurations: Next Boot
...
NUM_OF_VFS 1
SRIOV_EN True(1)
...
After rebooting, run the following:
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 abcdef0300ghijkl
$ echo 1 | sudo tee /sys/class/net/enp65s0np0/device/sriov_numvfs
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 abcdef0300ghijkl
mlx5_1 0000000000000000
$ ip link
...
4: enp65s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether ab:cd:ef:gh:ij:kl brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
...
9: enp65s0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether lm:no:pq:rs:tu:vw brd ff:ff:ff:ff:ff:ff permaddr 01:23:45:67:89:01
...
At this point, we can confirm that mlx5_1 has been created. TCP/IP communication using the VF is now possible (once you assign an IP address). However, the VF’s GUID is currently 0. There is a known issue where RDMA doesn’t work in this state:
# https://docs.nvidia.com/networking/display/mlnxofedv451010/known+issues
1047616 | Description: When node GUID of a device is set to zero (0000:0000:0000:0000), RDMA_CM user space application may crash.
Workaround: Set node GUID to a nonzero value.
Keywords: RDMA_CM
Set the GUID to a non-zero value. I’m not sure what the correct value should be, but since the physical NIC’s GUID is formed by inserting 0300 in the middle of the MAC address, I followed the same pattern:
$ echo lm:no:pq:03:00:rs:tu:vw | sudo tee /sys/class/net/enp65s0np0/device/sriov/0/node
$ echo 0000:41:00.1 | sudo tee /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:41:00.1 | sudo tee /sys/bus/pci/drivers/mlx5_core/bind
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 abcdef0300ghijkl
mlx5_1 lmnopq0300rstuvw
At this point, the SR-IOV VF is created and ready for RDMA.
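The GUID construction can be captured in a small helper. This is only a sketch: that 03:00 is the right filler is an assumption carried over from observing the PF's GUID, not a documented rule.

```shell
# Derive a node GUID from a MAC address by inserting 03:00 in the middle,
# mirroring the pattern observed on the PF (an assumption, not a documented rule).
mac_to_guid() {
    first=${1%:??:??:??}    # leading three octets, e.g. "ab:cd:ef"
    last=${1#??:??:??:}     # trailing three octets, e.g. "gh:ij:kl"
    echo "$first:03:00:$last"
}
# mac_to_guid 0c:42:a1:12:34:56  ->  0c:42:a1:03:00:12:34:56
```

The result can be fed to sudo tee as in the transcript above.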
Containers and SR-IOV
Now let’s assign the VF we created to a container. The proper approach would be to use the SR-IOV CNI plugin2 and Multus3, but here we’ll do it manually with nerdctl and ip commands.
We’ll proceed assuming you already have a Pod running on Kubernetes. First, find the network namespace and make it manageable with the ip command:
$ NAME=testtest
$ CONTAINERID=$(kubectl get pod $NAME -o json | jq -r '."status"."containerStatuses"[0]."containerID"' | sed 's/containerd:\/\///')
$ PID=$(sudo nerdctl --namespace k8s.io inspect $CONTAINERID --format '{{.State.Pid}}')
$ NETNSNAME=k8s-$NAME
$ sudo ln -s /proc/$PID/ns/net /var/run/netns/$NETNSNAME
Placing a file under /var/run/netns makes the namespace manageable with the ip command (create the directory first if it doesn’t exist).
Move the VF from the host’s network namespace to the container’s network namespace:
$ ip netns
k8s-testtest
$ VFLINKNAME=enp65s0v0
$ VFLINKADDR=192.168.0.1/24
$ sudo ip link set dev $VFLINKNAME netns $NETNSNAME
$ sudo ip -n $NETNSNAME link set dev $VFLINKNAME up
$ sudo ip -n $NETNSNAME addr add $VFLINKADDR dev $VFLINKNAME
Now the container can directly use the VF.
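For repeat runs, the steps above can be collapsed into one helper. This is a sketch under the assumptions noted in the comments; it uses kubectl's jsonpath output instead of jq, which is equivalent here.

```shell
#!/bin/sh
# Sketch: move an SR-IOV VF netdev into a pod's network namespace.
# Assumptions: containerd visible under nerdctl's k8s.io namespace,
# root via sudo, and the pod/VF/address names from the walkthrough above.
move_vf_to_netns() {
    name=$1; vf=$2; addr=$3
    # containerID comes back as "containerd://<id>"; strip the runtime prefix
    cid=$(kubectl get pod "$name" -o jsonpath='{.status.containerStatuses[0].containerID}')
    cid=${cid#*://}
    pid=$(sudo nerdctl --namespace k8s.io inspect "$cid" --format '{{.State.Pid}}')
    netns=k8s-$name
    sudo mkdir -p /var/run/netns
    sudo ln -sf "/proc/$pid/ns/net" "/var/run/netns/$netns"
    sudo ip link set dev "$vf" netns "$netns"
    sudo ip -n "$netns" link set dev "$vf" up
    sudo ip -n "$netns" addr add "$addr" dev "$vf"
}
# usage: move_vf_to_netns testtest enp65s0v0 192.168.0.1/24
```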
Containers and RDMA
To use RDMA from a container, you need to make the devices under /dev/infiniband visible to the container.
If the container has privileged access, you can simply mount it from the Pod manifest4.
If you don’t want to grant privileges, you need to assign devices to the container using the Kubelet Device API.
Using smarter-device-manager5, you can treat devices as Kubernetes resources, similar to CPU and memory.
First, add the DaemonSet following the official sample:
# https://gitlab.com/arm-research/smarter/smarter-device-manager/-/blob/master/smarter-device-manager-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: device-manager
  labels:
    name: smarter-device-manager
    role: agent
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: smarter-device-manager
      annotations:
        node.kubernetes.io/bootstrap-checkpoint: "true"
    spec:
      priorityClassName: "system-node-critical"
      hostname: smarter-device-management
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: smarter-device-manager
          image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.1.2
          imagePullPolicy: IfNotPresent
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          resources:
            limits:
              cpu: 100m
              memory: 15Mi
            requests:
              cpu: 10m
              memory: 15Mi
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev-dir
              mountPath: /dev
            - name: sys-dir
              mountPath: /sys
            - name: config
              mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager
Next, specify /dev/infiniband in the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: device-manager
data:
  conf.yaml: |
    - devicematch: infiniband
      nummaxdevices: 20
At this point, the Node manifest looks like this:
status:
  allocatable:
    cpu: "32"
    ephemeral-storage: "885012522772"
    hugepages-1Gi: 8Gi
    hugepages-2Mi: "0"
    memory: 123272716Ki
    pods: "110"
    smarter-devices/infiniband: "20"
  capacity:
    cpu: "32"
    ephemeral-storage: 960300048Ki
    hugepages-1Gi: 8Gi
    hugepages-2Mi: "0"
    memory: 131763724Ki
    pods: "110"
    smarter-devices/infiniband: "20"
Then add it as a request in the Pod manifest:
spec:
  containers:
    - ...
      resources:
        limits:
          smarter-devices/infiniband: 1
        requests:
          smarter-devices/infiniband: 1
Now the container can use RDMA.
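Putting it together, a minimal Pod manifest might look like the following sketch. The name, image, and command are placeholders; the resources section carrying the smarter-devices/infiniband request is the essential part.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test            # placeholder name
spec:
  containers:
    - name: rdma-test
      image: ubuntu:22.04    # placeholder; use an image with rdma-core/perftest installed
      command: ["sleep", "infinity"]
      resources:
        limits:
          smarter-devices/infiniband: 1
        requests:
          smarter-devices/infiniband: 1
```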
By the way, when starting a container with nerdctl without Kubernetes, you can simply add the option --device=/dev/infiniband (the same should work for Docker).
Performance Benchmarking
Using Netperf6, we measured throughput, latency, and CPU time; connect & close time was measured with custom code. To run unmodified socket-API programs over RDMA, we LD_PRELOADed rsocket7.
- host-remote-tcp: TCP/IP for host-to-host communication
- host-remote-roce: RoCEv2 for host-to-host communication
- flannel-remote-tcp: TCP/IP between containers on different hosts using Flannel as the CNI Plugin
- sriov-remote-roce: RoCEv2 between containers on different hosts using SR-IOV VFs
Detailed descriptions of other conditions are omitted.
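For the RoCE cases, netperf itself ran unmodified over rsockets via LD_PRELOAD. A sketch of such an invocation follows; the preload library path varies by distro and is an assumption, and the target address is the VF address assigned earlier.

```shell
# Assumed install path of the rsockets preload library (shipped with rdma-core);
# adjust for your distro.
RSPRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so
if [ -f "$RSPRELOAD" ]; then
    # librspreload intercepts socket calls and redirects them over rsockets/RDMA
    LD_PRELOAD=$RSPRELOAD netperf -H 192.168.0.1 -t TCP_STREAM
else
    echo "librspreload.so not found; skipping" >&2
fi
```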
RoCEv2 outperforms TCP/IP in throughput, latency, and CPU time. On the other hand, connect & close time is significantly higher with RoCEv2.
Conclusion
Despite various constraints in practical use, RDMA lives up to its reputation.
https://enterprise-support.nvidia.com/s/article/howto-configure-sr-iov-for-connect-ib-connectx-4-with-kvm--infiniband-x#jive_content_id_II_Enable_SRIOV_on_the_MLNX_OFED_driver ↩︎
https://github.com/kubernetes/kubernetes/issues/5607#issuecomment-766089905 ↩︎
https://gitlab.com/arm-research/smarter/smarter-device-manager ↩︎
