I got RDMA working in containers on Kubernetes using SR-IOV, so I’m documenting the process, including the issues I ran into. I also benchmarked RDMA against TCP/IP.

The Japanese version of this article is available here.

What We’ll Do

  • Create SR-IOV VFs (Virtual Functions)
  • Assign VFs to containers on Kubernetes
  • Add RDMA devices to containers on Kubernetes
  • Benchmark RDMA vs TCP/IP performance

Environment

A Kubernetes cluster is built on two Ubuntu machines (AMD EPYC 7282 16-Core Processor) directly connected via Mellanox Technologies MT27800 Family [ConnectX-5] (a RoCEv2-capable 100 Gbps Ethernet NIC). Kubernetes is configured to use this NIC for inter-node communication. The container runtime is containerd.

SR-IOV

We’ll create one VF from the PF (Physical Function). Follow the official documentation1.

First, check the BIOS settings to ensure SR-IOV and IOMMU are enabled. Then add the following kernel parameters to /etc/default/grub, apply them with sudo update-grub, and reboot:

$ cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt pci=realloc"
...

Modify the NIC parameters:

$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

$ sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:41:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q

Device #1:
----------

Device type:        ConnectX5
Name:               MCX515A-CCA_Ax_Bx
Description:        ConnectX-5 EN network interface card; 100GbE single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device:             /dev/mst/mt4119_pciconf0

Configurations:                                          Next Boot
...
        NUM_OF_VFS                                  0
        SRIOV_EN                                    False(0)
...

$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=1
...
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q

Device #1:
----------

Device type:        ConnectX5
Name:               MCX515A-CCA_Ax_Bx
Description:        ConnectX-5 EN network interface card; 100GbE single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
Device:             /dev/mst/mt4119_pciconf0

Configurations:                                          Next Boot
...
        NUM_OF_VFS                                  1
        SRIOV_EN                                    True(1)
...

After rebooting, run the following:

$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              abcdef0300ghijkl
$ echo 1 | sudo tee /sys/class/net/enp65s0np0/device/sriov_numvfs
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              abcdef0300ghijkl
    mlx5_1              0000000000000000
$ ip link
...
4: enp65s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether ab:cd:ef:gh:ij:kl brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
...
9: enp65s0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether lm:no:pq:rs:tu:vw brd ff:ff:ff:ff:ff:ff permaddr 01:23:45:67:89:01
...

At this point, we can confirm that mlx5_1 has been created. TCP/IP communication over the VF is now possible (once you assign an IP address). However, the VF’s node GUID is currently all zeros, and there is a known issue where RDMA doesn’t work in this state:

# https://docs.nvidia.com/networking/display/mlnxofedv451010/known+issues
1047616 | Description: When node GUID of a device is set to zero (0000:0000:0000:0000), RDMA_CM user space application may crash.
Workaround: Set node GUID to a nonzero value.
Keywords: RDMA_CM

Set the GUID to a non-zero value. I’m not sure what the correct value should be, but since the physical NIC’s GUID is formed by inserting 0300 in the middle of the MAC address, I followed the same pattern:

$ echo lm:no:pq:03:00:rs:tu:vw | sudo tee /sys/class/net/enp65s0np0/device/sriov/0/node
$ echo 0000:41:00.1 | sudo tee /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:41:00.1 | sudo tee /sys/bus/pci/drivers/mlx5_core/bind
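
The MAC-to-GUID conversion used above can be expressed as a small helper. This is just a sketch of the pattern; mac_to_guid is a hypothetical name and the MAC below is a placeholder:

```shell
# Build a node GUID from a MAC address by inserting 03:00 in the middle,
# mirroring the pattern used by the physical NIC's GUID.
mac_to_guid() {
  echo "$1" | sed 's/^\(..:..:..\):\(..:..:..\)$/\1:03:00:\2/'
}

mac_to_guid aa:bb:cc:dd:ee:ff   # prints aa:bb:cc:03:00:dd:ee:ff
```

The result can then be piped into sudo tee against the VF’s sriov/0/node sysfs entry, as shown above.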

$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              abcdef0300ghijkl
    mlx5_1              lmnopq0300rstuvw

At this point, the SR-IOV VF is created and ready for RDMA.
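
Before involving containers, it can be worth checking RDMA end-to-end between the hosts. One quick way is the rping tool from librdmacm (package rdmacm-utils on Ubuntu); the address below is a placeholder for an IP assigned to the RoCE interface:

```shell
# On one host: start an RDMA_CM ping-pong server
rping -s -a 192.168.0.1

# On the other host: run one iteration as a client, with verbose output
rping -c -a 192.168.0.1 -C 1 -v
```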

Containers and SR-IOV

Now let’s assign the VF we created to a container. The proper approach would be to use the SR-IOV CNI plugin2 together with Multus3, but this time we’ll do it manually with the nerdctl and ip commands.

We’ll proceed assuming you already have a Pod running on Kubernetes. First, find the network namespace and make it manageable with the ip command:

$ NAME=testtest
$ CONTAINERID=$(kubectl get pod $NAME -o json | jq -r '."status"."containerStatuses"[0]."containerID"' | sed 's/containerd:\/\///')
$ PID=$(sudo nerdctl --namespace k8s.io inspect $CONTAINERID --format '{{.State.Pid}}')
$ NETNSNAME=k8s-$NAME
$ sudo mkdir -p /var/run/netns
$ sudo ln -s /proc/$PID/ns/net /var/run/netns/$NETNSNAME

Creating a symlink under /var/run/netns makes the namespace visible to the ip command. Next, move the VF from the host’s network namespace to the container’s network namespace:

$ ip netns
k8s-testtest
$ VFLINKNAME=enp65s0v0
$ VFLINKADDR=192.168.0.1/24
$ sudo ip link set dev $VFLINKNAME netns $NETNSNAME
$ sudo ip -n $NETNSNAME link set dev $VFLINKNAME up
$ sudo ip -n $NETNSNAME addr add $VFLINKADDR dev $VFLINKNAME

Now the container can directly use the VF.
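
To verify the move, you can run commands inside the namespace. The peer address is a placeholder; it assumes the VF on the other node was given 192.168.0.2/24 the same way:

```shell
# The VF should now appear inside the Pod's network namespace
sudo ip netns exec k8s-testtest ip addr show enp65s0v0

# TCP/IP sanity check against the VF on the other node
sudo ip netns exec k8s-testtest ping -c 3 192.168.0.2
```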

Containers and RDMA

To use RDMA from a container, you need to make the devices under /dev/infiniband visible to the container. If the container has privileged access, you can simply mount it from the Pod manifest4. If you don’t want to grant privileges, you need to assign devices to the container using the Kubelet Device API. Using smarter-device-manager5, you can treat devices as Kubernetes resources, similar to CPU and memory.

First, add the DaemonSet following the official sample:

# https://gitlab.com/arm-research/smarter/smarter-device-manager/-/blob/master/smarter-device-manager-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: device-manager
  labels:
    name: smarter-device-manager
    role: agent
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: smarter-device-manager
      annotations:
        node.kubernetes.io/bootstrap-checkpoint: "true"
    spec:
      priorityClassName: "system-node-critical"
      hostname: smarter-device-management
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: smarter-device-manager
        image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.1.2
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        resources:
          limits:
            cpu: 100m
            memory: 15Mi
          requests:
            cpu: 10m
            memory: 15Mi
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: dev-dir
            mountPath: /dev
          - name: sys-dir
            mountPath: /sys
          - name: config
            mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager

Next, specify /dev/infiniband in the ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: device-manager
data:
  conf.yaml: |
    - devicematch: infiniband
      nummaxdevices: 20    

At this point, the Node manifest looks like this:

status:
  allocatable:
    cpu: "32"
    ephemeral-storage: "885012522772"
    hugepages-1Gi: 8Gi
    hugepages-2Mi: "0"
    memory: 123272716Ki
    pods: "110"
    smarter-devices/infiniband: "20"
  capacity:
    cpu: "32"
    ephemeral-storage: 960300048Ki
    hugepages-1Gi: 8Gi
    hugepages-2Mi: "0"
    memory: 131763724Ki
    pods: "110"
    smarter-devices/infiniband: "20"

Then add it as a request in the Pod manifest:

spec:
  containers:
  - ...
    resources:
      limits:
        smarter-devices/infiniband: 1
      requests:
        smarter-devices/infiniband: 1

Now the container can use RDMA.
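
A quick way to check is to look for the devices from inside the Pod (this assumes the container image has the rdma-core utilities installed):

```shell
# The uverbs and rdma_cm device nodes should be visible
kubectl exec testtest -- ls /dev/infiniband

# The VF should enumerate as an RDMA device
kubectl exec testtest -- ibv_devices
```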

By the way, when starting a container with nerdctl without Kubernetes, you can simply add the option --device=/dev/infiniband (the same should work for Docker).

Performance Benchmarking

Using Netperf6, we measured throughput, latency, and CPU time; connect & close time was measured with custom code. To run unmodified socket-API code over RDMA, we preloaded the rsocket7 library via LD_PRELOAD.

  • host-remote-tcp: TCP/IP for host-to-host communication
  • host-remote-roce: RoCEv2 for host-to-host communication
  • flannel-remote-tcp: TCP/IP between containers on different hosts using Flannel as the CNI Plugin
  • sriov-remote-roce: RoCEv2 between containers on different hosts using SR-IOV VFs

Detailed descriptions of other conditions are omitted.
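
For reference, the rsocket runs were invoked roughly as follows. The preload library path is the rdma-core location on Ubuntu x86_64 and may differ on other distributions; the server address is a placeholder:

```shell
RSPRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so

# Server side
LD_PRELOAD=$RSPRELOAD netserver

# Client side: stream throughput over rsockets instead of kernel TCP
LD_PRELOAD=$RSPRELOAD netperf -H 192.168.0.1 -t TCP_STREAM
```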

RoCEv2 outperforms TCP/IP in throughput, latency, and CPU time. On the other hand, connect & close time is significantly higher with RoCEv2.

Conclusion

Despite various constraints in practical use, RDMA lives up to its reputation: better throughput, latency, and CPU efficiency than TCP/IP, with slower connection setup as the main trade-off.