Skip to content

Latest commit

 

History

History
340 lines (264 loc) · 13.9 KB

rdma-ib.md

File metadata and controls

340 lines (264 loc) · 13.9 KB

RDMA with Infiniband

English简体中文

Introduction

This chapter introduces how POD access network with the infiniband interface of the host.

Features

Different from RoCE, Infiniband network cards are proprietary devices for the Infiniband network, and the Spiderpool offers two CNI options:

  1. IB-SRIOV CNI provides SR-IOV network card with the RDMA device. It is suitable for workloads requiring RDMA communication.

    It offers two RDMA modes:

    • Shared mode: Pod will have a SR-IOV network interface with RDMA feature, but all RDMA devices cloud be seen by all PODs running in the same node. POD may be confused for which RDMA device it should use.

    • Exclusive mode: Pod will have a SR-IOV network interface with RDMA feature, and POD just enable to see its own RDMA device.

      For isolated RDMA network cards, at least one of the following conditions must be met:

      (1) Kernel based on 5.3.0 or newer, RDMA modules loaded in the system. rdma-core package provides means to automatically load relevant modules on system start

      (2) Mellanox OFED version 4.7 or newer is required. In this case it is not required to use a Kernel based on 5.3.0 or newer.

  2. IPoIB CNI provides an IPoIB network card for POD, without RDMA device. It is suitable for conventional applications that require TCP/IP communication, as it does not require an SRIOV network card, allowing more PODs to run on the host

RDMA based on IB-SRIOV

The following steps demonstrate how to use IB-SRIOV on a cluster with 2 nodes. It enables POD to own SR-IOV network card and the RDMA devices of isolated network namespace

  1. Ensure that the host machine has an Infiniband card installed and the driver is properly installed.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 VPI NIC.

    Follow the official NVIDIA guide to install the latest OFED driver.

    For Mellanox's VPI series network cards, you can refer to the official Switching Infiniband Mode to ensure that the network card is working in Infiniband mode.

    To confirm the presence of Inifiniband devices, use the following command:

     ~# lspci -nn | grep Infiniband
     86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    
     ~# rdma link
     link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 2 sm_lid 2 lmc 0 state ACTIVE physical_state LINK_UP
    
     ~# ibstat mlx5_0 | grep "Link layer"
     Link layer: InfiniBand
    

    Make sure that the RDMA subsystem of the host is in exclusive mode. If not, switch to shared mode.

     ~# rdma system set netns exclusive
     ~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
     ~# reboot
    
     ~# rdma system
     netns exclusive copy-on-fork on
    

    if it is expected to work under shared mode, rm /etc/modprobe.d/ib_core.conf && reboot

    (Optional) In an SR-IOV scenario, applications can enable NVIDIA's GPUDirect RDMA feature. For instructions on installing the kernel module, please refer to the official documentation.

  2. Install Spiderpool, and notice the helm options:

     helm upgrade spiderpool spiderpool/spiderpool --namespace kube-system  --reuse-values --set sriov.install=true
    
    • If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to pull image from China registry.

    Once the installation is complete, the following components will be installed:

     ~# kubectl get pod -n kube-system
     spiderpool-agent-9sllh                         1/1     Running     0          1m
     spiderpool-agent-h92bv                         1/1     Running     0          1m
     spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
     spiderpool-sriov-operator-65b59cd75d-89wtg     1/1     Running     0          1m
     spiderpool-init                                0/1     Completed   0          1m
    
  3. configure SR-IOV operator.

    Use the following commands to look up the device information of infiniband card

     ~# lspci -nn | grep Infiniband
     86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    

    The number of VFs determines how many SR-IOV network cards can be provided for PODs on a host. The network card from different manufacturers have different amount limit of VFs. For example, the Mellanox connectx5 used in this example can create up to 127 VFs.

    Apply the following configuration, and the VFs will be created on the host. Notice, this may cause the nodes to reboot, owing to taking effect the new configuration in the network card driver.

     cat <<EOF | kubectl apply -f -
     apiVersion: sriovnetwork.openshift.io/v1
     kind: SriovNetworkNodePolicy
     metadata:
       name: ib-sriov
       namespace: kube-system
     spec:
       nodeSelector:
         kubernetes.io/os: "linux"
       resourceName: mellanoxibsriov
       priority: 99
       numVfs: 12
       nicSelector:
           deviceID: "1017"
           rootDevices:
           - 0000:86:00.0
           vendor: "15b3"
       deviceType: netdevice
       isRdma: true
     EOF
    

    View the available resources on a node, including the reported RDMA device resources:

     ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
     [
       {
         "name": "10-20-1-10",
         "allocable": {
           "cpu": "40",
           "pods": "110",
           "spidernet.io/mellanoxibsriov": "12",
           ...
         }
       },
       ...
     ]  
    
  4. Create the CNI configuration of IB-SRIOV, and the ippool resource

     cat <<EOF | kubectl apply -f -
     apiVersion: spiderpool.spidernet.io/v2beta1
     kind: SpiderIPPool
     metadata:
       name: v4-91
     spec:
       gateway: 172.91.0.1
       ips:
         - 172.91.0.100-172.91.0.120
       subnet: 172.91.0.0/16
     ---
     apiVersion: spiderpool.spidernet.io/v2beta1
     kind: SpiderMultusConfig
     metadata:
       name: ib-sriov
       namespace: kube-system
     spec:
       cniType: ib-sriov
       ibsriov:
         resourceName: spidernet.io/mellanoxibsriov
         ippools:
           ipv4: ["v4-91"]
     EOF
    
  5. Following the configurations from the previous step, create a DaemonSet application that spans across nodes for testing

     ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/ib-sriov"
     RESOURCE="spidernet.io/mellanoxibsriov"
     NAME=ib-sriov
     cat <<EOF | kubectl apply -f -
     apiVersion: apps/v1
     kind: DaemonSet
     metadata:
       name: ${NAME}
       labels:
         app: $NAME
     spec:
       selector:
         matchLabels:
           app: $NAME
       template:
         metadata:
           name: $NAME
           labels:
             app: $NAME
           annotations:
             ${ANNOTATION_MULTUS}
         spec:
           containers:
           - image: docker.io/mellanox/rping-test
             imagePullPolicy: IfNotPresent
             name: mofed-test
             securityContext:
               capabilities:
                 add: [ "IPC_LOCK" ]
             resources:
               limits:
                 ${RESOURCE}: 1
             command:
             - sh
             - -c
             - |
               ls -l /dev/infiniband /sys/class/net
               sleep 1000000
     EOF
    
  6. Verify that RDMA data transmission is working correctly between the Pods across nodes.

    Open a terminal and access one Pod to launch a service:

     ~# rdma link
     link mlx5_4/1 subnet_prefix fe80:0000:0000:0000 lid 8 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
     
     # Start an RDMA service
     ~# ib_read_lat
    

    Open a terminal and access another Pod to launch a service:

     ~# rdma link
     link mlx5_8/1 subnet_prefix fe80:0000:0000:0000 lid 7 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
     
     # succeed to visit the service on the other POD
     ~# ib_read_lat 172.91.0.115
    

IPoIB

The following steps demonstrate how to use IPoIB on a cluster with 2 nodes, it enables Pod to own a regular TCP/IP network cards without RDMA device.

  1. Ensure that the host machine has an Infiniband card installed and the driver is properly installed.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 VPI NIC.

    Follow the official NVIDIA guide to install the latest OFED driver.

    For Mellanox's VPI series network cards, you can refer to the official Switching Infiniband Mode to ensure that the network card is working in Infiniband mode.

    To confirm the presence of Inifiniband devices, use the following command:

     ~# lspci -nn | grep Infiniband
     86:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    
     ~# rdma link
     link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 2 sm_lid 2 lmc 0 state ACTIVE physical_state LINK_UP
    
     ~# ibstat mlx5_0 | grep "Link layer"
     Link layer: InfiniBand
    

    Check the ipoib interface of the Inifiniband device

     ~# ip a show ibs5f0
     9: ibs5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
     link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:93:ae:10 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
     altname ibp134s0f0
     inet 172.91.0.10/16 brd 172.91.255.255 scope global ibs5f0
     valid_lft forever preferred_lft forever
     inet6 fd00:91::172:91:0:10/64 scope global
     valid_lft forever preferred_lft forever
     inet6 fe80::eaeb:d303:93:ae10/64 scope link
     valid_lft forever preferred_lft forever
    
  2. Install Spiderpool

    If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to pull image from China registry.

    Once the installation is complete, the following components will be installed:

     ~# kubectl get pod -n kube-system
     spiderpool-agent-9sllh                         1/1     Running     0          1m
     spiderpool-agent-h92bv                         1/1     Running     0          1m
     spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
     spiderpool-init                                0/1     Completed   0          1m
    
  3. Create the CNI configuration of ipoib, and the ippool. The spec.ipoib.master of SpiderMultusConfig should be set to the infiniband interface of the node.

     cat <<EOF | kubectl apply -f -
     apiVersion: spiderpool.spidernet.io/v2beta1
     kind: SpiderIPPool
     metadata:
       name: v4-91
     spec:
       gateway: 172.91.0.1
       ips:
         - 172.91.0.100-172.91.0.120
       subnet: 172.91.0.0/16
     ---
     apiVersion: spiderpool.spidernet.io/v2beta1
     kind: SpiderMultusConfig
     metadata:
       name: ipoib
       namespace: kube-system
     spec:
       cniType: ipoib
       ipoib:
         master: "ibs5f0"
         ippools:
           ipv4: ["v4-91"]
     EOF
    
  4. Following the configurations from the previous step, create a DaemonSet application that spans across nodes for testing

     ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/ipoib"
     NAME=ipoib
     cat <<EOF | kubectl apply -f -
     apiVersion: apps/v1
     kind: DaemonSet
     metadata:
       name: ${NAME}
       labels:
         app: $NAME
     spec:
       selector:
         matchLabels:
           app: $NAME
       template:
         metadata:
           name: $NAME
           labels:
             app: $NAME
           annotations:
             ${ANNOTATION_MULTUS}
         spec:
           containers:
           - image: docker.io/mellanox/rping-test
             imagePullPolicy: IfNotPresent
             name: mofed-test
             securityContext:
               capabilities:
                 add: [ "IPC_LOCK" ]
             command:
             - sh
             - -c
             - |
               ls -l /dev/infiniband /sys/class/net
               sleep 1000000
     EOF
    
  5. Verify that the network communication is correct between the PODs across nodes.

     ~# kubectl get pod -o wide
     NAME                         READY   STATUS             RESTARTS          AGE    IP             NODE         NOMINATED NODE   READINESS GATES
     ipoib-psf4q                  1/1     Running            0                 34s    172.91.0.112   10-20-1-20   <none>           <none>
     ipoib-t9hm7                  1/1     Running            0                 34s    172.91.0.116   10-20-1-10   <none>           <none>
    

    Succeed to access each other

     ~# kubectl exec -it ipoib-psf4q bash
     kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
     root@ipoib-psf4q:/# ping 172.91.0.116
     PING 172.91.0.116 (172.91.0.116) 56(84) bytes of data.
     64 bytes from 172.91.0.116: icmp_seq=1 ttl=64 time=1.10 ms
     64 bytes from 172.91.0.116: icmp_seq=2 ttl=64 time=0.235 ms