Problems encountered when connecting two virtual machines using bmv2 #1280

Open
git-liusen opened this issue Nov 21, 2024 · 6 comments

@git-liusen

Environment: three kvm virtual machines

         vm1---------------br1----------------------bmv2-vm0--------------------br2----------------vm2
        enp7s0         vnet1 vnet2           enp7s0--bmv2--enp8s0           vnet3 vnet4          enp7s0
   192.168.11.2/24                                                                           192.168.11.5/24
  • Pinging vm2 from vm1 works.
  • However, after I installed an HTTP service on vm2, curl on vm1 could not establish a TCP connection to it.
  • I used trace-cmd to trace packet processing and found that the SYN/ACK packets were being dropped because of checksum errors.
  • When I run the following command in both virtual machines to turn off checksum offload, the connection succeeds. However, latency is high, and tcpdump shows a large number of retransmitted packets.
sudo ethtool -K enp7s0 tx off rx off
image

So I have the following questions

  1. In my tests, connecting the two interfaces with a Linux bridge works without turning off checksum offload. Why do I need to turn it off when using bmv2?
  2. Do the packets processed by bmv2 pass through the network protocol stack? Can bmv2 use dpdk?
  3. Why does communication work after turning off checksum offload, but with a large number of retransmitted packets and high network latency?

I see there is an existing issue about checksum offload (#1186).
What should I do to achieve a normal network connection?
I'm looking forward to your reply!

Below is my p4 code:

/* -*- P4_16 -*- */
#include <core.p4>
#include <v1model.p4>

const bit<16> TYPE_IPV4 = 0x800;
const bit<8>  TYPE_TCP  = 6;
const bit<8>  TYPE_UDP = 17;
const bit<32> I2E_CLONE_SESSION_ID = 100;




/*************************************************************************
*********************** H E A D E R S  ***********************************
*************************************************************************/

typedef bit<9>  egressSpec_t;
typedef bit<48> macAddr_t;
typedef bit<32> ip4Addr_t;


header ethernet_t {
    macAddr_t dstAddr;
    macAddr_t srcAddr;
    bit<16>   etherType;
}

header ipv4_t {
    bit<4>    version;
    bit<4>    ihl;
    bit<8>    diffserv;
    bit<16>   totalLen;
    bit<16>   identification;
    bit<3>    flags;
    bit<13>   fragOffset;
    bit<8>    ttl;
    bit<8>    protocol;
    bit<16>   hdrChecksum;
    ip4Addr_t srcAddr;
    ip4Addr_t dstAddr;
}



header tcp_t{
    bit<16> srcPort;
    bit<16> dstPort;
    bit<32> seqNo;
    bit<32> ackNo;
    bit<4>  dataOffset;
    bit<4>  res;
    bit<1>  cwr;
    bit<1>  ece;
    bit<1>  urg;
    bit<1>  ack;
    bit<1>  psh;
    bit<1>  rst;
    bit<1>  syn;
    bit<1>  fin;
    bit<16> window;
    bit<16> checksum;
    bit<16> urgentPtr;

}

header udp_t {
    bit<16> srcPort;
    bit<16> dstPort;
    bit<16> length;
    bit<16> checksum;
}

//**************************************************************

struct learn_t {
    bit<2> digest;
    bit<48> srcAddr;
    bit<9>  ingress_port;
}

struct metadata {
    learn_t learn;
}

//***************************************************************


struct headers {
    ethernet_t   ethernet;
    ipv4_t       ipv4;
}

/*************************************************************************
*********************** P A R S E R  ***********************************
*************************************************************************/

parser MyParser(packet_in packet,
                out headers hdr,
                inout metadata meta,
                inout standard_metadata_t standard_metadata) {

    state start {
        transition parse_ethernet;
    }

    state parse_ethernet {
        packet.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            TYPE_IPV4: parse_ipv4;
            default: accept;
        }
    }

    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        transition accept;
    }
}

/*************************************************************************
************   C H E C K S U M    V E R I F I C A T I O N   *************
*************************************************************************/

control MyVerifyChecksum(inout headers hdr, inout metadata meta) {
    apply {  }
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {

    action drop() {
        mark_to_drop(standard_metadata);
    }

    action mac_learn(){
        meta.learn.srcAddr = hdr.ethernet.srcAddr;
        meta.learn.ingress_port = standard_metadata.ingress_port;
        meta.learn.digest = 2;
        digest<learn_t>(1, meta.learn);
    }

    table smac {

        key = {
            hdr.ethernet.srcAddr: exact;
        }

        actions = {
            mac_learn;
            NoAction;
        }
        size = 256;
        default_action = mac_learn;
    }

    action forward(bit<9> egress_port) {
        standard_metadata.egress_spec = egress_port;
    }

    table dmac {
        key = {
            hdr.ethernet.dstAddr: exact;
        }

        actions = {
            forward;
            NoAction;
        }
        size = 256;
        default_action = NoAction;
    }

    action set_mcast_grp(bit<16> mcast_grp) {
        standard_metadata.mcast_grp = mcast_grp;
    }

    table broadcast {
        key = {
            standard_metadata.ingress_port: exact;
        }

        actions = {
            set_mcast_grp;
            NoAction;
        }
        size = 256;
        default_action = NoAction;
    }

    apply {
        //
        smac.apply();
        if (dmac.apply().hit){
            //
        }
        else{
            broadcast.apply();
        }
    }



}

/*************************************************************************
****************  E G R E S S   P R O C E S S I N G   *******************
*************************************************************************/

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {

    }
}

/*************************************************************************
*************   C H E C K S U M    C O M P U T A T I O N   **************
*************************************************************************/

control MyComputeChecksum(inout headers  hdr, inout metadata meta) {
     apply {
        update_checksum(
        hdr.ipv4.isValid(),
            { hdr.ipv4.version,
              hdr.ipv4.ihl,
              hdr.ipv4.diffserv,
              hdr.ipv4.totalLen,
              hdr.ipv4.identification,
              hdr.ipv4.flags,
              hdr.ipv4.fragOffset,
              hdr.ipv4.ttl,
              hdr.ipv4.protocol,
              hdr.ipv4.srcAddr,
              hdr.ipv4.dstAddr },
            hdr.ipv4.hdrChecksum,
            HashAlgorithm.csum16);
    }
}

/*************************************************************************
***********************  D E P A R S E R  *******************************
*************************************************************************/

control MyDeparser(packet_out packet, in headers hdr) {
    apply {
        packet.emit(hdr.ethernet);
        packet.emit(hdr.ipv4);



    }
}

/*************************************************************************
***********************  S W I T C H  *******************************
*************************************************************************/

V1Switch(
MyParser(),
MyVerifyChecksum(),
MyIngress(),
MyEgress(),
MyComputeChecksum(),
MyDeparser()
) main;
@jafingerhut
Contributor

Others can probably provide more authoritative answers, but I believe that regarding the disabling of rx/tx checksum offload, the basic answer is as follows:

If a NIC driver tells the Linux kernel that rx and tx checksum offload are enabled, then the Linux kernel saves some CPU cycles while processing each packet, because the NIC driver is telling the kernel "you don't have to calculate these checksums, because the NIC will do them for you".

If a NIC driver tells the Linux kernel that rx and tx checksum offload are disabled, then the Linux kernel goes to the extra effort of calculating TCP and UDP checksums itself for each such packet. The extra computation is not terribly large -- it becomes most noticeable at higher network data rates, which should not be an issue in your testing.
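To make "calculating TCP checksums" concrete, here is a minimal Python sketch of the standard Internet checksum (RFC 1071) applied to a TCP segment with its IPv4 pseudo-header. It is only an illustration for checking bytes copied out of a capture, not code from the kernel or from BMv2; the function names are made up for this example.

```python
import struct


def ones_complement_sum16(data: bytes) -> int:
    """Fold data into a 16-bit one's-complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def _pseudo_header(src_ip: bytes, dst_ip: bytes, tcp_len: int) -> bytes:
    # IPv4 pseudo-header: source, destination, zero byte, protocol 6 (TCP), TCP length
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, tcp_len)


def tcp_checksum(src_ip: bytes, dst_ip: bytes, tcp_segment: bytes) -> int:
    """Checksum for a segment whose checksum field is currently zero."""
    data = _pseudo_header(src_ip, dst_ip, len(tcp_segment)) + tcp_segment
    return ~ones_complement_sum16(data) & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, tcp_segment: bytes) -> bool:
    """True if the checksum already stored in the segment is valid."""
    data = _pseudo_header(src_ip, dst_ip, len(tcp_segment)) + tcp_segment
    return ones_complement_sum16(data) == 0xFFFF
```

Here `src_ip` and `dst_ip` are the 4-byte IPv4 addresses and `tcp_segment` is the TCP header plus payload exactly as captured. Running `tcp_checksum_ok` on the same SYN/ACK taken from different capture points would show where, if anywhere, the checksum is still unfilled.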

I believe that with the virtual NICs used in the kind of setup that you have, e.g. veth pairs, the veth implementation does not implement these checksum offload features. So if the driver tells the Linux kernel that rx/tx offload are enabled, that is actually incorrect, they are not enabled. It is more truthful to disable them, so that the Linux kernel will calculate these checksums.

I do not know the reason for the high latency and retransmissions in your setup. Have you also tried disabling rx/tx checksum offload for the interfaces in the VM where the BMv2 simple_switch or simple_switch_grpc process is running?

@antoninbas
Member

I would recommend trying to disable scatter-gather (sg) on enp7s0 as well. After that you can try capturing the traffic at each interface to see if an issue shows up.
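For anyone repeating this on several interfaces, the settings suggested so far (tx, rx and sg off) can be scripted. The following is a small Python sketch that only builds and, optionally, runs `ethtool -K` commands; the helper names are hypothetical, and the interface names are the ones from this issue.

```python
import subprocess


def offload_off_commands(interfaces, features=("tx", "rx", "sg")):
    """Build one `ethtool -K <iface> <feature> off ...` argv list per interface."""
    commands = []
    for iface in interfaces:
        argv = ["ethtool", "-K", iface]
        for feature in features:
            argv += [feature, "off"]
        commands.append(argv)
    return commands


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default: just show what would be executed.
    apply_commands(offload_off_commands(["enp7s0", "enp8s0"]))
```

The current state of each feature can be inspected afterwards with `ethtool -k <iface>` (lowercase `-k`).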

Do the packets processed by bmv2 pass through the network protocol stack? Can bmv2 use dpdk?

No and no

@git-liusen
Author

But I used a bridge instead of BMv2 in the virtual machine to connect ENP7S0 and ENP8S0 together, and with that setup the VMs communicate normally without turning off checksum offload, and there are no retransmitted packets.
I tried disabling tx and rx checksum offload in the virtual machine where simple_switch_grpc runs, and also disabled sg, but that did not solve the problem.

The following figure shows the communication status when connected through a bridge
image

@antoninbas
Member

But I used a bridge instead of BMv2 in the virtual machine to connect ENP7S0 and ENP8S0 together

That's comparing apples to oranges. When you use a bridge, the traffic is handled by the Linux kernel. When you use the bmv2, all packets are sent to a userspace process (simple_switch_grpc) using raw sockets.
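To illustrate the difference, here is a rough Python sketch of how a userspace process on Linux can receive whole Ethernet frames from an interface through a raw `AF_PACKET` socket, bypassing the kernel's IP/TCP processing. This is not BMv2's actual code, only an assumption-level picture of the mechanism; opening the socket requires root (or CAP_NET_RAW), and the function names are invented for the example.

```python
import socket
import struct

ETH_P_ALL = 0x0003  # from <linux/if_ether.h>: match every protocol


def open_raw_socket(ifname: str) -> socket.socket:
    """Open a Linux AF_PACKET socket bound to one interface (needs root)."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                         socket.htons(ETH_P_ALL))
    sock.bind((ifname, 0))
    return sock


def parse_ethernet(frame: bytes):
    """Return (dst MAC, src MAC, EtherType) from a raw Ethernet frame."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst.hex(":"), src.hex(":"), ethertype


if __name__ == "__main__":
    # Print the Ethernet header of the next frame seen on enp7s0.
    with open_raw_socket("enp7s0") as sock:
        frame = sock.recv(65535)
        print(parse_ethernet(frame))
```

A process reading frames this way sees the bytes as the kernel hands them over and forwards whatever it is given; unlike a kernel bridge, it has no access to the kernel's per-packet offload state.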

@jafingerhut
Contributor

Antonin (or anyone reading this who knows), I know that there is a reliable way to see the full contents of any packet received or transmitted by the BMv2 software switch. Just add a command line option like this to the simple_switch or simple_switch_grpc command line: --dump-packet-data 10000 (the 10000 is the maximum number of bytes of each packet to print in the log).

I know you can use tcpdump or wireshark on veth interfaces to see packets going across them, but it is not clear to me when you do that whether the packet contents are shown before or after checksum calculations are done in the kernel (if they are done in the kernel at all, which they will not be if the NIC tx checksum offloading is enabled).

Having a reliable way to know the contents of the packet at multiple places along the path in a scenario like the one described in this issue would go a long way to understanding if checksumming is the problem.

Note: Even if the checksums of such packets are questionable, the presence or absence of packets shown by tcpdump/wireshark for a veth interface should be 100% accurate, at least when the packet rates are low enough that the CPU load is low.

@antoninbas
Member

Given this topology: enp7s0--bmv2--enp8s0, I would run a separate packet capture on both of these virtual interfaces to see if anything interesting shows up.
The checksum settings shouldn't really matter on enp7s0 and enp8s0, given that bmv2 uses raw sockets.
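Once the two captures exist, comparing them can be automated. The sketch below assumes the frames from each capture have already been loaded as `bytes` objects (for example with a pcap-reading library of your choice) and that the P4 program forwards frames byte-for-byte, which should hold for the L2 program in this issue as long as the IPv4 header checksum was already correct on arrival. The function name is hypothetical.

```python
from collections import Counter


def missing_frames(ingress_frames, egress_frames):
    """Return frames seen on the ingress capture that never show up on egress.

    Both arguments are iterables of bytes. Duplicates are matched one-to-one,
    so a frame captured twice on ingress must also appear twice on egress.
    """
    remaining = Counter(egress_frames)
    missing = []
    for frame in ingress_frames:
        if remaining[frame] > 0:
            remaining[frame] -= 1  # consume one matching egress frame
        else:
            missing.append(frame)
    return missing
```

An empty result for traffic entering on enp7s0 and leaving on enp8s0 (and for the reverse direction) would suggest the switch is forwarding everything, pointing the investigation at the endpoints instead. Frames replicated by the broadcast table will not line up one-to-one and should be filtered out first.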
