Container to container networking performance degradation #293

Open

brunograz opened this issue Jun 14, 2023 · 3 comments

@brunograz

Moving this issue out of cloudfoundry/cf-networking-release#213 as we have indications that it is related to the Stemcell.

Issue

We currently observe timeouts in container-to-container (C2C) networking when moving CF from Bionic to Jammy stemcells.
Please note that this issue can only be observed when the Diego cells are migrated from Bionic to Jammy and cannot be reproduced on Bionic stemcells.
As additional information, we've also tested in different environments with and without dynamic ASGs.

Steps to Reproduce - See additional information below

  • Install cf-deployment v27.2.0 on a Jammy stemcell
  • Push two apps and add a network policy enabling traffic from app-a to app-b:
  • cf add-network-policy app-a app-b --protocol tcp --port 8080
  • cf ssh into app-a and try to reach app-b over its internal (overlay) address

Expected result

Successful connections from app-a to app-b.

Current result

Sporadic timeouts and slow connections from app-a to app-b.

[backend-wgnnmafs]: Hello!
real    0m2.035s
user    0m0.000s
sys     0m0.007s
[backend-wgnnmafs]: Hello!
real    0m0.018s
user    0m0.000s
sys     0m0.007s
[backend-wgnnmafs]: Hello!
real    0m1.039s
user    0m0.000s
sys     0m0.007s
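
For reference, the timings above can be gathered with a short loop run from inside app-a. This is a minimal sketch, assuming app-b listens on port 8080; <app-b-overlay-ip> is a placeholder for the value of $CF_INSTANCE_INTERNAL_IP on the app-b container:

# run inside "cf ssh app-a"; replace <app-b-overlay-ip> with app-b's
# $CF_INSTANCE_INTERNAL_IP (a 10.255.x.x address)
for i in $(seq 1 10); do
  time curl -s http://<app-b-overlay-ip>:8080
done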

Workaround

On every Cloud Foundry Diego cell, disable two offload parameters on the network interface:
ethtool -K eth0 tx-udp_tnl-segmentation off && ethtool -K eth0 tx-udp_tnl-csum-segmentation off

These offloads are disabled (off) by default on Bionic but enabled by default on Jammy.
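
To verify the state on a cell before and after applying the workaround, a check like the following should do (a sketch; eth0 is assumed to be the cell's uplink interface, so substitute the actual device name):

# on the Diego cell; both flags read "on" on an unpatched Jammy cell
# and "off" on Bionic or after running the workaround above
ethtool -k eth0 | grep udp_tnl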

Further information

Infrastructure: ESXi prepared with NSX-T / NSX-V (tested on both); we are not sure whether it can be reproduced in other cloud environments.

@cunnie
Member

cunnie commented Jun 29, 2023

I'm having a hard time replicating this.

My throughput is fine; curl between two apps on different Diego hosts averages ~0.017 seconds. However, these ESXi hosts do not have the NSX VIBs installed, and the traffic passes along a regular VDS (Virtual Distributed Switch) instead of an NSX Segment.

Just to check: you're using the CF default Silk backend, not the NSX-T Container Plug-in, right? (If you don't know the answer, you're using the Silk backend.)

Technical Details

for app in dora-{0,1}; do
  echo -n "$app: "
  cf ssh $app -c 'echo $CF_INSTANCE_ADDR / $CF_INSTANCE_INTERNAL_IP'
done

Gives me:

dora-0: 10.9.250.18:61000 / 10.255.171.137
dora-1: 10.9.250.17:61000 / 10.255.249.140

cf add-network-policy dora-0 dora-1 --protocol tcp --port 8080
cf ssh dora-0
curl http://10.255.249.140:8080 # Hi, I'm Dora!
time for i in $(seq 1 1000); do curl http://10.255.249.140:8080 > /dev/null 2>&1 ; done

real	0m17.550s
user	0m4.991s
sys	0m6.824s

@brunograz
Author

Correct, we are using the Silk backend, not the NSX-T plug-in.
I have tried different Jammy stemcell versions, but only the workaround mentioned above worked, since these offloads are disabled by default on Bionic.
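
One quick way to confirm the differing defaults is to read the flags on one cell of each stemcell line. A sketch, assuming the BOSH CLI v2, a deployment named cf, an instance group named diego-cell, and eth0 as the interface (adjust names as needed):

# expect "on" on a stock Jammy cell and "off" on a Bionic cell
bosh -d cf ssh diego-cell/0 -c 'sudo ethtool -k eth0 | grep udp_tnl'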

@metron2
Contributor

metron2 commented Feb 22, 2024

We noticed this as well but didn't get to a solution; we just kept an isolation segment of Bionic cells for the one team using container-to-container networking.

They deploy an nginx server and use the internal app route to talk to their application (essentially to add authentication).
