one direction bandwidth testing fail with GPUdirect #289

Open
ilovesouthpark opened this issue Sep 8, 2024 · 2 comments

@ilovesouthpark

Hello,

I am testing two P100 GPUs in two nodes with two cx555 NICs.
The test succeeds in one direction but fails in the other.
Success
./ib_write_bw --use_cuda=0 -a 10.10.10.11
./ib_write_bw -d mlx5_0 --use_cuda=0 -a

Fail
./ib_write_bw --use_cuda=0 -a
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients

./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.10.10.10
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
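
For reference, the "Failed status 4" line can be decoded against the `ibv_wc_status` enum in libibverbs, where value 4 is `IBV_WC_LOC_PROT_ERR` (local protection error). The sketch below assumes that perftest prints the raw `ibv_wc_status` value and the vendor syndrome on that line; `decode_failure` is a hypothetical helper name, not part of perftest.

```python
import re

# Values of enum ibv_wc_status from libibverbs <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the failure line quoted above ("syndrom" is spelled as printed).
_FAIL_RE = re.compile(r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)")


def decode_failure(line):
    """Return the decoded fields of a perftest failure line, or None."""
    m = _FAIL_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1))
    return {
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "UNKNOWN"),
        "wr_id": int(m.group(2)),
        "vendor_syndrome": int(m.group(3), 16),
    }


print(decode_failure("Failed status 4: wr_id 0 syndrom 0x51"))
```

A local protection error on the client would be consistent with the NIC being unable to access the registered buffer on that node, though this alone does not identify the root cause.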

Plain bandwidth tests between the two cx555 NICs work well.

Driver and Kernel:
Both cx555 NICs run the same driver and firmware.
Both P100s run the same driver but have different VBIOS versions.
I am not using the NVIDIA open-source kernel module because it does not support the P100, but I do not think the kernel module is the problem; otherwise, why would one direction still work?

For IOMMU
10.10.10.11
sudo dmesg | grep -i dmar
[ 0.173076] DMAR: IOMMU disabled
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173010] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173076] DMAR: IOMMU disabled
[ 2.245922] iommu: Default domain type: Translated
[ 2.245922] iommu: DMA domain TLB invalidation policy: lazy mode

10.10.10.10
sudo dmesg | grep -i dmar
No output
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 0.030879] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 1.861879] iommu: Default domain type: Translated
[ 1.861879] iommu: DMA domain TLB invalidation policy: lazy mode
I have set the IOMMU to off on the kernel command line of both nodes (intel_iommu=off and amd_iommu=off), but the outputs are different.
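
Part of the difference is expected: DMAR is the ACPI table for Intel VT-d, so `DMAR:` lines only appear on the Intel host, while the AMD IOMMU driver logs under the `AMD-Vi` prefix. A minimal sketch for comparing the two nodes from saved `dmesg` text follows; `summarize_iommu` is a hypothetical helper, and the fields it reports are only the ones visible in the output above.

```python
import re


def summarize_iommu(dmesg_text):
    """Summarize IOMMU-related facts found in `dmesg` output."""
    summary = {
        "cmdline_params": [],      # e.g. ["intel_iommu=off"]
        "has_dmar_lines": False,   # Intel VT-d messages present
        "dmar_disabled": False,    # "DMAR: IOMMU disabled" seen
        "has_amd_vi_lines": False, # AMD IOMMU messages present
    }
    for line in dmesg_text.splitlines():
        if "Kernel command line:" in line:
            # Collect iommu=, intel_iommu= and amd_iommu= parameters.
            summary["cmdline_params"] = re.findall(
                r"\b(?:intel_|amd_)?iommu=\S+", line
            )
        if "DMAR:" in line:
            summary["has_dmar_lines"] = True
            if "IOMMU disabled" in line:
                summary["dmar_disabled"] = True
        if "AMD-Vi" in line:
            summary["has_amd_vi_lines"] = True
    return summary
```

Feeding it the output of `sudo dmesg` from each node gives two dictionaries that can be compared side by side; on the AMD node, searching for `AMD-Vi` instead of `dmar` is the closer equivalent check.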

What could be the cause of this issue, and how can I dig deeper to find the cause and a solution?
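
One check that could be run on both nodes, as a sketch only: GPUDirect RDMA with `--use_cuda` relies on a peer-memory kernel module (`nvidia_peermem`, or the older `nv_peer_mem`) being loaded next to the NVIDIA and mlx5 drivers. The helper names below are hypothetical, and the check assumes the standard `/proc/modules` format with the module name as the first field on each line.

```python
# Kernel modules that provide GPU peer-memory support for RDMA NICs.
PEER_MEM_MODULES = ("nvidia_peermem", "nv_peer_mem")


def loaded_peer_mem_modules(proc_modules_text):
    """Return the peer-memory modules listed in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEER_MEM_MODULES if name in loaded]


def read_proc_modules(path="/proc/modules"):
    """Read the list of loaded kernel modules from the running system."""
    with open(path) as f:
        return f.read()
```

Calling `loaded_peer_mem_modules(read_proc_modules())` on each node and getting an empty list on one of them would point at a missing module there; a non-empty list on both nodes rules this particular cause out.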

Thanks

@sshaulnv
Contributor

Seems like an issue we encountered. It may be related to the MMIO base in the system BIOS of the HV. Please try this solution: https://www.dell.com/support/manuals/en-il/vmware-esxi-6.5.x/esxi6.5.x_rn_pub/virtual-machines-fail-to-power-on-when-system-bios-has-mmio-set-to-56-tb-with-supported-gpu-config?guid=guid-ab3ea7a8-b8ca-481a-b6e2-d83ab989dac5

@ilovesouthpark
Author

Thanks. I have noted this post and tried to find the corresponding setting in my BIOS (Z690 mainboard), and found one for 4GB MMIO. By default it is linked to Resize BAR, and I can only disable it if I also disable Resize BAR. I tried that, but it did not help: the direction with the issue described above still does not work, while the other direction does. I hope someone else can share a solution or give some insights. Thanks anyway.
