ib_write_bw work normally but ib_write_bw -R failed #132

Open
sjc2870 opened this issue Jan 17, 2022 · 9 comments

Comments

@sjc2870

sjc2870 commented Jan 17, 2022

This is the output of 'ib_write_bw -a -d mlx5_0 --report_gbits node1', which seems to work fine:

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  node1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0032 PSN 0x5f2841 RKey 0x002440 VAddr 0x007f2e5f64b000
 remote address: LID 0x01 QPN 0x003a PSN 0xc0fd7e RKey 0x002442 VAddr 0x007f443b2f8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000           0.042350            0.042185            2.636533
 4          5000           0.084507            0.084457            2.639276
 8          5000             0.17               0.17   		   2.641399
 16         5000             0.34               0.34   		   2.640952
 32         5000             0.68               0.68   		   2.638957
 64         5000             1.35               1.35   		   2.638629
 128        5000             2.71               2.71   		   2.643606
 256        5000             5.42               5.42   		   2.644112
 512        5000             10.78              10.77  		   2.629186
 1024       5000             21.38              21.37  		   2.608802
 2048       5000             42.13              42.09  		   2.568967
 4096       5000             83.97              83.91  		   2.560721
 8192       5000             186.89             149.84 		   2.286319
 16384      5000             195.18             169.98 		   1.296822
 32768      5000             196.21             185.39 		   0.707209
 65536      5000             196.25             190.26 		   0.362886
 131072     5000             196.33             193.93 		   0.184945
 262144     5000             195.49             195.03 		   0.092996
 524288     5000             196.25             196.25 		   0.046789
 1048576    5000             196.48             196.48 		   0.023422
 2097152    5000             196.62             196.59 		   0.011718
 4194304    5000             196.67             196.63 		   0.005860
 8388608    5000             196.63             196.58 		   0.002929
---------------------------------------------------------------------------------------

But it fails if I add '-R':

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  -R node1
Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection

I read the source code and found that the failure is caused by RDMA_CM_EVENT_ADDR_ERROR, but I don't know why.

This is the output of 'lspci -vvv':

[root@node3 bin]#  lspci -vvv | grep Mellanox  -A 65
41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
	Subsystem: Mellanox Technologies Device 0007
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 1125
	NUMA node: 0
	Region 0: Memory at 2807e000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at b4400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                          
		Read-only fields:
			[PN] Part number: MCX653105A-HDAT          
			[EC] Engineering changes: AE
			[V2] Vendor specific: MCX653105A-HDAT          
			[SN] Serial number: MT2130T07644   
			[V3] Vendor specific: 92a87ffbcbeaeb118000b8cef6f7f1c0
			[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653105A      
			[V0] Vendor specific: PCIeGen4 x16 
			[VU] Vendor specific: MT2130T07644MLNXS0D0F0 
			[RV] Reserved: checksum good, 1 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Capabilities: [320 v1] #27
	Capabilities: [370 v1] #26
	Capabilities: [420 v1] #25
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

42:00.0 Non-Volatile memory controller: Intel Corporation NVMe DC SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation Device 8008

Any clue about what happened? I look forward to your reply, thanks!

@HassanKhadour
Contributor

Hi, please make sure to use the interface IP instead of the host name when using the -R option.

@sjc2870
Author

sjc2870 commented Jan 18, 2022

Hi, please make sure to use the interface IP instead of the host name when using the -R option.

Thanks! I had tried using the interface IP before your reply, but it still failed.
Your reply reminded me that I need to use the address of the IB network card, not the TCP/IP one.
Thanks a lot for your reply!
Wish you good health and every success!
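
For readers who hit the same ADDR_ERROR: with -R the connection is set up through rdma_cm, which resolves the destination as an IP address that must belong to an RDMA-capable interface. On an InfiniBand fabric that normally means the IPoIB interface (often named ib0), not the Ethernet address that a host name such as node1 may resolve to. The sketch below is one way to look up that address. It assumes a Linux host with iproute2 and the usual /sys/class/infiniband layout; the helper names rdma_netdevs and netdev_ipv4 are made up for this example and are not part of perftest.

```shell
# List the network interfaces that sit on top of an RDMA device by reading
# sysfs. The second argument (sysfs root) is an assumption added only so the
# function can be tried without hardware; normally leave it at the default.
rdma_netdevs() {
    local dev="$1" root="${2:-/sys/class/infiniband}"
    local dir="$root/$dev/device/net"
    [ -d "$dir" ] || { echo "no netdev directory for $dev" >&2; return 1; }
    ls "$dir"
}

# Print the IPv4 addresses (without the prefix length) configured on an
# interface, using the one-line output format of iproute2.
netdev_ipv4() {
    ip -o -4 addr show dev "$1" | awk '{ split($4, a, "/"); print a[1] }'
}

# Example use on the server (node1 in this thread):
#   for nd in $(rdma_netdevs mlx5_0); do netdev_ipv4 "$nd"; done
#
# Then start the server with -R and point the client at the printed address
# instead of the host name:
#   node1# ib_write_bw -a -d mlx5_0 --report_gbits -R
#   node3# ib_write_bw -a -d mlx5_0 --report_gbits -R <IPoIB address of node1>
```

If netdev_ipv4 prints nothing, the IPoIB interface probably has no IPv4 address configured yet, and rdma_cm would have nothing to resolve against.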

@HassanKhadour
Contributor

Hi sjc2870, thanks! Wish you the same.
Does it still reproduce, or did you solve the issue?

@Taco0220

I failed at this step, and I don't know what happened:
“Failed to modify QP to RTS
Unable to Connect the HCA's through the link”

@HassanKhadour
Contributor

Please try to use the interface IP rather than the hostname when running with rdma_cm.

@Taco0220

I used the interface IP.
Server error:
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
Client error:
Failed to modify QP to RTS
Unable to Connect the HCA's through the link

@HassanKhadour
Contributor

Can you please share the setup info (OS, cards, etc.) so I can try to reproduce the issue?

@Taco0220

Sorry, my OS is '6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC'. The network card is an Intel Corporation Ethernet Connection X722.

@codinggosu

Any updates on this?
I'm seeing similar issues on Rocky Linux 9, with a different card.
