[WIP][UR][CUDA][TEST] Add P2P initialization to multi-device test#21311

Draft: kekaczma wants to merge 4 commits into sycl from multi-device-test

@kekaczma kekaczma commented Feb 18, 2026

Initialize P2P access between device pairs in
urEnqueueKernelLaunchIncrementMultiDeviceTest to enable cross-device USM memcpy operations on CUDA.

  • Add urUsmP2PEnablePeerAccessExp calls in SetUp()
  • Add urUsmP2PDisablePeerAccessExp calls in TearDown()
  • Skip P2P for duplicate device handles (single GPU case)
  • Handle already-enabled and unsupported device pairs

This is a test commit to validate the fix on multi-GPU hardware.

@kekaczma kekaczma changed the title [UR][CUDA][TEST] Add P2P initialization to multi-device test [WIP][UR][CUDA][TEST] Add P2P initialization to multi-device test Feb 18, 2026

Fix CUDA adapter to properly map P2P access errors:
- Map CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED to UR_RESULT_ERROR_INVALID_OPERATION
- Map CUDA_ERROR_PEER_ACCESS_NOT_ENABLED to UR_RESULT_ERROR_INVALID_OPERATION
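
The error mapping listed above can be sketched in isolation. The enums below are hypothetical stand-ins for `CUresult` and `ur_result_t`, so the example compiles without the CUDA or UR headers; only the numeric values 704, 705 and 999 are taken from the CUDA driver API. Routing every other error to a generic `Unknown` is an assumption of this sketch, not something the commit states.

```cpp
// Hypothetical stand-in for CUresult; the numeric values match the CUDA
// driver API, the enumerator names are illustrative.
enum class CuResult {
  Success = 0,
  PeerAccessAlreadyEnabled = 704, // CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED
  PeerAccessNotEnabled = 705,     // CUDA_ERROR_PEER_ACCESS_NOT_ENABLED
  Unknown = 999                   // CUDA_ERROR_UNKNOWN
};

// Hypothetical stand-in for ur_result_t.
enum class UrResult { Success, InvalidOperation, Unknown };

// Maps both peer-access state errors to InvalidOperation, as the commit
// describes. Sending all remaining errors to Unknown is an assumption.
constexpr UrResult mapPeerAccessError(CuResult Result) {
  return Result == CuResult::Success ? UrResult::Success
         : (Result == CuResult::PeerAccessAlreadyEnabled ||
            Result == CuResult::PeerAccessNotEnabled)
             ? UrResult::InvalidOperation
             : UrResult::Unknown;
}
```

With this shape a caller can treat `InvalidOperation` from an enable call as "nothing to do" rather than as a test failure.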

Initialize P2P access in urEnqueueKernelLaunchIncrementMultiDeviceTest:
- Add urUsmP2PEnablePeerAccessExp calls in SetUp() for cross-device memcpy
- Add urUsmP2PDisablePeerAccessExp calls in TearDown() for cleanup
- Skip P2P operations for duplicate device handles (single GPU case)
- Accept INVALID_OPERATION for already-enabled or unsupported pairs
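
The duplicate-handle rule can be illustrated with a small helper that lists the device pairs worth passing to `urUsmP2PEnablePeerAccessExp`. This is a minimal sketch, not the test's actual code: `DeviceHandle` is a hypothetical integer stand-in for `ur_device_handle_t`, and `distinctPeerPairs` is an illustrative name.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for ur_device_handle_t; an integer ID keeps the
// sketch free of any UR or CUDA dependency.
using DeviceHandle = int;
using DevicePair = std::pair<DeviceHandle, DeviceHandle>;

// Collects every ordered (command device, peer device) pair of distinct
// handles. Repeated handles collapse, so a single GPU that appears several
// times in the device list never forms a pair with itself.
std::set<DevicePair> distinctPeerPairs(const std::vector<DeviceHandle> &Devices) {
  std::set<DevicePair> Pairs;
  for (DeviceHandle Command : Devices)
    for (DeviceHandle Peer : Devices)
      if (Command != Peer)
        Pairs.emplace(Command, Peer);
  // Every (A, B) has a mirrored (B, A), so the count is always even.
  assert(Pairs.size() % 2 == 0);
  return Pairs;
}
```

For a device list built from one physical GPU the result is empty, so SetUp() would issue no P2P calls at all.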

This fixes test failures on multi-GPU CUDA systems where P2P must be
explicitly enabled before cross-device USM memory operations.

Fixes #19033

Changes:
- Track enabled P2P pairs in member variable enabledP2PPairs
- SetUp: Only record pairs WE successfully enabled (both SUCCESS)
- TearDown: Disable P2P bidirectionally for our pairs, ignore errors
- Removes global P2P state dependency between test instances
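
A sketch of the bookkeeping these changes describe, under stated assumptions: `P2PTracker`, `PeerCall` and `Result` are hypothetical names, and the enable/disable callables stand in for `urUsmP2PEnablePeerAccessExp` and `urUsmP2PDisablePeerAccessExp`. Only the member name `enabledP2PPairs` comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <utility>

using DeviceHandle = int; // hypothetical stand-in for ur_device_handle_t
enum class Result { Success, InvalidOperation }; // stand-in for ur_result_t

// Hypothetical callable wrapping an enable or disable call for one
// (command device, peer device) pair.
using PeerCall = std::function<Result(DeviceHandle, DeviceHandle)>;

class P2PTracker {
public:
  // Enables access in both directions and records the pair only when both
  // calls report Success, so pairs enabled by someone else are never owned.
  void enable(DeviceHandle A, DeviceHandle B, const PeerCall &Enable) {
    assert(A != B && "duplicate handles must be skipped by the caller");
    const Result Forward = Enable(A, B);
    const Result Backward = Enable(B, A);
    if (Forward == Result::Success && Backward == Result::Success)
      enabledP2PPairs.emplace(A, B);
  }

  // Disables both directions for every recorded pair; errors are ignored
  // on purpose because this runs during cleanup.
  void disableAll(const PeerCall &Disable) {
    for (const auto &Pair : enabledP2PPairs) {
      (void)Disable(Pair.first, Pair.second);
      (void)Disable(Pair.second, Pair.first);
    }
    enabledP2PPairs.clear();
  }

  std::size_t size() const { return enabledP2PPairs.size(); }

private:
  std::set<std::pair<DeviceHandle, DeviceHandle>> enabledP2PPairs;
};
```

Because the tracker only disables what it enabled itself, one test instance cannot tear down P2P state that another instance still relies on.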

Works for both:
- 2 physical GPUs duplicated 4× (8 logical devices)
- 8 distinct physical GPUs

Fixes #19033

The CUDA adapter was using cuMemcpyAsync() for all USM memory copies,
including cross-device copies. However, CUDA requires cuMemcpyPeerAsync()
for peer-to-peer copies between different devices, even when P2P access
is enabled via cuCtxEnablePeerAccess().

This change:
- Detects cross-device copies by querying CU_POINTER_ATTRIBUTE_CONTEXT
  for both source and destination pointers
- Uses cuMemcpyPeerAsync() when contexts differ (cross-device copy)
- Falls back to cuMemcpyAsync() for same-device or host-device copies
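
The selection rule in the list above reduces to a comparison of two contexts. The sketch below models only that decision: `ContextId`, `NoContext` and `chooseCopyPath` are hypothetical names, and `NoContext` is an assumed way to represent a pointer for which the `CU_POINTER_ATTRIBUTE_CONTEXT` query yields no device context.

```cpp
// Hypothetical stand-in for the CUcontext reported by
// CU_POINTER_ATTRIBUTE_CONTEXT.
using ContextId = int;

// Assumed marker for a pointer with no associated device context.
constexpr ContextId NoContext = -1;

enum class CopyPath {
  Regular, // would call cuMemcpyAsync
  Peer     // would call cuMemcpyPeerAsync
};

// Picks the peer path only when both pointers belong to device contexts
// and those contexts differ; everything else takes the regular path.
constexpr CopyPath chooseCopyPath(ContextId Src, ContextId Dst) {
  return (Src != NoContext && Dst != NoContext && Src != Dst)
             ? CopyPath::Peer
             : CopyPath::Regular;
}
```

Keeping the decision in a pure function makes it easy to check the same-device and host-device cases without a GPU.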

This fixes the urEnqueueKernelLaunchIncrementMultiDeviceTest which
chains kernel launches and cross-device memcpy operations.

Fixes #19033

In a single-context multi-device setup on CUDA, pointer attributes cannot
reliably distinguish cross-device copies because all allocations share
the same CUDA context and may report device ordinal 0.

Solution: When the context has more than one device, try cuMemcpyPeerAsync
for each device pair until one succeeds. Fall back to cuMemcpyAsync if none
succeeds or if the context has a single device.
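
The probing order can be sketched as follows. This is an illustration of the control flow only: `copyWithPeerProbe`, `PeerCopy` and `RegularCopy` are hypothetical names, and the callables stand in for attempts at `cuMemcpyPeerAsync` and `cuMemcpyAsync`. It assumes a failed peer attempt reports failure cleanly, which the real driver may not guarantee.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using DeviceId = int; // hypothetical stand-in for a device in the context

// Hypothetical callables: PeerCopy attempts a peer copy for one
// (destination device, source device) pair and returns true on success;
// RegularCopy performs the ordinary same-device copy.
using PeerCopy = std::function<bool(DeviceId Dst, DeviceId Src)>;
using RegularCopy = std::function<void()>;

enum class ProbeOutcome { Peer, Regular };

// Tries each ordered pair of distinct devices until a peer copy succeeds.
// Uses the regular copy when the context has one device or no pair works.
ProbeOutcome copyWithPeerProbe(const std::vector<DeviceId> &Devices,
                               const PeerCopy &TryPeer,
                               const RegularCopy &Fallback) {
  assert(!Devices.empty() && "a context holds at least one device");
  if (Devices.size() > 1)
    for (DeviceId Dst : Devices)
      for (DeviceId Src : Devices)
        if (Dst != Src && TryPeer(Dst, Src))
          return ProbeOutcome::Peer;
  Fallback();
  return ProbeOutcome::Regular;
}
```

The number of attempts grows quadratically with the device count, which is one reason tracking allocation metadata would be preferable.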

This is a workaround; a proper solution would track allocation metadata.