@avinashkethineedi avinashkethineedi commented Nov 12, 2025

Motivation

  • This PR adds support for configurable QP allocation per PE in the GDA context.
  • It introduces environment variables to control the number of QPs for the default and user contexts.
  • Previously, the number of QPs per PE was fixed to 1.

Technical Details

Environment Variables

  • Added two new environment variables to configure QP allocation:
    • ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX: Number of QPs per PE in the default context.
    • ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX: Number of QPs per PE in each user context.
  • Introduced a get_qp_index() device method that computes the QP index from an atomic counter,
    so successive operations cycle through a PE's QPs in round-robin order.
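
As a rough illustration of how the two variables might be consumed at initialization, here is a minimal host-side C++ sketch. It is not the code in this PR: `read_positive_env`, `QpConfig`, and `load_qp_config` are hypothetical names, and falling back to 1 when a variable is unset or invalid is an assumption based on the previous fixed value of 1 QP per PE.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: parse a positive integer from an environment variable.
// Returns `fallback` when the variable is unset, empty, non-numeric,
// non-positive, or out of range for int.
inline int read_positive_env(const char* name, int fallback) {
    assert(fallback > 0);
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(value);
}

// Hypothetical container for the two per-context settings.
struct QpConfig {
    int default_ctx;  // QPs per PE in the default context
    int user_ctx;     // QPs per PE in each user context
};

// Assumption: both settings default to 1, the previously fixed value.
inline QpConfig load_qp_config() {
    return {read_positive_env("ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX", 1),
            read_positive_env("ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX", 1)};
}
```

Rejecting malformed values instead of aborting is a design choice of this sketch; a real implementation might prefer to report the error.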

Test Plan

  • Verify correctness by initializing multiple contexts with varying environment variable values.
  • Verify functionality and correctness across all three NIC types: mlx, bnxt, and ionic.
  • Measure and validate performance improvements with configurable QP allocation under different workloads.

Test Result

Submission Checklist

  • Code compiles successfully
  • All relevant unit tests pass
  • Verified behavior under multiple context configurations
  • Verified functionality and performance improvements with mlx
  • Verified functionality and performance improvements with bnxt
  • Verified functionality and performance improvements with ionic

- `ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX` to control the number of QPs per PE in the default context.
- `ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX` to control the number of QPs per PE in each user context.
- Added per-context QP allocation logic using environment variables
- Added `get_qp_index(int pe)` to compute the QP index using an atomic counter
  for round-robin access.
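
The round-robin selection described above can be sketched on the host with `std::atomic`. This is only an illustration of the technique, not the device method from this PR: the `QpSelector` class is hypothetical, and the flat layout `pe * qps_per_pe + slot` is an assumption about how QPs are stored.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for per-PE round-robin QP selection.
class QpSelector {
public:
    QpSelector(int num_pes, int qps_per_pe)
        : qps_per_pe_(qps_per_pe), counters_(num_pes) {
        assert(num_pes > 0 && qps_per_pe > 0);
        for (auto& counter : counters_) counter.store(0u);
    }

    // Each call takes the next ticket for `pe` and maps it onto one of
    // that PE's QPs, so consecutive calls cycle 0, 1, ..., qps_per_pe - 1.
    int get_qp_index(int pe) {
        assert(pe >= 0 && pe < static_cast<int>(counters_.size()));
        const unsigned ticket =
            counters_[pe].fetch_add(1u, std::memory_order_relaxed);
        const int slot =
            static_cast<int>(ticket % static_cast<unsigned>(qps_per_pe_));
        // Assumed layout: QPs for one PE are stored contiguously.
        return pe * qps_per_pe_ + slot;
    }

private:
    int qps_per_pe_;
    std::vector<std::atomic<unsigned>> counters_;  // one counter per PE
};
```

With 2 PEs and 3 QPs per PE, successive calls for PE 1 would return 3, 4, 5, and then 3 again under the assumed layout.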
…d update RMA/atomic APIs

- Replaced per-thread atomic fetch with warp-synchronous logic using
  `__match_any_sync` and `__shfl_sync` to group threads targeting the same PE.
- Only the leader lane performs the atomic increment, reducing contention.
- The leader broadcasts the computed QP index to all participating lanes, so each group issues one atomic instead of one per thread.
- Updated RMA and atomic APIs to use the new warp-synchronized QP indexing.
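
To make the leader-and-broadcast idea concrete, the following C++ sketch emulates one warp on the host. It is a simulation under stated assumptions, not the device code from this commit: `match_any` and `warp_qp_slots` are hypothetical helpers that mimic what `__match_any_sync` and `__shfl_sync` provide on the device, a warp is assumed to have 32 lanes, and the leader is assumed to be the lowest lane in each group.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;  // assumption: 32 lanes per warp

// Emulates __match_any_sync: bitmask of lanes targeting the same PE as `lane`.
inline std::uint32_t match_any(const std::array<int, kWarpSize>& pe, int lane) {
    std::uint32_t mask = 0;
    for (int other = 0; other < kWarpSize; ++other) {
        if (pe[other] == pe[lane]) mask |= (std::uint32_t{1} << other);
    }
    return mask;
}

// Index of the lowest set bit in a non-zero mask (the group's leader lane).
inline int lowest_lane(std::uint32_t mask) {
    assert(mask != 0);
    int lane = 0;
    while ((mask & 1u) == 0u) { mask >>= 1; ++lane; }
    return lane;
}

// Returns the QP slot each lane would use. Only the leader of each group
// touches the per-PE counter; other lanes copy the leader's result, which
// stands in for the __shfl_sync broadcast.
inline std::array<int, kWarpSize> warp_qp_slots(
        const std::array<int, kWarpSize>& pe,
        std::vector<std::atomic<unsigned>>& counters,
        int qps_per_pe) {
    assert(qps_per_pe > 0);
    std::array<int, kWarpSize> slot{};
    // Lanes are visited in ascending order, so a group's leader (its lowest
    // lane) has always written its slot before any follower reads it.
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int leader = lowest_lane(match_any(pe, lane));
        if (lane == leader) {
            const unsigned ticket =
                counters[pe[lane]].fetch_add(1u, std::memory_order_relaxed);
            slot[lane] =
                static_cast<int>(ticket % static_cast<unsigned>(qps_per_pe));
        } else {
            slot[lane] = slot[leader];
        }
    }
    return slot;
}
```

In this emulation, if half the lanes target PE 0 and half target PE 1, each counter advances by one per call rather than by sixteen, which is the contention reduction the commit describes.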