Skip to content

Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure (#1009)#1009

Open
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua:export-D95403772
Open

Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure (#1009)#1009
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua:export-D95403772

Conversation

@Scusemua
Copy link
Copy Markdown
Contributor

@Scusemua Scusemua commented Mar 10, 2026

Summary:

CQ creation (ibv_create_cq) costs ~183ms per call (99.95% of IB device setup). Pool CQs in CtranIbSingleton so subsequent inits/reconfigures reuse them (<1ms).

Fix 3 pre-existing bugs:

  • CQ vector accumulation leak on reconfigure
  • Double mutex unlock UB in initRemoteTransStates
  • IbvCq move assignment leaking old CQ

Add NCCL_CTRAN_IB_CQ_POOL_ENABLE CVAR (default: true).


Analysis of Benchmark Results

buck2 run fbcode//mode/opt fbcode//comms/mccl/benchmarks:mccl_init_benchmarks -- --gtest_break_on_failure 2>&1

Key Findings from Claude Analysis

  1. CQ pooling delivers a major reconfigure improvement: Reconfigure P50 dropped from 277ms (D95275574) to 156ms (D95410686), a 44% reduction. This is the headline result.
  2. Init improved 26% vs D95275574: Init P50 dropped from 198ms to 147ms. At 8 ranks, the improvement is even larger (46%, from 262ms to 141ms).
  3. Root cause of residual ~97ms identified: pollCq(maxCqe) in the CQ drain loop allocates a 192MB vector (4.2M * 48 bytes). Potential fix is to use a small constant batch size (256). Expected to reduce init from ~147ms to ~50ms.
  4. Lifecycle total is flat at 4-rank (508ms -> 507ms) because the init improvement is offset by the destroy regression (which is explained by the CQ accumulation bug fix). ReconfigureCycle total improved 25% (512ms -> 384ms), which is the metric that matters for fault-recovery.
  5. No correctness issues detected: All 22 tests pass, cross-rank timing is consistent, CQ pool lifecycle is correct (no leaks, no double-frees).
  6. Next optimization priority is clear: Fix the drain allocation (Step 1, ~1-line change, ~97ms savings), then instrument destroy (Step 3, P3).

Differential Revision: D95403772

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 10, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync bot commented Mar 10, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95403772.

Summary:
Add bounds check and `ss_family` validation in `CtranIb::bootstrapConnect` to diagnose an intermittent crash during reconfigure cycles.

The crash occurs when `toSocketAddress(allListenSocketAddrs[peerRank])` is called with a corrupt `sockaddr_storage` (e.g., `ss_family == 3`, expected `AF_INET=2` or `AF_INET6=10`). The current code throws `std::invalid_argument` which is uncaught by `CTRAN_ASYNC_ERR_GUARD`, causing `std::terminate()`.

This change:
- Validates `peerRank` is within bounds before indexing `allListenSocketAddrs`
- Validates `ss_family` is `AF_INET` or `AF_INET6` before calling `toSocketAddress`
- On corruption, logs a hex dump of the first 32 bytes, the `this` pointer (to detect stale objects), `commHash`, `commDesc`, and rank info
- Returns `commInternalError` instead of crashing, which propagates cleanly through `FB_COMMCHECK` -> `FB_COMMCHECKTHROW_EX` -> `CTRAN_ASYNC_ERR_GUARD`

Differential Revision: D96198714
…orch#1009)

Summary:

CQ creation (`ibv_create_cq`) costs ~183ms per call (99.95% of IB device setup). Pool CQs in `CtranIbSingleton` so subsequent inits/reconfigures reuse them (<1ms).

**Fix 3 pre-existing bugs:**
- CQ vector accumulation leak on reconfigure
- Double mutex unlock UB in `initRemoteTransStates`
- `IbvCq` move assignment leaking old CQ

Add `NCCL_CTRAN_IB_CQ_POOL_ENABLE` CVAR (default: true).

---

# Analysis of Benchmark Results

```sh
buck2 run fbcode//mode/opt fbcode//comms/mccl/benchmarks:mccl_init_benchmarks -- --gtest_break_on_failure 2>&1
```

**Key Findings from Claude Analysis**
1. **CQ pooling delivers a major reconfigure improvement:** Reconfigure P50 dropped from 277ms (D95275574) to 156ms (D95410686), a 44% reduction. This is the headline result.
2. **`Init` improved 26% vs D95275574:** Init P50 dropped from 198ms to 147ms. At 8 ranks, the improvement is even larger (46%, from 262ms to 141ms).
3. **Root cause of residual ~97ms identified:** `pollCq(maxCqe)` in the CQ drain loop allocates a 192MB vector (4.2M * 48 bytes). Potential fix is to use a small constant batch size (256). Expected to reduce init from ~147ms to ~50ms.
4. Lifecycle total is flat at 4-rank (508ms -> 507ms) because the init improvement is offset by the destroy regression (which is explained by the CQ accumulation bug fix). `ReconfigureCycle` total improved 25% (512ms -> 384ms), which is the metric that matters for fault-recovery.
5. **No correctness issues detected:** All 22 tests pass, cross-rank timing is consistent, CQ pool lifecycle is correct (no leaks, no double-frees).
6. **Next optimization priority is clear:** Fix the drain allocation (Step 1, ~1-line change, ~97ms savings), then instrument destroy (Step 3, P3).

Differential Revision: D95403772
@meta-codesync meta-codesync bot changed the title Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure (#1009) Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant