Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure (#1009)#1009
Open
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Open
Add CQ pooling to avoid ~183ms ibv_create_cq on reconfigure (#1009)#1009Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Scusemua wants to merge 2 commits intometa-pytorch:mainfrom
Conversation
Contributor
486a7b8 to
b0dde88
Compare
Summary: Add bounds check and `ss_family` validation in `CtranIb::bootstrapConnect` to diagnose an intermittent crash during reconfigure cycles. The crash occurs when `toSocketAddress(allListenSocketAddrs[peerRank])` is called with a corrupt `sockaddr_storage` (e.g., `ss_family == 3`, expected `AF_INET=2` or `AF_INET6=10`). The current code throws `std::invalid_argument` which is uncaught by `CTRAN_ASYNC_ERR_GUARD`, causing `std::terminate()`. This change: - Validates `peerRank` is within bounds before indexing `allListenSocketAddrs` - Validates `ss_family` is `AF_INET` or `AF_INET6` before calling `toSocketAddress` - On corruption, logs a hex dump of the first 32 bytes, the `this` pointer (to detect stale objects), `commHash`, `commDesc`, and rank info - Returns `commInternalError` instead of crashing, which propagates cleanly through `FB_COMMCHECK` -> `FB_COMMCHECKTHROW_EX` -> `CTRAN_ASYNC_ERR_GUARD` Differential Revision: D96198714
…orch#1009) Summary: CQ creation (`ibv_create_cq`) costs ~183ms per call (99.95% of IB device setup). Pool CQs in `CtranIbSingleton` so subsequent inits/reconfigures reuse them (<1ms). **Fix 3 pre-existing bugs:** - CQ vector accumulation leak on reconfigure - Double mutex unlock UB in `initRemoteTransStates` - `IbvCq` move assignment leaking old CQ Add `NCCL_CTRAN_IB_CQ_POOL_ENABLE` CVAR (default: true). --- # Analysis of Benchmark Results ```sh buck2 run fbcode//mode/opt fbcode//comms/mccl/benchmarks:mccl_init_benchmarks -- --gtest_break_on_failure 2>&1 ``` **Key Findings from Claude Analysis** 1. **CQ pooling delivers a major reconfigure improvement:** Reconfigure P50 dropped from 277ms (D95275574) to 156ms (D95410686), a 44% reduction. This is the headline result. 2. **`Init` improved 26% vs D95275574:** Init P50 dropped from 198ms to 147ms. At 8 ranks, the improvement is even larger (46%, from 262ms to 141ms). 3. **Root cause of residual ~97ms identified:** `pollCq(maxCqe)` in the CQ drain loop allocates a 192MB vector (4.2M * 48 bytes). Potential fix is to use a small constant batch size (256). Expected to reduce init from ~147ms to ~50ms. 4. Lifecycle total is flat at 4-rank (508ms -> 507ms) because the init improvement is offset by the destroy regression (which is explained by the CQ accumulation bug fix). `ReconfigureCycle` total improved 25% (512ms -> 384ms), which is the metric that matters for fault-recovery. 5. **No correctness issues detected:** All 22 tests pass, cross-rank timing is consistent, CQ pool lifecycle is correct (no leaks, no double-frees). 6. **Next optimization priority is clear:** Fix the drain allocation (Step 1, ~1-line change, ~97ms savings), then instrument destroy (Step 3, P3). Differential Revision: D95403772
b0dde88 to
001b591
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
CQ creation (
ibv_create_cq) costs ~183ms per call (99.95% of IB device setup). Pool CQs inCtranIbSingletonso subsequent inits/reconfigures reuse them (<1ms).Fix 3 pre-existing bugs:
initRemoteTransStatesIbvCqmove assignment leaking old CQAdd
NCCL_CTRAN_IB_CQ_POOL_ENABLECVAR (default: true).Analysis of Benchmark Results
buck2 run fbcode//mode/opt fbcode//comms/mccl/benchmarks:mccl_init_benchmarks -- --gtest_break_on_failure 2>&1Key Findings from Claude Analysis
Initimproved 26% vs D95275574: Init P50 dropped from 198ms to 147ms. At 8 ranks, the improvement is even larger (46%, from 262ms to 141ms).pollCq(maxCqe)in the CQ drain loop allocates a 192MB vector (4.2M * 48 bytes). Potential fix is to use a small constant batch size (256). Expected to reduce init from ~147ms to ~50ms.ReconfigureCycletotal improved 25% (512ms -> 384ms), which is the metric that matters for fault-recovery.Differential Revision: D95403772