System Info
- GPU: NVIDIA B200 (2x, TP2)
- TensorRT-LLM: 1.3.0rc16 (code path also present on current main)
- Host: 2 TB RAM, long-running multi-tenant node
- Model: google/gemma-4-31B-it (Gemma 4 31B IT — a hybrid-attention model, which selects KVCacheManagerV2)
Description
When kv_cache_config.host_cache_size is large (hundreds of GiB), engine startup spends tens of minutes allocating the host cache pool, and on a long-running / memory-pressured host it can fail to come up at all within any reasonable startup window.
HostMem (tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py) mmaps the pool and lets cuMemHostRegister fault the pages in. Two factors make this pathological:
- Single-threaded lazy faulting — every page is allocated and zeroed inside the
cuMemHostRegister call, one thread, sequentially.
- THP compaction stalls — the pool is advised
MADV_HUGEPAGE; with transparent_hugepage/defrag=madvise, every 2 MiB fault enters synchronous direct compaction, which almost always fails once physical memory is fragmented (no free 2 MiB blocks), so each fault pays the compaction cost and still falls back to 4 KiB. Throughput drops from GB/s to GB/min, and the host-wide driver lock held during registration also slows other CUDA process startups and nvidia-smi on the machine.
Measured at DeepInfra on google/gemma-4-31B-it (TP2, 2x B200): at host_cache_size=300GB/rank the server never became ready — across multiple attempts the host cache pool had still not finished allocating after 45+ minutes, at which point we stopped. With the proposed change below the same configuration boots in ~4 minutes.
Expected behavior
Large host cache pools should allocate in seconds-to-minutes, and startup time should not depend on host memory fragmentation state.
Proposed enhancement
Populate the pool with parallel MADV_POPULATE_WRITE before registration (so cuMemHostRegister only pins already-resident pages), and provide an option to back the pool with regular 4 KiB pages instead of THP. Both gated by env vars, defaults preserving current behavior. With these, google/gemma-4-31B-it at 300 GiB/rank — which previously never finished booting — comes up in ~4 minutes.
I have a patch ready and will open a PR referencing this issue.
System Info
Description
When
kv_cache_config.host_cache_sizeis large (hundreds of GiB), engine startup spends tens of minutes allocating the host cache pool, and on a long-running / memory-pressured host it can fail to come up at all within any reasonable startup window.HostMem(tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py) mmaps the pool and letscuMemHostRegisterfault the pages in. Two factors make this pathological:cuMemHostRegistercall, one thread, sequentially.MADV_HUGEPAGE; withtransparent_hugepage/defrag=madvise, every 2 MiB fault enters synchronous direct compaction, which almost always fails once physical memory is fragmented (no free 2 MiB blocks), so each fault pays the compaction cost and still falls back to 4 KiB. Throughput drops from GB/s to GB/min, and the host-wide driver lock held during registration also slows other CUDA process startups andnvidia-smion the machine.Measured at DeepInfra on google/gemma-4-31B-it (TP2, 2x B200): at
host_cache_size=300GB/rank the server never became ready — across multiple attempts the host cache pool had still not finished allocating after 45+ minutes, at which point we stopped. With the proposed change below the same configuration boots in ~4 minutes.Expected behavior
Large host cache pools should allocate in seconds-to-minutes, and startup time should not depend on host memory fragmentation state.
Proposed enhancement
Populate the pool with parallel
MADV_POPULATE_WRITEbefore registration (socuMemHostRegisteronly pins already-resident pages), and provide an option to back the pool with regular 4 KiB pages instead of THP. Both gated by env vars, defaults preserving current behavior. With these, google/gemma-4-31B-it at 300 GiB/rank — which previously never finished booting — comes up in ~4 minutes.I have a patch ready and will open a PR referencing this issue.