KVCacheManagerV2 host cache pool allocation is very slow for large host_cache_size (single-threaded faulting + THP compaction stalls)

### System Info

- GPU: NVIDIA B200 (2x, TP2)
- TensorRT-LLM: 1.3.0rc16 (code path also present on current main)
- Host: 2 TB RAM, long-running multi-tenant node
- Model: google/gemma-4-31B-it (Gemma 4 31B IT — a hybrid-attention model, which selects KVCacheManagerV2)

### Description

When `kv_cache_config.host_cache_size` is large (hundreds of GiB), engine startup spends tens of minutes allocating the host cache pool, and on a long-running / memory-pressured host it can fail to come up at all within any reasonable startup window.

`HostMem` (`tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`) mmaps the pool and lets `cuMemHostRegister` fault the pages in. Two factors make this pathological:

1. **Single-threaded lazy faulting** — every page is allocated and zeroed inside the `cuMemHostRegister` call, one thread, sequentially.
2. **THP compaction stalls** — the pool is advised `MADV_HUGEPAGE`; with `transparent_hugepage/defrag=madvise`, every 2 MiB fault enters synchronous direct compaction, which almost always fails once physical memory is fragmented (no free 2 MiB blocks), so each fault pays the compaction cost and still falls back to 4 KiB. Throughput drops from GB/s to GB/min, and the host-wide driver lock held during registration also slows other CUDA process startups and `nvidia-smi` on the machine.

Measured at DeepInfra on google/gemma-4-31B-it (TP2, 2x B200): at `host_cache_size=300GB`/rank the server never became ready — across multiple attempts the host cache pool had still not finished allocating after 45+ minutes, at which point we stopped. With the proposed change below the same configuration boots in ~4 minutes.

### Expected behavior

Large host cache pools should allocate in seconds-to-minutes, and startup time should not depend on host memory fragmentation state.

### Proposed enhancement

Populate the pool with parallel `MADV_POPULATE_WRITE` before registration (so `cuMemHostRegister` only pins already-resident pages), and provide an option to back the pool with regular 4 KiB pages instead of THP. Both gated by env vars, defaults preserving current behavior. With these, google/gemma-4-31B-it at 300 GiB/rank — which previously never finished booting — comes up in ~4 minutes.

I have a patch ready and will open a PR referencing this issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KVCacheManagerV2 host cache pool allocation is very slow for large host_cache_size (single-threaded faulting + THP compaction stalls) #15430

System Info

Description

Expected behavior

Proposed enhancement

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

KVCacheManagerV2 host cache pool allocation is very slow for large host_cache_size (single-threaded faulting + THP compaction stalls) #15430

Description

System Info

Description

Expected behavior

Proposed enhancement

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions