KVCacheManagerV2 host cache pool allocation is very slow for large host_cache_size (single-threaded faulting + THP compaction stalls) #15430

@nafis271

System Info

  • GPU: NVIDIA B200 (2x, TP2)
  • TensorRT-LLM: 1.3.0rc16 (code path also present on current main)
  • Host: 2 TB RAM, long-running multi-tenant node
  • Model: google/gemma-4-31B-it (Gemma 4 31B IT — a hybrid-attention model, which selects KVCacheManagerV2)

Description

When kv_cache_config.host_cache_size is large (hundreds of GiB), engine startup spends tens of minutes allocating the host cache pool, and on a long-running / memory-pressured host it can fail to come up at all within any reasonable startup window.

HostMem (tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py) mmaps the pool and lets cuMemHostRegister fault the pages in. Two factors make this pathological:

  1. Single-threaded lazy faulting: every page is allocated and zeroed inside the cuMemHostRegister call, sequentially, on one thread.
  2. THP compaction stalls: the pool is advised MADV_HUGEPAGE. With transparent_hugepage/defrag=madvise, every 2 MiB fault enters synchronous direct compaction. Once physical memory is fragmented (no free 2 MiB blocks), that compaction almost always fails, so each fault pays the compaction cost and still falls back to 4 KiB pages. Throughput drops from GB/s to GB/min, and because a host-wide driver lock is held during registration, other CUDA process startups and nvidia-smi on the machine slow down as well.
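
To check whether a given host is exposed to the second factor, the active THP policy can be read from sysfs. The following is a minimal diagnostic sketch, not part of TensorRT-LLM: the helper names (`active_option`, `read_thp_settings`, `madvised_faults_may_stall`) are hypothetical, and the only external facts it relies on are the standard Linux sysfs files under /sys/kernel/mm/transparent_hugepage and the kernel's documented `defrag` modes.

```python
from pathlib import Path

# Standard Linux sysfs location for THP policy (absent if THP is not built in).
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")

# defrag modes under which a fault in a MADV_HUGEPAGE region may stall on
# direct reclaim/compaction, per the kernel's transparent hugepage docs.
STALLING_DEFRAG = ("always", "defer+madvise", "madvise")


def active_option(text: str) -> str:
    """Return the bracketed choice from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active option in {text!r}")


def read_thp_settings(base: Path = THP_DIR) -> dict:
    """Read the active 'enabled' and 'defrag' values; skip files that are missing."""
    settings = {}
    for name in ("enabled", "defrag"):
        path = base / name
        if path.exists():
            settings[name] = active_option(path.read_text())
    return settings


def madvised_faults_may_stall(settings: dict) -> bool:
    """True if a MADV_HUGEPAGE region can hit synchronous compaction on fault."""
    thp_on = settings.get("enabled") in ("always", "madvise")
    return thp_on and settings.get("defrag") in STALLING_DEFRAG


if __name__ == "__main__":
    current = read_thp_settings()
    print(current, "-> may stall:", madvised_faults_may_stall(current))
```

A `True` result only says the policy permits synchronous compaction for advised regions; whether it actually stalls depends on how fragmented physical memory is at the time.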

Measured at DeepInfra on google/gemma-4-31B-it (TP2, 2x B200): at host_cache_size=300GB/rank the server never became ready — across multiple attempts the host cache pool had still not finished allocating after 45+ minutes, at which point we stopped. With the proposed change below the same configuration boots in ~4 minutes.

Expected behavior

Large host cache pools should allocate in seconds-to-minutes, and startup time should not depend on host memory fragmentation state.

Proposed enhancement

Populate the pool with parallel MADV_POPULATE_WRITE before registration (so cuMemHostRegister only pins already-resident pages), and provide an option to back the pool with regular 4 KiB pages instead of THP. Both would be gated by environment variables, with defaults that preserve current behavior. With these changes, google/gemma-4-31B-it at 300 GiB/rank, which previously never finished booting, comes up in ~4 minutes.
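
For illustration, parallel pre-population can be sketched in plain Python as below. This is not the patch itself, only a sketch under stated assumptions: `split_ranges` and `populate_parallel` are hypothetical names, the numeric value 23 for MADV_POPULATE_WRITE is taken from the Linux UAPI headers (Linux 5.14+), and `madvise` is called through ctypes because ctypes foreign calls release the GIL, which lets the worker threads fault pages concurrently. On kernels without MADV_POPULATE_WRITE the sketch falls back to a single-threaded page touch.

```python
import ctypes
import errno
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

MADV_POPULATE_WRITE = 23  # assumption: Linux >= 5.14 UAPI value
PAGE = mmap.PAGESIZE

_libc = ctypes.CDLL(None, use_errno=True)
_libc.madvise.argtypes = (ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int)
_libc.madvise.restype = ctypes.c_int


def split_ranges(total: int, parts: int, align: int = PAGE) -> list:
    """Split [0, total) into at most `parts` (offset, length) ranges aligned to `align`."""
    if total <= 0:
        return []
    step = -(-total // parts)          # ceil(total / parts)
    step = -(-step // align) * align   # round up to the alignment
    return [(off, min(step, total - off)) for off in range(0, total, step)]


def populate_parallel(buf: mmap.mmap, workers: int = 4) -> str:
    """Fault in every page of `buf`; return which strategy was used."""
    view = ctypes.c_char.from_buffer(buf)  # pins the buffer so its address is stable
    base = ctypes.addressof(view)

    def populate(rng):
        off, length = rng
        if _libc.madvise(base + off, length, MADV_POPULATE_WRITE) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(populate, split_ranges(len(buf), workers)))
        return "madvise"
    except OSError as exc:
        if exc.errno not in (errno.EINVAL, errno.ENOSYS):
            raise
        # Fallback for kernels without MADV_POPULATE_WRITE: write-touch each
        # page on the calling thread (same effect, no parallelism).
        for off in range(0, len(buf), PAGE):
            buf[off] = buf[off]
        return "touch"
    finally:
        del view  # release the buffer export


if __name__ == "__main__":
    pool_buf = mmap.mmap(-1, 256 * 1024 * 1024,
                         flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    print(populate_parallel(pool_buf, workers=os.cpu_count() or 4))
```

In a real pool the registration call would follow the populate step, so that it only has to pin pages that are already resident. The 4 KiB option would presumably amount to not advising MADV_HUGEPAGE on the mapping (or advising MADV_NOHUGEPAGE); the exact mechanism and the environment variable names are left to the PR.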

I have a patch ready and will open a PR referencing this issue.
