
Conversation

@JyChang012 JyChang012 commented Oct 11, 2025

Still need to remove some code for logging / testing

Summary by CodeRabbit

  • New Features

    • Added CUDA Graph mode for LoRA with multi-adapter batching and slot management.
    • Introduced fused parameter preparation and row reordering to reduce kernel launches.
    • Exposed a device-side cache check for tasks in Python.
    • Enabled optional NVTX profiling wrappers for easier performance tracing.
  • Performance

    • Implemented CUDA Graph–compatible grouped and split-K GEMM paths for faster LoRA execution.
    • Reduced per-step overhead via persistent buffers and slot reuse.
  • Tests

    • Expanded test coverage to run LoRA scenarios with and without CUDA Graph, including edge cases.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
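For example, to re-run just one test stage with fail-fast disabled, a developer might comment /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast (the stage name is only illustrative; use the mapping helper above to find actual stage names).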

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@JyChang012 JyChang012 requested review from a team as code owners October 11, 2025 01:43
@JyChang012 JyChang012 requested review from hlu1, pcastonguay and venkywonka and removed request for hlu1, pcastonguay and venkywonka October 11, 2025 01:43

coderabbitai bot commented Oct 11, 2025

📝 Walkthrough

Adds CUDA Graph-based multi-LoRA execution: new grouped GEMM kernels and a fused param-fill/reorder kernel; Torch bindings and THOP entry points; Python-side CUDA Graph LoRA params, slot and manager classes; engine integration with optional CUDA Graph path; PEFT cache/device-lookup extensions; tests and NVTX profiling hooks. Also adjusts attention and miscellaneous bindings.

Changes

Cohort / File(s) Summary
PEFT cache API updates
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
Add PeftCacheManager::isTaskCachedDevice; ensureBatch maps taskId to device-resolved LoRA config.
Bindings for PEFT cache
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
Expose is_task_cached and new is_task_cached_device to Python with GIL release.
CUDA Graph grouped GEMM kernels
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
New CUDA Graph-compatible grouped GEMM and split-K grouped GEMM implementations and declarations.
LoRA fused prep kernel
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
New fused kernel to fill params, row-reorder, zero-fill, with launcher API.
LoRA kernel comment
cpp/tensorrt_llm/kernels/lora/lora.cpp
Add clarifying comment on GemmCoord usage.
THOP LoRA ops
cpp/tensorrt_llm/thop/loraOp.cpp
Add CUDA Graph grouped GEMM path and fused param-fill/reorder entry; register Torch ops.
Attention module tweak
tensorrt_llm/_torch/modules/attention.py
Wrap o_lora init in string literal; o_lora assignment skipped.
CUDA Graph LoRA manager and params
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, .../cuda_graph_lora_manager.py, .../cuda_graph_lora_params.py
Add AdapterSlotManager (LRU slots), CudaGraphLoraManager (prep flow), and CudaGraphLoraParams (persistent CUDA Graph buffers, pointers, sizes).
LoRA layer integration
tensorrt_llm/_torch/peft/lora/layer.py
Add CUDA Graph mode in forward, buffer prep helpers, dataclasses for grouped GEMM params, tensorized size metadata.
Engine integration
tensorrt_llm/_torch/pyexecutor/model_engine.py, .../_util.py, .../py_executor.py, tensorrt_llm/executor/worker.py
Initialize CUDA Graph LoRA manager, propagate maybe_graph, add NVTX emit decorator and tracing wrapper; minor prints.
Resource/PEFT plumbing
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Add batch PEFT table getters/reset; expose is_task_cached_device; track batch PEFT state.
NVTX utility
tensorrt_llm/_utils.py
Add nvtx_pytorch_emit decorator factory.
Tests: CUDA Graph LoRA
tests/unittest/llmapi/lora_test_utils.py, tests/unittest/llmapi/test_llm_pytorch.py
Add CUDA Graph LoRA test params and helpers; parametrize many tests with cuda_graph_config; add kernel special-case tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Engine as PyTorchModelEngine
  participant LoraMgr as CudaGraphLoraManager
  participant SlotMgr as AdapterSlotManager
  participant Peft as PeftCacheManager
  participant Params as CudaGraphLoraParams
  participant THOP as thop lora ops
  participant Kern as CUDA Graph GEMM Kernels

  Engine->>LoraMgr: prepare_cuda_graph_lora_params(scheduled_requests, attn_metadata, peft_cache_manager)
  LoraMgr->>Peft: get_and_reset_batch_peft_table()
  LoraMgr->>SlotMgr: update_slots(requests, peft_cache_manager)
  SlotMgr-->>LoraMgr: batch_slot_ids, slots_changed
  LoraMgr->>Params: update_sorted_indices(batch_slot_ids)
  alt slots_changed
    LoraMgr->>Params: update_weight_pointers(peft_table)
    LoraMgr->>SlotMgr: reset_changed_flag()
  end
  LoraMgr->>Params: update_slots_params(batch_slot_ids)
  LoraMgr-->>Engine: {cuda_graph_params, use_cuda_graph_mode, ...}
  Engine->>THOP: lora_group_gemm_param_fill_row_reorder_fusion(...)
  THOP->>Kern: launchLoraGroupGEMMParamFillRowReorderFusion(...)
  Engine->>THOP: lora_grouped_gemm_cuda_graph(... in/out ...)
  THOP->>Kern: cuda_graph_grouped_gemm / splitk_grouped_gemm(...)
sequenceDiagram
  autonumber
  participant Layer as LoraLayer.forward
  participant CG as CUDA-Graph path
  participant Legacy as Legacy path

  Layer->>Layer: decide mode (cuda_graph_enabled && params available)
  alt CUDA Graph mode
    Layer->>CG: _forward_cuda_graph_mode(...)
    CG->>CG: prepare_grouped_gemm_buffers / fused prep
    CG-->>Layer: output tensor or None
  else Legacy mode
    Layer->>Legacy: _forward_legacy_mode(...)
    Legacy-->>Layer: output tensor or None
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Description Check ⚠️ Warning: The PR description consists solely of the unmodified template and lacks any actual summary of the issue, explanation of the solution, or listing of relevant tests under the required "Description" and "Test Coverage" sections, making the intent and scope of the changes unclear. Resolution: fill in the "Description" section with a concise explanation of the problem and the implemented solution, and complete the "Test Coverage" section by listing the new or updated tests that verify the changes.
Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Title Check ✅ Passed: The title follows the repository's template by including "[None][feat]" and concisely summarizes the primary change of enabling multi-LoRA serving with CUDA Graph, which aligns directly with the extensive CUDA Graph and multi-LoRA support additions in the PR.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

362-372: Initialize CUDA Graph LoRA manager when LoRA is configured

_set_lora_model_config does not call _init_cuda_graph_lora_manager, so graph mode will never be used. Call it after setting lora_model_config when cuda_graph_runner is enabled.

 def set_lora_model_config(self,
                           lora_target_modules: list[str],
                           trtllm_modules_to_hf_modules: dict[str, str],
                           swap_gate_up_proj_lora_b_weight: bool = True):
     self.lora_model_config = LoraModelConfig(
         lora_target_modules=lora_target_modules,
         trtllm_modules_to_hf_modules=trtllm_modules_to_hf_modules,
         hidden_size=self.model.config.hidden_size,
         dtype=torch_dtype_to_str(self.model.config.torch_dtype),
         swap_gate_up_proj_lora_b_weight=swap_gate_up_proj_lora_b_weight)
+    # Initialize CUDA Graph LoRA manager if possible
+    lora_config = getattr(self.pytorch_backend_config, "lora_config", None)
+    if lora_config is not None:
+        self._init_cuda_graph_lora_manager(lora_config)

Also applies to: 373-392

tests/unittest/llmapi/test_llm_pytorch.py (2)

883-895: Remove duplicate class definition.

The TestLlmError class is defined twice (lines 883-895 and 1013-1024) with identical test_max_num_token_check methods. The second definition will shadow the first. Remove one of the duplicate class definitions.

Also applies to: 1013-1024


1-1: Add NVIDIA copyright header.

The coding guidelines require prepending the NVIDIA Apache-2.0 copyright header with the current year to the top of all source files.

As per coding guidelines.

🧹 Nitpick comments (13)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1)

1-75: LGTM! Header declaration is well-documented and follows conventions.

The new kernel launcher header:

  • Comprehensive documentation with @brief and detailed parameter descriptions
  • Proper copyright header with current year (2025)
  • Correct namespace usage (tensorrt_llm::kernels)
  • Function signature matches implementation (verified against .cu file snippet)
  • Appropriate use of #pragma once for include guard

The function consolidates multiple operations (param fill, row reorder, zero fill) into a single CUDA graph-compatible kernel, which is a good design for performance.

Optional suggestion: Consider using traditional include guards instead of #pragma once to align with the coding guideline that specifies the format TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H. However, #pragma once is widely supported and more concise, so this is a minor stylistic preference.

cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1)

28-59: Prefer Doxygen-style API docs for public headers.

Convert to //! and document all parameters (e.g., lda/ldb/ldc/ldd and host vs device pointers).

As per coding guidelines

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1238-1243: Fix return type to Optional (can be None).

Method can return None after reset; align annotation to avoid confusion.

-    def get_and_reset_batch_peft_table(
-            self) -> Dict[int, list[TaskLayerModuleConfig]]:
+    def get_and_reset_batch_peft_table(
+            self) -> Optional[Dict[int, List[TaskLayerModuleConfig]]]:

As per coding guidelines

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1)

151-151: Remove no-op statement.

len(request_list) has no effect.

-        len(request_list)
+        # no-op
tensorrt_llm/_torch/peft/lora/layer.py (5)

89-98: Avoid mutable default args; silence unused variable

Use None defaults and initialize inside. Prefix unused local to satisfy linters.

-def compare_grouped_gemm_params(
-    params: GroupedGemmParamsOutput,
-    ref: GroupedGemmParamsOutput,
-    params_input: GroupedGemmParamsInput,
-    params_to_store_msg: List[str] | None = ['splitk_offsets'],
-    params_exclude_msg: List[str] | None = None,
-):
+def compare_grouped_gemm_params(
+    params: GroupedGemmParamsOutput,
+    ref: GroupedGemmParamsOutput,
+    params_input: GroupedGemmParamsInput,
+    params_to_store_msg: Optional[List[str]] = None,
+    params_exclude_msg: Optional[List[str]] = None,
+):
@@
-    bs, input_hidden_size = params.reordered_input.shape
+    bs, _input_hidden_size = params.reordered_input.shape
@@
-    if not params_to_store_msg:
-        params_to_store_msg = set(params_dict.keys())
+    if not params_to_store_msg:
+        params_to_store_msg = set(params_dict.keys())

Based on learnings


114-123: Type-safe equality for integer tensors in debug compare

torch.allclose is for floating dtypes. Use torch.equal for integral tensors to avoid runtime errors when debug is enabled.

-        if name not in ("reordered_input", "a_offset"):
-            asserter.add(
-                v.allclose(ref_v),
-                get_msg(name, v, ref_v),
-            )
+        if name not in ("reordered_input", "a_offset"):
+            if v.dtype.is_floating_point:
+                ok = torch.allclose(v, ref_v)
+            else:
+                ok = torch.equal(v, ref_v)
+            asserter.add(ok, get_msg(name, v, ref_v))

290-291: Tuple construction nit: use splat for shape_3d

Cleaner and avoids creating an intermediate tuple.

-        shape_3d = shape_2d + (3, )
+        shape_3d = (*shape_2d, 3)

Apply in all three sites.

Also applies to: 413-414, 482-483


593-597: Remove f-prefix from constant strings

Minor lint (F541). Drop the f where no placeholders exist.

-            print(
-                f'--------------------------------layer key: {layer_key}--------------------------------'
-            )
-            print(f'cuda graph params values:')
+            print(
+                f'--------------------------------layer key: {layer_key}--------------------------------'
+            )
+            print('cuda graph params values:')
@@
-            print(f'buffers values:')
+            print('buffers values:')
@@
-            print(f'calculated buffers')
+            print('calculated buffers')

Also applies to: 610-618


44-52: Gate or remove debug flags before merge

These globals alter runtime paths and include heavy printing/assert scaffolding. Consider gating via env vars or removing before release.

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (3)

144-146: Silence unused args in _create_layer_params

Rename to underscore to satisfy linters and reflect non-use.

-    def _create_layer_params(
-            self, key: LoraLayerKey, layer_module_num: int,
-            module_output_sizes: torch.Tensor) -> LoraLayerParams:
+    def _create_layer_params(
+            self, _key: LoraLayerKey, layer_module_num: int,
+            _module_output_sizes: torch.Tensor) -> LoraLayerParams:

183-187: Silence unused local in get_sorted_indices

Prefix with underscore.

-        sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)
+        _sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)

70-79: Avoid print in library init

Prefer logger or remove to keep init silent.

-        print(
-            f'cuda graph lora params init max batch size: {max_batch_size}, max lora size: {max_lora_size}, max rank: {max_rank}'
-        )
+        # Consider using logger.debug for initialization info.
tests/unittest/llmapi/test_llm_pytorch.py (1)

270-270: Avoid global CUDA device setting in tests. Replace torch.cuda.set_device(0) with a scoped approach (e.g., a pytest fixture or with torch.cuda.device(0):) to isolate device selection and prevent interference when tests run in parallel.
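A minimal sketch of the scoped alternative (fixture and test names are illustrative, not the project's actual test code):

import pytest
import torch

@pytest.fixture
def on_cuda_device_0():
    # Scope device selection to one test instead of mutating process-global state.
    with torch.cuda.device(0):
        yield

def test_runs_on_device_0(on_cuda_device_0):
    # Tensors created with device="cuda" inside the fixture scope land on cuda:0.
    x = torch.zeros(4, device="cuda")
    assert x.device.index == 0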

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84d2f12 and 1d61b14.

📒 Files selected for processing (24)
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/lora.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/loraOp.cpp (3 hunks)
  • tensorrt_llm/_torch/modules/attention.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/layer.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • tests/unittest/llmapi/lora_test_utils.py
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • tests/unittest/llmapi/lora_test_utils.py
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
🧠 Learnings (2)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/loraOp.cpp
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
🧬 Code graph analysis (18)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_utils.py (1)
  • nvtx_pytorch_emit (54-66)
tensorrt_llm/_torch/pyexecutor/seq_slot_manager.py (1)
  • SeqSlotManager (6-32)
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (2)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
  • PeftCacheManager (231-255)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • PeftCacheManager (1135-1245)
  • is_task_cached_device (1244-1245)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
tensorrt_llm/_torch/peft/lora/layer.py (1)
  • slot_offsets (85-86)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1)
  • ceilDiv (41-44)
tests/unittest/llmapi/lora_test_utils.py (3)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4)
  • CudaGraphLoraParams (30-370)
  • get_slot_counts (313-324)
  • get_offset_from_counts (286-310)
  • get_sorted_indices (175-187)
tensorrt_llm/_torch/peft/lora/layer.py (5)
  • GroupedGemmParamsInput (71-86)
  • LoraLayer (244-1110)
  • compare_grouped_gemm_params (89-163)
  • _prepare_grouped_gemm_buffers_fused (407-474)
  • prepare_grouped_gemm_buffers (285-405)
tensorrt_llm/llmapi/llm_args.py (1)
  • CudaGraphConfig (109-166)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2)
  • tensorrt_llm (23-62)
  • kernels (25-61)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-412)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
cpp/tensorrt_llm/thop/loraOp.cpp (2)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-412)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4)
  • cuda_graph_splitk_grouped_gemm (320-379)
  • cuda_graph_splitk_grouped_gemm (320-323)
  • cuda_graph_grouped_gemm (143-202)
  • cuda_graph_grouped_gemm (143-146)
tensorrt_llm/_torch/pyexecutor/_util.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • _init_cuda_graph_lora_manager (373-391)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (1082-1100)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2)
  • tensorrt_llm (23-75)
  • kernels (25-74)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4)
  • cuda_graph_grouped_gemm (143-202)
  • cuda_graph_grouped_gemm (143-146)
  • cuda_graph_splitk_grouped_gemm (320-379)
  • cuda_graph_splitk_grouped_gemm (320-323)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (6)
tensorrt_llm/lora_manager.py (2)
  • LoraManager (639-1242)
  • LoraModelConfig (241-246)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • AttentionMetadata (40-336)
  • num_seqs (245-249)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • PeftCacheManager (1135-1245)
  • get_and_reset_batch_peft_table (1238-1242)
tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
  • ScheduledRequests (18-39)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (5)
  • AdapterSlotManager (15-137)
  • update_slots (92-120)
  • get_slot_to_task_mapping (122-129)
  • has_slots_changed (131-133)
  • reset_changed_flag (135-137)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (5)
  • CudaGraphLoraParams (30-370)
  • LoraLayerInfo (41-51)
  • update_sorted_indices (189-216)
  • update_weight_pointers (218-283)
  • update_slots_params (326-341)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1)
tensorrt_llm/_torch/peft/lora/layer.py (1)
  • slot_offsets (85-86)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/_utils.py (3)
  • mpi_comm (504-505)
  • mpi_rank (538-545)
  • nvtx_pytorch_emit (54-66)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/layer.py (2)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4)
  • CudaGraphLoraParams (30-370)
  • get_offset_from_counts (286-310)
  • get_layer_params (359-370)
  • get_problem_count (343-357)
cpp/tensorrt_llm/thop/loraOp.cpp (4)
  • lora_grouped_gemm_cuda_graph (178-285)
  • lora_grouped_gemm_cuda_graph (178-190)
  • lora_grouped_gemm (54-176)
  • lora_grouped_gemm (54-58)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (2)
  • CudaGraphLoraManager (23-191)
  • prepare_cuda_graph_lora_params (128-191)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (5)
  • PeftCacheManager (1135-1245)
  • ResourceManager (1097-1132)
  • ResourceManagerType (49-54)
  • get_and_reset_batch_peft_table (1238-1242)
  • get_resource_manager (1109-1110)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • batch_size (35-36)
  • ScheduledRequests (18-39)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • attn_metadata (121-122)
  • needs_capture (193-195)
  • replay (268-302)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • AttentionMetadata (40-336)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1)
  • remove_evicted_slots_in_cpp (83-90)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • tensorrt_llm (39-275)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • tensorrt_llm (49-52)
  • tensorrt_llm (54-168)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/lora_test_utils.py (5)
  • create_mock_nemo_lora_checkpoint (132-242)
  • compare_cuda_graph_lora_params_filler (360-378)
  • CUDAGraphLoRATestParams (246-284)
  • module_count (271-272)
  • slot_count (275-276)
🪛 Ruff (0.13.3)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py

118-118: Unpacked variable evicted_task is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/unittest/llmapi/lora_test_utils.py

288-297: Do not perform function call CUDAGraphLoRATestParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py

144-144: Unused method argument: key

(ARG002)


145-145: Unused method argument: module_output_sizes

(ARG002)


183-183: Unpacked variable sorted_slot_ids is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/peft/lora/layer.py

93-93: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


98-98: Unpacked variable input_hidden_size is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


290-290: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


413-413: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


482-482: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


593-593: f-string without any placeholders

Remove extraneous f prefix

(F541)


610-610: f-string without any placeholders

Remove extraneous f prefix

(F541)


612-612: Undefined name reordered_input

(F821)


618-618: f-string without any placeholders

Remove extraneous f prefix

(F541)


637-637: Undefined name ldb

(F821)


637-637: Undefined name ldb

(F821)


642-642: Undefined name ldd

(F821)


642-642: Undefined name ldd

(F821)


819-819: Undefined name reordered_input

(F821)


820-820: Undefined name in_sizes

(F821)


821-821: Undefined name out_sizes

(F821)


822-822: Undefined name a_offset

(F821)


826-826: Undefined name d_offset

(F821)


830-830: Undefined name d_prime_offset

(F821)


832-832: Undefined name reordered_input

(F821)


840-840: Undefined name lda

(F821)


841-841: Undefined name ldb

(F821)


842-842: Undefined name ldd

(F821)


843-843: Undefined name ldb_prime

(F821)


844-844: Undefined name ldd_prime

(F821)


847-847: Undefined name splitk_offsets

(F821)


969-969: Undefined name splitk_offsets

(F821)


1103-1106: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)

487-490: LGTM! Device cache lookup follows existing patterns.

The isTaskCachedDevice implementation correctly delegates to mDeviceLoraCache->has(taskId), mirroring the pattern used by isTaskCached for host cache and isTaskDoneDevice for device status.


465-466: Verify ensureBatch callers expect taskId-based keys
ensureBatch is only invoked in the kvCacheManager nanobind/pybind wrappers, which forward the result to Python. Since peftTable now uses taskId (not requestId) as its key, confirm all downstream Python consumers of ensure_batch have been updated to handle taskId-based mappings.

cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)

118-118: LGTM! API declaration is consistent with existing patterns.

The isTaskCachedDevice method declaration:

  • Properly uses [[nodiscard]] attribute
  • Correctly marked const
  • Mirrors the existing isTaskDoneDevice pattern
  • Follows naming conventions
tensorrt_llm/executor/worker.py (1)

14-14: LGTM! Profiling instrumentation added correctly.

The nvtx_pytorch_emit decorator addition:

  • Properly imported from _utils
  • Applied to worker_main entry point for profiling visibility
  • Defaults to enabled=False, making it opt-in with no performance impact
  • Aligns with NVTX instrumentation strategy across the PR

Also applies to: 232-232

tensorrt_llm/_utils.py (1)

54-66: LGTM! Decorator factory implementation is correct.

The nvtx_pytorch_emit decorator factory:

  • Follows standard Python decorator factory pattern
  • Correctly uses functools.wraps to preserve function metadata
  • Properly delegates to torch.autograd.profiler.emit_nvtx
  • Safe defaults (enabled=False) ensure no performance impact unless explicitly enabled
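A minimal sketch of the decorator-factory shape described above (parameter names assumed; the actual code in tensorrt_llm/_utils.py may differ):

import functools
import torch

def nvtx_pytorch_emit(enabled: bool = False):
    """Return a decorator that runs the wrapped function under emit_nvtx profiling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # torch.autograd.profiler.emit_nvtx is a context manager; with enabled=False
            # it is effectively a no-op, so the default adds no overhead.
            with torch.autograd.profiler.emit_nvtx(enabled=enabled):
                return func(*args, **kwargs)
        return wrapper
    return decorator

@nvtx_pytorch_emit(enabled=False)  # opt-in: flip to True when tracing with Nsight Systems
def worker_main_stub():
    pass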
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

533-535: LGTM! Python binding follows established patterns.

The is_task_cached_device binding:

  • Correctly uses nb::call_guard<nb::gil_scoped_release>() to release GIL during the call
  • Follows the same pattern as the existing is_task_cached binding
  • Properly exposes the C++ isTaskCachedDevice method to Python
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

526-528: LGTM! Pybind11 binding mirrors nanobind implementation.

The is_task_cached_device binding:

  • Correctly uses py::call_guard<py::gil_scoped_release>() to release GIL
  • Follows the same pattern as existing is_task_cached binding
  • Mirrors the nanobind implementation in the other binding file
  • Properly exposes PeftCacheManager::isTaskCachedDevice to Python
tests/unittest/llmapi/lora_test_utils.py (1)

261-268: Validate slot_id sentinel behavior.

You allow slot_id == slot_count (Line 261). Please confirm this is intentional to represent base model (ignored) rows, given get_slot_counts truncates counts to max_lora_size.

cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1)

99-107: Avoid per-call workspace allocation during CUDA Graph capture
Dynamic calls to at::empty inside capture can break CUDA Graph replay unless the allocator is warmed up or guaranteed capture-safe. Pre-allocate or accept an external workspace (pointer/size), or document that workspace size is zeroed on warmup. Verify that any dynamic allocation here won’t invalidate graph capture.
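For reference, the general "allocate once, replay many" pattern looks roughly like the PyTorch-level sketch below (shapes and names are made up; the C++ path around at::empty may instead need an externally provided workspace argument):

import torch

# Pre-allocate every buffer the captured region will touch.
static_in = torch.zeros(8, 16, device="cuda")
static_out = torch.empty_like(static_in)
workspace = torch.empty(1 << 20, dtype=torch.uint8, device="cuda")  # a real kernel would take this as an argument

# Warm up once outside capture, as recommended before graph capture.
torch.mul(static_in, 2.0, out=static_out)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # Only touch pre-allocated buffers here; no fresh allocations during capture.
    torch.mul(static_in, 2.0, out=static_out)

static_in.fill_(3.0)
graph.replay()  # replays the captured kernels against the same buffers
print(static_out[0, 0].item())  # 6.0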

cpp/tensorrt_llm/thop/loraOp.cpp (1)

205-213: Confirm intent: A' pointers reuse D offsets

a_prime_ptrs_gpu is initialized from d_offsets (Line 210), implying A' = D for the second GEMM. This matches the design (LoRA-out uses intermediate as A). If intentional, LGTM. If not, switch to a_offsets.
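Conceptually this is the standard two-step LoRA matmul, where the second GEMM's A operand is the first GEMM's output; a plain PyTorch analogue with made-up shapes:

import torch

x = torch.randn(4, 64)         # batch of input rows
lora_a = torch.randn(64, 8)    # low-rank "in" weight (rank 8)
lora_b = torch.randn(8, 64)    # low-rank "out" weight

intermediate = x @ lora_a      # first GEMM: D = x @ A
delta = intermediate @ lora_b  # second GEMM reads the previous D as its A operand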

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2046-2080: LoRA manager initialization and routing LGTM
Initialization via _init_cuda_graph_lora_manager in _util.py is present; CUDA Graph and legacy paths are correctly handled.

tests/unittest/llmapi/test_llm_pytorch.py (3)

20-23: LGTM!

The new imports are appropriate for adding CUDA graph LoRA testing capabilities. The utilities from lora_test_utils and replace from dataclasses are used consistently throughout the file.

Also applies to: 46-46


273-326: LGTM!

The test comprehensively covers multiple edge cases for CUDA graph LoRA parameter filling:

  • All requests with the same LoRA ID
  • No LoRA in batch
  • Multiple modules
  • Invalid weight pointers
  • Mixed slot IDs

The use of replace from dataclasses to create test variations is clean and maintainable.


362-365: LGTM!

The widespread updates to test functions follow a consistent pattern:

  • Tests now accept cuda_graph_config parameter
  • Using @test_lora_with_and_without_cuda_graph decorator ensures tests run with both CUDA graph enabled and disabled
  • LLM initializations consistently pass cuda_graph_config
  • Helper functions updated to accept **llm_kwargs for flexibility

This comprehensive approach ensures CUDA graph support is tested across diverse LoRA configurations (LoraConfig variants, NeMo LoRA, CodeLlama LoRA, Gemma, Bielik, etc.).

Also applies to: 368-375, 418-430, 433-445, 448-458, 461-482, 486-516, 519-535, 541-555, 575-617, 636-668, 685-726, 805-852
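For intuition, the underlying pattern is ordinary pytest parametrization over cuda_graph_config; a hypothetical stand-alone example, not the project's helper (the CudaGraphConfig import path is assumed):

import pytest
from tensorrt_llm.llmapi import CudaGraphConfig  # assumed re-export of llm_args.CudaGraphConfig

@pytest.mark.parametrize(
    "cuda_graph_config",
    [None, CudaGraphConfig()],
    ids=["no_cuda_graph", "cuda_graph"],
)
def test_lora_smoke(cuda_graph_config):
    # A real test would pass cuda_graph_config through to LLM(..., cuda_graph_config=cuda_graph_config).
    assert cuda_graph_config is None or isinstance(cuda_graph_config, CudaGraphConfig)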

return 0;
}
} // namespace tensorrt_llm::batch_manager


🛠️ Refactor suggestion | 🟠 Major

Address the TODO about merging C++ and Python LoRA management.

The TODO comment indicates incomplete integration between C++ LoRA caching and Python slot manager. This suggests:

  1. Potential duplication or inconsistency in state management
  2. Unclear ownership and synchronization between the two layers
  3. Risk of cache coherency issues

Consider creating a follow-up issue to track this work and document the current limitations.

Would you like me to open a new issue to track the C++/Python LoRA caching integration work?

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp around line 651, the TODO
about merging C++ and Python LoRA management indicates incomplete integration
that can cause duplicated state, unclear ownership, and cache coherency issues;
create a tracked follow-up issue describing the gap and limitations, then
refactor so that a single layer owns LoRA slot state (either C++ or Python) and
the other acts as a thin synchronized client: define and document ownership
rules, add explicit synchronization primitives or RPC/IPC calls to
update/observe slot state, add unit/integration tests for cache coherence, and
update the TODO with the new issue link and a brief migration plan.

* limitations under the License.
*/

#pragma once

🛠️ Refactor suggestion | 🟠 Major

Add project include guards (in addition to or instead of pragma once).

Use TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H guard per guidelines.

As per coding guidelines

-#pragma once
+#pragma once
+#ifndef TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H
+#define TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H
@@
-} // namespace kernels
-} // namespace tensorrt_llm
+} // namespace kernels
+} // namespace tensorrt_llm
+
+#endif // TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H

Also applies to: 61-62

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h around line 17 (and also
apply the same change for the occurrences at lines 61-62), the header relies
solely on #pragma once; add traditional include guards using the macro name
TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H: insert an #ifndef
TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H / #define TRTLLM_CUDA_GRAPH_GROUPED_GEMM_H at
the top (before any declarations) and a matching #endif at the end of the file,
keeping or removing #pragma once per project preference so the guard macro is
present as required by coding guidelines.

Comment on lines +52 to +55
slot_id = self.task2slot.pop(task_id, default=None)
if slot_id is not None:
self.slots_changed = True
self.slot2task[slot_id] = None

⚠️ Potential issue | 🔴 Critical

Fix positional-only pop call crashing remove_task

OrderedDict.pop takes its default as a positional-only argument (/ in the signature since Python 3.8). Passing it as default= raises TypeError, so any call to remove_task currently blows up instead of returning None. Please switch to the positional form.

-        slot_id = self.task2slot.pop(task_id, default=None)
+        slot_id = self.task2slot.pop(task_id, None)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-slot_id = self.task2slot.pop(task_id, default=None)
+slot_id = self.task2slot.pop(task_id, None)
 if slot_id is not None:
     self.slots_changed = True
     self.slot2task[slot_id] = None
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py around lines 52 to 55,
the call self.task2slot.pop(task_id, default=None) uses the keyword "default"
which is invalid for OrderedDict.pop (positional-only default); change the call
to pass None positionally (self.task2slot.pop(task_id, None)) so remove_task
returns None instead of raising TypeError, keep the rest of the logic intact.

Comment on lines +55 to +67
self.adapter_slot_manager = AdapterSlotManager(max_lora_size, device)
self.lora_model_config = lora_model_config
lora_target_modules = lora_model_config.lora_target_modules
self.target_modules_ids: Optional[tuple[int, ...]] = tuple(
    map(LoraManager.LORA_MODULE_IDS.__getitem__,
        lora_target_modules)) if bool(lora_target_modules) else None
# print(f"config target modules ids: {lora_model_config.lora_target_modules}")
# print(f"converted target modules ids: {self.target_modules_ids}")
if not self.target_modules_ids:
    logger.debug(
        "No LoRA target modules provided in LoRA config, using all modules in PyTorch Module!"
    )


⚠️ Potential issue | 🔴 Critical

Guard Optional lora_model_config.

None will raise here. Handle missing config by enabling all modules.

-        self.lora_model_config = lora_model_config
-        lora_target_modules = lora_model_config.lora_target_modules
-        self.target_modules_ids: Optional[tuple[int, ...]] = tuple(
-            map(LoraManager.LORA_MODULE_IDS.__getitem__,
-                lora_target_modules)) if bool(lora_target_modules) else None
+        self.lora_model_config = lora_model_config
+        lora_target_modules = (lora_model_config.lora_target_modules
+                               if lora_model_config is not None else None)
+        self.target_modules_ids: Optional[tuple[int, ...]] = (
+            tuple(map(LoraManager.LORA_MODULE_IDS.__getitem__,
+                      lora_target_modules))
+            if lora_target_modules else None
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 self.adapter_slot_manager = AdapterSlotManager(max_lora_size, device)
 self.lora_model_config = lora_model_config
-lora_target_modules = lora_model_config.lora_target_modules
-self.target_modules_ids: Optional[tuple[int, ...]] = tuple(
-    map(LoraManager.LORA_MODULE_IDS.__getitem__,
-        lora_target_modules)) if bool(lora_target_modules) else None
+lora_target_modules = (
+    lora_model_config.lora_target_modules
+    if lora_model_config is not None else None
+)
+self.target_modules_ids: Optional[tuple[int, ...]] = (
+    tuple(
+        map(LoraManager.LORA_MODULE_IDS.__getitem__,
+            lora_target_modules)
+    )
+    if lora_target_modules else None
+)
 # print(f"config target modules ids: {lora_model_config.lora_target_modules}")
 # print(f"converted target modules ids: {self.target_modules_ids}")
 if not self.target_modules_ids:
     logger.debug(
         "No LoRA target modules provided in LoRA config, using all modules in PyTorch Module!"
     )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py around lines 55 to
67, the code assumes lora_model_config is non-None and will raise if it's
missing; update the logic to guard against a None lora_model_config by treating
it as “no target modules provided” so all modules are used: first check if
lora_model_config is None and if so set lora_target_modules to a falsy value
(e.g. None or empty list); otherwise read lora_target_modules from
lora_model_config; then compute self.target_modules_ids only when
lora_target_modules is truthy (keep the existing map logic) and leave it as None
when not provided so the subsequent debug message and behavior use all modules.
Ensure no attribute access is performed on a None lora_model_config.

Comment on lines +71 to +74
self.layer_info: Dict[
    CudaGraphLoraParams.LoraLayerKey, CudaGraphLoraParams.
    LoraLayerInfo] | None = None  # TODO: key value struct init from pytorch model's structure, value's content init from sample peft table; use target LoRA modules to only init valid modules
self._initialize_from_model(model)

🛠️ Refactor suggestion | 🟠 Major

Use Python 3.8-compatible typing for unions.

Replace | None with Optional[...] typing.

As per coding guidelines

-        self.layer_info: Dict[
-            CudaGraphLoraParams.LoraLayerKey, CudaGraphLoraParams.
-            LoraLayerInfo] | None = None  # TODO: key value struct init from pytorch model's structure, value's content init from sample peft table; use target LoRA modules to only init valid modules
+        self.layer_info: Optional[Dict[
+            CudaGraphLoraParams.LoraLayerKey, CudaGraphLoraParams.
+            LoraLayerInfo]] = None  # TODO: key value struct init from pytorch model's structure, value's content init from sample peft table; use target LoRA modules to only init valid modules

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py around lines 71 to
74, the type annotation uses the Python 3.10 union operator `| None` which is
not compatible with Python 3.8; change the annotation to use
typing.Optional[...] (e.g., Optional[Dict[CudaGraphLoraParams.LoraLayerKey,
CudaGraphLoraParams.LoraLayerInfo]]) and ensure Optional is imported from typing
at the top of the file, keeping the existing comment and initialization call
intact.

Comment on lines 1177 to 1215
        self._batch_peft_table: Optional[Dict[int, list[
            TaskLayerModuleConfig]]] = None  # task_id -> layer-module-configs mapping for the current batch


🛠️ Refactor suggestion | 🟠 Major

Use Python 3.8-compatible typing for lists.

Replace builtin list[...] with typing.List[...].

As per coding guidelines

-        self._batch_peft_table: Optional[Dict[int, list[
-            TaskLayerModuleConfig]]] = None  # task_id -> layer-module-configs mapping for the current batch
+        self._batch_peft_table: Optional[Dict[int, List[
+            TaskLayerModuleConfig]]] = None  # task_id -> layer-module-configs mapping for the current batch
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        self._batch_peft_table: Optional[Dict[int, list[
-            TaskLayerModuleConfig]]] = None  # task_id -> layer-module-configs mapping for the current batch
+        self._batch_peft_table: Optional[Dict[int, List[
+            TaskLayerModuleConfig]]] = None  # task_id -> layer-module-configs mapping for the current batch
🤖 Prompt for AI Agents
tensorrt_llm/_torch/pyexecutor/resource_manager.py around lines 1177 to 1179:
the type annotation uses Python 3.9+ builtin list[...] which is not compatible
with Python 3.8; replace list[TaskLayerModuleConfig] with
typing.List[TaskLayerModuleConfig] and ensure List is imported (e.g., add List
to the existing typing imports alongside Optional and Dict) so the annotation
becomes Optional[Dict[int, List[TaskLayerModuleConfig]]].

@@ -1,16 +1,24 @@
import json

🛠️ Refactor suggestion | 🟠 Major

Add NVIDIA Apache-2.0 header.

Prepend the standard NVIDIA Apache-2.0 header with the current year to this Python file.

As per coding guidelines

🤖 Prompt for AI Agents
In tests/unittest/llmapi/lora_test_utils.py around lines 1 to 1, the file is
missing the required NVIDIA Apache-2.0 license header; prepend the standard
NVIDIA Apache-2.0 header block at the top of the file (using the current year
2025) before any imports, ensuring the header exactly matches the project's
canonical text and includes the correct copyright line and license URL.
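For reference, a typical SPDX-style Apache-2.0 header for a Python file is sketched below. This is only the common form, shown as an assumption; the exact canonical text and copyright line should be copied from an existing file in the repository rather than from here.

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.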

Comment on lines +247 to +306
    batch_slot_ids: list[int]
    input_hidden_size: int
    slot_ranks: list[int]
    max_lora_rank: int
    output_hidden_sizes: list[int]
    layer_module_mask: torch.Tensor | None | bool
    dtype: torch.dtype
    seed: int

⚠️ Potential issue | 🔴 Critical

Use Python 3.8-compatible typing (avoid PEP 585/604).

Replace builtin generics and union | with typing aliases to keep 3.8 compatibility.

As per coding guidelines

Apply this diff:

-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import List, Tuple, Optional, Union
@@
-@dataclass
-class CUDAGraphLoRATestParams:
-    batch_slot_ids: list[int]
-    input_hidden_size: int
-    slot_ranks: list[int]
-    max_lora_rank: int
-    output_hidden_sizes: list[int]
-    layer_module_mask: torch.Tensor | None | bool
-    dtype: torch.dtype
-    seed: int
+@dataclass
+class CUDAGraphLoRATestParams:
+    batch_slot_ids: List[int]
+    input_hidden_size: int
+    slot_ranks: List[int]
+    max_lora_rank: int
+    output_hidden_sizes: List[int]
+    layer_module_mask: Optional[Union[torch.Tensor, bool]]
+    dtype: torch.dtype
+    seed: int
@@
-) -> tuple[GroupedGemmParamsInput, LoraLayer]:
+) -> Tuple[GroupedGemmParamsInput, LoraLayer]:

Also applies to: 298-298

🤖 Prompt for AI Agents
In tests/unittest/llmapi/lora_test_utils.py around lines 247-254 and line 298,
the type annotations use Python 3.10+ syntax (PEP 585/604) like list[int] and
torch.Tensor | None | bool which breaks 3.8 compatibility; change those to use
typing imports: replace list[int] with typing.List[int], use typing.Union or
Optional for unions (e.g., Optional[torch.Tensor] or Union[torch.Tensor, bool,
None]), and replace torch.dtype with the same typing style if needed by
importing typing.Optional, typing.Union, and typing.List at the top; update the
two locations accordingly so all annotations are compatible with Python 3.8.

Comment on lines +287 to +350
def create_grouped_gemm_params_filler_input(
    test_params: CUDAGraphLoRATestParams = CUDAGraphLoRATestParams(
        batch_slot_ids=[0, 3, 3, 4, 5, 8],
        input_hidden_size=4096,
        slot_ranks=[8, 12, 4, 3] * 2,
        max_lora_rank=64,
        output_hidden_sizes=[4096, 4096],
        layer_module_mask=None,
        dtype=torch.bfloat16,
        seed=42,
    ),
) -> tuple[GroupedGemmParamsInput, LoraLayer]:

🛠️ Refactor suggestion | 🟠 Major

Avoid function calls in default arguments (B008) and build defaults inside.

Default-constructing CUDAGraphLoRATestParams at definition time is unsafe. Use None and initialize inside the function.

-def create_grouped_gemm_params_filler_input(
-    test_params: CUDAGraphLoRATestParams = CUDAGraphLoRATestParams(
-        batch_slot_ids=[0, 3, 3, 4, 5, 8],
-        input_hidden_size=4096,
-        slot_ranks=[8, 12, 4, 3] * 2,
-        max_lora_rank=64,
-        output_hidden_sizes=[4096, 4096],
-        layer_module_mask=None,
-        dtype=torch.bfloat16,
-        seed=42,
-    ),
-) -> tuple[GroupedGemmParamsInput, LoraLayer]:
+def create_grouped_gemm_params_filler_input(
+    test_params: Optional[CUDAGraphLoRATestParams] = None,
+) -> Tuple[GroupedGemmParamsInput, LoraLayer]:
+    if test_params is None:
+        test_params = CUDAGraphLoRATestParams(
+            batch_slot_ids=[0, 3, 3, 4, 5, 8],
+            input_hidden_size=4096,
+            slot_ranks=[8, 12, 4, 3] * 2,
+            max_lora_rank=64,
+            output_hidden_sizes=[4096, 4096],
+            layer_module_mask=None,
+            dtype=torch.bfloat16,
+            seed=42,
+        )

[Based on learnings]

Also applies to: 300-301

🧰 Tools
🪛 Ruff (0.13.3)

288-297: Do not perform function call CUDAGraphLoRATestParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
In tests/unittest/llmapi/lora_test_utils.py around lines 287 to 298 (and
similarly for lines 300-301), the function uses a call to construct
CUDAGraphLoRATestParams as a default argument which is unsafe; change the
signature to accept test_params: Optional[CUDAGraphLoRATestParams] = None and
inside the function check if test_params is None and then instantiate
CUDAGraphLoRATestParams with the same default field values currently used;
ensure you import Optional from typing if needed and update other functions at
lines 300-301 the same way to avoid evaluating constructors at definition time.
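A minimal, self-contained sketch (hypothetical names, not taken from the test file) of why a constructor call in a default argument is risky: the default is evaluated once at definition time, so every call that omits the argument shares the same object, and any mutation leaks across calls.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Params:
    values: List[int]


def risky(params: Params = Params(values=[1, 2, 3])):  # B008: one shared default instance
    params.values.append(0)
    return params.values


def safe(params: Optional[Params] = None):
    if params is None:  # build the default per call instead
        params = Params(values=[1, 2, 3])
    params.values.append(0)
    return params.values


print(risky())  # [1, 2, 3, 0]
print(risky())  # [1, 2, 3, 0, 0] -- mutation leaks across calls
print(safe())   # [1, 2, 3, 0]
print(safe())   # [1, 2, 3, 0]    -- fresh default each call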

Comment on lines +366 to +438
    assert torch.all(
        grouped_gemm_params_filler_input.intermediate_buffer == 0
    ), f"intermediate_buffer is not all zeros: {grouped_gemm_params_filler_input.intermediate_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.output_buffer != 0).sum()} / {grouped_gemm_params_filler_input.output_buffer.numel()}"
    assert torch.all(
        grouped_gemm_params_filler_input.output_buffer == 0
    ), f"output_buffer is not all zeros: {grouped_gemm_params_filler_input.output_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.output_buffer != 0).sum()} / {grouped_gemm_params_filler_input.output_buffer.numel()}"

⚠️ Potential issue | 🟡 Minor

Fix buffer name in error message.

Intermediate buffer assertion message prints counts from output_buffer. Use intermediate_buffer for both.

-    ), f"intermediate_buffer is not all zeros: {grouped_gemm_params_filler_input.intermediate_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.output_buffer != 0).sum()} / {grouped_gemm_params_filler_input.output_buffer.numel()}"
+    ), f"intermediate_buffer is not all zeros: {grouped_gemm_params_filler_input.intermediate_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.intermediate_buffer != 0).sum()} / {grouped_gemm_params_filler_input.intermediate_buffer.numel()}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    assert torch.all(
        grouped_gemm_params_filler_input.intermediate_buffer == 0
-    ), f"intermediate_buffer is not all zeros: {grouped_gemm_params_filler_input.intermediate_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.output_buffer != 0).sum()} / {grouped_gemm_params_filler_input.output_buffer.numel()}"
+    ), f"intermediate_buffer is not all zeros: {grouped_gemm_params_filler_input.intermediate_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.intermediate_buffer != 0).sum()} / {grouped_gemm_params_filler_input.intermediate_buffer.numel()}"
    assert torch.all(
        grouped_gemm_params_filler_input.output_buffer == 0
    ), f"output_buffer is not all zeros: {grouped_gemm_params_filler_input.output_buffer}; non zero / zeros: {(grouped_gemm_params_filler_input.output_buffer != 0).sum()} / {grouped_gemm_params_filler_input.output_buffer.numel()}"
🤖 Prompt for AI Agents
In tests/unittest/llmapi/lora_test_utils.py around lines 366 to 371, the first
assert's failure message mistakenly reports non-zero/total counts for
output_buffer instead of intermediate_buffer; update that f-string to reference
grouped_gemm_params_filler_input.intermediate_buffer for both the printed tensor
and the non-zero/numel counts so the error message correctly shows
intermediate_buffer details.

Implement CUDA Graph compatible multi LoRAs

Signed-off-by: Jiayu Chang <[email protected]>

Refactor CUDA Graph LoRA integration to support precomputed leading dimensions

- Updated `cuda_graph_grouped_gemm` and `cuda_graph_splitk_grouped_gemm` functions to accept leading dimension pointers for A, B, C, and D matrices.
- Modified `LoraImpl` to retrieve and pass leading dimension pointers during GEMM operations.
- Enhanced `CudaGraphLoraParams` to manage leading dimensions for each layer and module.
- Adjusted `CudaGraphLoraManager` to initialize parameters based on actual layer configurations from the PEFT table.
- Improved handling of layer-specific parameters to ensure compatibility with CUDA Graph operations.

This refactor aims to optimize performance by leveraging precomputed leading dimensions, reducing overhead during GEMM execution.

Signed-off-by: Jiayu Chang <[email protected]>
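As an aside, here is a conceptual sketch of the precomputation idea described in the commit above, using hypothetical names rather than the PR's actual API: per-layer leading dimensions are written once into persistent device tensors, so the kernels captured in the CUDA graph only dereference stable device pointers and no host-side preparation is needed at replay time.

import torch

def precompute_leading_dims(layer_shapes, device="cuda"):
    # layer_shapes: list of (m, n, k) per LoRA layer/module, assuming row-major storage,
    # so the leading dimension of A is k and of B/C/D is n. The tensors are kept alive
    # for the graph's lifetime; their addresses can be passed to the grouped GEMM once
    # at capture time and remain valid across replays.
    lds = {
        "lda": torch.tensor([k for _, _, k in layer_shapes], dtype=torch.int64, device=device),
        "ldb": torch.tensor([n for _, n, _ in layer_shapes], dtype=torch.int64, device=device),
        "ldc": torch.tensor([n for _, n, _ in layer_shapes], dtype=torch.int64, device=device),
        "ldd": torch.tensor([n for _, n, _ in layer_shapes], dtype=torch.int64, device=device),
    }
    return lds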

bug fixes

Signed-off-by: Jiayu Chang <[email protected]>

Move input prep to graph

Signed-off-by: Jiayu Chang <[email protected]>

Fix bug in adapter size

Signed-off-by: Jiayu Chang <[email protected]>

Pass all but `test_llama_7b_lora_config_overrides_peft_cache_config` on L40s
Graph seems to capture code outside of the captured function?

Signed-off-by: Jiayu Chang <[email protected]>

Pass all tests

Signed-off-by: Jiayu Chang <[email protected]>

sync slot manager with c++

Signed-off-by: Jiayu Chang <[email protected]>

Update kernel alignment selection

Signed-off-by: Jiayu Chang <[email protected]>

Fix kernel workspace sizes

Signed-off-by: Jiayu Chang <[email protected]>

memcpy use pinned memory; remove assert in slot manager eviction

Signed-off-by: Jiayu Chang <[email protected]>

Add param fill fused kernel

Signed-off-by: Jiayu Chang <[email protected]>

Disable torch nvtx emit

Signed-off-by: Jiayu Chang <[email protected]>

Disable init manager without cuda graph

Signed-off-by: Jiayu Chang <[email protected]>

Update CI

Signed-off-by: Jiayu Chang <[email protected]>

Moved files

Signed-off-by: Jiayu Chang <[email protected]>
Signed-off-by: Jiayu Chang <[email protected]>
@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from d22f2a6 to f14382b on October 14, 2025 17:24
@JyChang012
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21382 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21382 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #16147 completed with status: 'FAILURE'

Signed-off-by: Jiayu Chang <[email protected]>
@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from f14382b to 7d7d903 on October 14, 2025 19:53
@JyChang012
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21388 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21388 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16153 completed with status: 'FAILURE'

Signed-off-by: Jiayu Chang <[email protected]>
@JyChang012
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21435 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21435 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16187 completed with status: 'FAILURE'
