
Conversation

@youngeunkwon0405
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Nov 18, 2025

What does this PR do?

This feature enables visualization of important per-generation-worker performance metrics (currently the in-flight batch (IFB) size and the number of pending requests).
We can expand it to visualize other metrics, such as KV-cache or speculative-decoding statistics (if you employ those). In principle, any of the detailed performance metrics listed in https://docs.vllm.ai/en/v0.7.2/serving/metrics.html could be surfaced this way.

Here is a visualization example:

[Image: example visualization of per-worker inflight batch size timelines]

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
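For illustration, a hypothetical usage sketch (not part of the PR itself): enable the logger through the generation config keys introduced here (enable_vllm_metrics_logger, vllm_metrics_logger_interval in examples/configs/grpo_math_1B.yaml) and read the collected metrics after a generation step via the new VllmGeneration methods. Exact wiring may differ from the final implementation.

def report_vllm_logger_metrics(policy_generation) -> None:
    """Collect and print per-DP-worker vLLM metrics after one generation step.

    ``policy_generation`` is assumed to be a VllmGeneration instance exposing
    the get/clear methods added in this PR.
    """
    metrics = policy_generation.get_vllm_logger_metrics()
    # Each entry maps dp_rank -> list of values sampled every
    # vllm_metrics_logger_interval seconds.
    print(metrics.get("inflight_batch_sizes", {}))
    print(metrics.get("num_pending_samples", {}))
    # Start a fresh logging cycle before the next generation step.
    policy_generation.clear_vllm_logger_metrics()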

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added vLLM metrics collection and injection into training metrics so performance stats are captured per generation step.
    • New configurable settings to enable vLLM metrics logging and adjust collection interval.
    • Background metrics logging on model-owner workers with optional Prometheus-compatible hooks.
    • Enhanced performance visualization to display vLLM metrics (inflight batch sizes, pending samples) per data-parallel shard.

@coderabbitai
Contributor

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough


Adds optional vLLM metrics logging: config flags, a background metrics logger in vLLM workers, RPCs to fetch/clear collected metrics, propagation of those metrics through GRPO training, and visualization support in performance output.

Changes

  • Configuration (examples/configs/grpo_math_1B.yaml): Added enable_vllm_metrics_logger: false and vllm_metrics_logger_interval: 0.5.
  • Generation API (nemo_rl/models/generation/vllm/vllm_generation.py): Added get_vllm_logger_metrics() and clear_vllm_logger_metrics() to collect/reset per-DP vLLM metrics via worker RPCs.
  • vLLM Worker, sync (nemo_rl/models/generation/vllm/vllm_worker.py): Spawns a background metrics logger thread when enabled; implements _start_vllm_metrics_logger(), storage for inflight_batch_sizes and num_pending_samples, and public get_/clear_vllm_logger_metrics() methods (see the sketch after this list).
  • vLLM Worker, async (nemo_rl/models/generation/vllm/vllm_worker_async.py): Conditionally constructs stat_loggers (using PrometheusStatLogger) and passes them into AsyncLLM.from_engine_args(...).
  • Training integration (nemo_rl/algorithms/grpo.py): Clears logger metrics at generation start/cycle boundaries, collects vllm_logger_metrics after generation, and injects them into per-step/batch training metrics under "vllm_logger_metrics".
  • Visualization (nemo_rl/algorithms/utils.py): Extends the visualization to render vLLM metrics when present: dynamic bar length, timelines for inflight_batch_sizes and num_pending_samples, and helper timeline-rendering functions; uses the configured interval for time estimates.
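The sync-worker change above boils down to a small polling thread. Below is a minimal sketch of that pattern, assuming the engine exposes callables for the current in-flight batch size and pending-request count; this is illustrative only and not the PR's actual _start_vllm_metrics_logger implementation.

import threading
import time


class MetricsLoggerSketch:
    """Illustrative background poller, not the PR's actual worker code.

    ``read_inflight`` / ``read_pending`` are assumed callables returning the
    engine's current in-flight batch size and pending-request count.
    """

    def __init__(self, read_inflight, read_pending, interval_s: float = 0.5):
        self.inflight_batch_sizes: list[int] = []
        self.num_pending_samples: list[int] = []
        self._read_inflight = read_inflight
        self._read_pending = read_pending
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                self.inflight_batch_sizes.append(self._read_inflight())
                self.num_pending_samples.append(self._read_pending())
            except Exception:
                # Engine may not be ready yet; skip this sample.
                pass
            time.sleep(self._interval_s)

    def snapshot(self) -> dict[str, list[int]]:
        return {
            "inflight_batch_sizes": list(self.inflight_batch_sizes),
            "num_pending_samples": list(self.num_pending_samples),
        }

    def clear(self) -> None:
        self.inflight_batch_sizes.clear()
        self.num_pending_samples.clear()

    def stop(self) -> None:
        self._stop.set()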

Sequence Diagram(s)

sequenceDiagram
    participant GRPO as GRPO Training
    participant Gen as VllmGeneration
    participant Worker as VllmGenerationWorker
    participant vLLM as vLLM Engine

    rect rgb(240,248,255)
    GRPO->>Gen: request generation (start step)
    Gen->>Worker: RPC clear_vllm_logger_metrics()
    Worker-->>Gen: ack
    end

    rect rgb(245,255,240)
    Gen->>vLLM: perform generation (async/sync)
    vLLM-->>Worker: metrics exposed via Gauge
    Worker->>Worker: background thread polls metrics (interval)
    end

    rect rgb(255,250,240)
    GRPO->>Gen: generation finished
    Gen->>Worker: RPC get_vllm_logger_metrics()
    Worker-->>Gen: {inflight_batch_sizes, num_pending_samples}
    Gen-->>GRPO: return generation results + vllm_logger_metrics
    GRPO->>GRPO: attach metrics to training step metrics
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to:
    • Thread lifecycle, stop conditions, and exception handling in vllm_worker.py.
    • Correctness of RPC usage and serialization of metric snapshots in vllm_generation.py.
    • Integration point in grpo.py for when metrics are cleared/collected relative to generation and refit/sync.
    • The AsyncLLM.from_engine_args call-site change in vllm_worker_async.py and compatibility with the async engine API.

Possibly related PRs

  • feat: Adding perf metrics #1183 — Modifies GRPO training flows and runtime metrics collection; overlaps with metrics collection and propagation changes in this PR.

Suggested labels

Performance

Suggested reviewers

  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 44.44%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes ⚠️ Warning: The PR introduces major features (metrics logging, background thread processing, visualization) without test results, performance measurements, or testing documentation. Resolution: add comprehensive unit tests for the metrics methods and visualization functions, provide performance benchmarks for the background thread overhead, validate training convergence with metrics enabled, resolve the thread-synchronization issues identified in review, and update the PR description with actual test results.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately summarizes the main feature additions: per-worker active/idle timeline visualization and IFB (in-flight batch) size logging.
  • Description check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@youngeunkwon0405 youngeunkwon0405 added the CI:L1 Run doctests, unit tests, and functional tests label Nov 18, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as draft November 18, 2025 06:56
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_rl/algorithms/utils.py (1)

396-518: Guard vLLM metrics visualization against disabled logger and empty data; address Ruff hints

Two functional issues plus a few style nits here:

  1. KeyError when vLLM metrics logger is disabled
  • VllmGeneration.get_vllm_logger_metrics returns {} when enable_vllm_metrics_logger is false.

  • GRPO unconditionally sets metrics["vllm_logger_metrics"] = vllm_logger_metrics, so you’ll see {} in metrics.

  • The check if vllm_logger_metrics is not None: will pass for {}, and then:

    vllm_metrics_logger_interval = master_config["policy"]["generation"]["vllm_cfg"]["vllm_metrics_logger_interval"]

    will raise KeyError for configs that don’t define vllm_metrics_logger_interval (all existing recipes except the one you updated).

  2. visualize_per_worker_timeline can crash on empty metrics
  • If metric_dict is {} or some lists inside it are empty, this line:

    max_value = max(max(v) for v in metric_dict.values())

    will raise ValueError.

  3. Ruff hints (F841/E731/B007)
  • dp_ranks is unused.
  • count_zeros is a lambda assigned to a name.
  • The i loop index is unused.

A minimal, robust fix that also satisfies Ruff:

@@
-    def resize_timeline(data, new_size):
+    def resize_timeline(data, new_size):
         old_size = len(data)
         x_old = np.linspace(0, 1, old_size)
         x_new = np.linspace(0, 1, new_size)
         return np.interp(x_new, x_old, data)
@@
-    def visualize_per_worker_timeline(
-        metric_dict: dict[int, list[int]],
-        metric_name: str,
-        timeline_interval: float | None,
-    ) -> None:
-        dp_ranks = list(metric_dict.keys())
+    def visualize_per_worker_timeline(
+        metric_dict: dict[int, list[int]],
+        metric_name: str,
+        timeline_interval: float | None,
+    ) -> None:
         max_timeline_length = 50
         marker = {0: "□", 1: "⧅", 2: "⛝", 3: "■"}
 
-        max_value = max(max(v) for v in metric_dict.values())
+        # Skip visualization cleanly if there is no data yet
+        non_empty_values = [v for v in metric_dict.values() if v]
+        if not non_empty_values:
+            print(f"  - {metric_name}: no data to display yet")
+            return
+
+        max_value = max(max(v) for v in non_empty_values)
         bin_width = (max_value + 1) / len(marker)
@@
-        print(f"  - {metric_name}:")
+        print(f"  - {metric_name}:")
         print(f"    - Max value: {max_value}")
         print("    - Timeline:")
         for dp_idx, metric_values in metric_dict.items():
             timeline = []
             length = len(metric_values)
             if timeline_interval is not None:
-                count_zeros = lambda x: sum(v == 0 for v in x)
-                idle = count_zeros(metric_values) * timeline_interval
+                def count_zeros(values: list[int]) -> int:
+                    return sum(v == 0 for v in values)
+
+                idle = count_zeros(metric_values) * timeline_interval
                 active = length * timeline_interval - idle
@@
-            else:
-                resized_metric_values = metric_values
-
-            for i, value in enumerate(resized_metric_values):
+            else:
+                resized_metric_values = metric_values
+
+            for value in resized_metric_values:
                 timeline.append(marker[min(int(value // bin_width), len(marker) - 1)])
@@
-    if "vllm_logger_metrics" in metrics:
+    if "vllm_logger_metrics" in metrics:
@@
-        vllm_logger_metrics = metrics["vllm_logger_metrics"]
-
-        if vllm_logger_metrics is not None:
-            vllm_metrics_logger_interval = master_config["policy"]["generation"][
-                "vllm_cfg"
-            ]["vllm_metrics_logger_interval"]
-            print("  • vLLM Logger Metrics:")
-            # Visualize the inflight batch sizes timeline
-            visualize_per_worker_timeline(
-                vllm_logger_metrics["inflight_batch_sizes"],
-                "Inflight Batch Sizes",
-                vllm_metrics_logger_interval,
-            )
-            max_num_pending_samples = max(
-                max(v) for v in vllm_logger_metrics["num_pending_samples"].values()
-            )
-            # If there is at least one pending sample, visualize the timeline
-            if max_num_pending_samples > 0:
-                visualize_per_worker_timeline(
-                    vllm_logger_metrics["num_pending_samples"],
-                    "Num Pending Samples",
-                    None,
-                )
+        vllm_logger_metrics = metrics["vllm_logger_metrics"]
+
+        # Treat {}, None, or missing metrics as "no data" and skip visualization
+        if vllm_logger_metrics:
+            vllm_cfg = master_config["policy"]["generation"].get("vllm_cfg", {})
+            vllm_metrics_logger_interval = vllm_cfg.get(
+                "vllm_metrics_logger_interval", 1.0
+            )
+            print("  • vLLM Logger Metrics:")
+
+            inflight_metrics = vllm_logger_metrics.get("inflight_batch_sizes", {})
+            if inflight_metrics:
+                visualize_per_worker_timeline(
+                    inflight_metrics,
+                    "Inflight Batch Sizes",
+                    vllm_metrics_logger_interval,
+                )
+
+            pending_metrics = vllm_logger_metrics.get("num_pending_samples", {})
+            if pending_metrics:
+                max_num_pending_samples = max(
+                    max(v) for v in pending_metrics.values()
+                )
+                if max_num_pending_samples > 0:
+                    visualize_per_worker_timeline(
+                        pending_metrics,
+                        "Num Pending Samples",
+                        None,
+                    )

This keeps the visualization logic intact but prevents crashes when the logger is disabled or has not yet produced data, and it resolves the Ruff warnings.

nemo_rl/algorithms/grpo.py (1)

1068-1122: Guard vLLM‑specific metric calls in grpo_train to avoid crashes on non‑vLLM backends

In grpo_train, the new calls:

policy_generation.clear_vllm_logger_metrics()
...
vllm_logger_metrics = policy_generation.get_vllm_logger_metrics()
...
metrics["vllm_logger_metrics"] = vllm_logger_metrics

are unconditionally invoked on policy_generation.

However, earlier in the function you explicitly allow policy_generation to be policy (megatron backend):

if policy_generation is None:
    policy_generation = policy  # type: ignore
    NEED_REFIT = False

Policy does not implement clear_vllm_logger_metrics / get_vllm_logger_metrics, so with a non‑vLLM backend this will raise AttributeError on every step.

A minimal fix is to detect support for these methods once and gate usage accordingly, while only attaching non‑empty metrics:

@@
-    if policy_generation is None:
-        policy_generation = policy  # type: ignore
-        NEED_REFIT = False
-    POLICY_GENERATION_STALE = True  # tracks if generation needs a refit before running
-    assert policy_generation is not None  # for mypy type check
+    if policy_generation is None:
+        policy_generation = policy  # type: ignore
+        NEED_REFIT = False
+    POLICY_GENERATION_STALE = True  # tracks if generation needs a refit before running
+    assert policy_generation is not None  # for mypy type check
+
+    # vLLM metrics are only available on vLLM-based generation backends
+    supports_vllm_logger_metrics = hasattr(
+        policy_generation, "get_vllm_logger_metrics"
+    ) and hasattr(policy_generation, "clear_vllm_logger_metrics")
@@
-                with timer.time("generation"):
-                    # Clear vLLM logger metrics for each generation step
-                    policy_generation.clear_vllm_logger_metrics()
+                with timer.time("generation"):
+                    # Clear vLLM logger metrics for each generation step (if supported)
+                    if supports_vllm_logger_metrics:
+                        policy_generation.clear_vllm_logger_metrics()
@@
-                    policy_generation.finish_generation()
-                    # Collect vLLM logger metrics for performance reporting after each generation step
-                    # inflight batch sizes and num pending samples are collected from each vLLM worker
-                    vllm_logger_metrics = policy_generation.get_vllm_logger_metrics()
+                    policy_generation.finish_generation()
+                    # Collect vLLM logger metrics for performance reporting (if supported)
+                    # inflight batch sizes and num pending samples are collected from each vLLM worker
+                    vllm_logger_metrics = (
+                        policy_generation.get_vllm_logger_metrics()
+                        if supports_vllm_logger_metrics
+                        else None
+                    )
@@
-                metrics.update(rollout_metrics)
-                metrics["vllm_logger_metrics"] = vllm_logger_metrics
+                metrics.update(rollout_metrics)
+                if vllm_logger_metrics:
+                    metrics["vllm_logger_metrics"] = vllm_logger_metrics

This keeps vLLM metrics for vLLM backends, while leaving Megatron (and any other non‑vLLM backend) unaffected.

Also applies to: 1340-1341

🧹 Nitpick comments (4)
nemo_rl/algorithms/grpo.py (1)

1927-1929: Async GRPO vLLM metrics wiring is reasonable; consider the same hasattr guard for robustness

In async_grpo_train:

  • You clear metrics once at the start and after each refit:
    policy_generation.clear_vllm_logger_metrics()
    ...
    policy_generation.clear_vllm_logger_metrics()
  • You collect them once per training step and attach to metrics:
    vllm_logger_metrics = policy_generation.get_vllm_logger_metrics()
    ...
    metrics["vllm_logger_metrics"] = vllm_logger_metrics

Given the function starts with:

assert _should_use_async_rollouts(master_config)

and async rollouts require vLLM, this is logically consistent and should work as intended.

For symmetry with grpo_train and to make the code more robust to future changes in how async_grpo_train is invoked, you could add the same capability check and only attach non‑empty metrics:

@@
-    POLICY_GENERATION_STALE = True
-    assert policy_generation is not None
+    POLICY_GENERATION_STALE = True
+    assert policy_generation is not None
+
+    supports_vllm_logger_metrics = hasattr(
+        policy_generation, "get_vllm_logger_metrics"
+    ) and hasattr(policy_generation, "clear_vllm_logger_metrics")
@@
-    # Clear vLLM logger metrics after at start of training
-    policy_generation.clear_vllm_logger_metrics()
+    # Clear vLLM logger metrics at start of training (if supported)
+    if supports_vllm_logger_metrics:
+        policy_generation.clear_vllm_logger_metrics()
@@
-                # Collect vLLM logger metrics for performance reporting
-                # inflight batch sizes and num pending samples are collected from each vLLM worker
-                vllm_logger_metrics = policy_generation.get_vllm_logger_metrics()
+                # Collect vLLM logger metrics for performance reporting (if supported)
+                # inflight batch sizes and num pending samples are collected from each vLLM worker
+                vllm_logger_metrics = (
+                    policy_generation.get_vllm_logger_metrics()
+                    if supports_vllm_logger_metrics
+                    else None
+                )
@@
-                # Clear vLLM logger metrics after each refit (weight sync), starting a new logging cycle
-                policy_generation.clear_vllm_logger_metrics()
+                # Clear vLLM logger metrics after each refit (weight sync), starting a new logging cycle
+                if supports_vllm_logger_metrics:
+                    policy_generation.clear_vllm_logger_metrics()
@@
-                metrics.update(rollout_metrics)
-                metrics["vllm_logger_metrics"] = vllm_logger_metrics
+                metrics.update(rollout_metrics)
+                if vllm_logger_metrics:
+                    metrics["vllm_logger_metrics"] = vllm_logger_metrics

This is mostly defensive, but keeps behavior clear and aligned with the sync GRPO path.

Also applies to: 2148-2151, 2172-2174, 2252-2253

nemo_rl/models/generation/vllm/vllm_generation.py (1)

822-865: vLLM logger metrics collection is well-structured; tighten zip with strict=True

The DP-leader fan-out (get_dp_leader_worker_idx + run_single_worker_single_data) and aggregation into "inflight_batch_sizes" / "num_pending_samples" per dp_idx look correct and are properly gated on enable_vllm_metrics_logger.

The project supports Python 3.10 as the minimum version, so you can safely use zip(..., strict=True) to catch any future mismatch between dp_indices and results, satisfying Ruff (B905):

-        for dp_idx, stats in zip(dp_indices, results):
+        for dp_idx, stats in zip(dp_indices, results, strict=True):
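For context, a rough sketch of the DP-leader fan-out and aggregation pattern described above. The worker-group call signatures here are assumptions for illustration; the PR's actual code in vllm_generation.py may differ.

def collect_vllm_logger_metrics(worker_group, dp_size: int) -> dict[str, dict[int, list[int]]]:
    """Illustrative aggregation; helper signatures are assumptions."""
    aggregated: dict[str, dict[int, list[int]]] = {
        "inflight_batch_sizes": {},
        "num_pending_samples": {},
    }
    dp_indices = list(range(dp_size))
    # Ask each DP leader worker for its collected metrics snapshot.
    results = [
        worker_group.run_single_worker_single_data(
            "get_vllm_logger_metrics",
            worker_group.get_dp_leader_worker_idx(dp_idx),
        )
        for dp_idx in dp_indices
    ]
    for dp_idx, stats in zip(dp_indices, results, strict=True):
        aggregated["inflight_batch_sizes"][dp_idx] = stats["inflight_batch_sizes"]
        aggregated["num_pending_samples"][dp_idx] = stats["num_pending_samples"]
    return aggregated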
nemo_rl/models/generation/vllm/vllm_worker.py (2)

374-374: Extract magic number to a named constant.

The initial 2.0-second delay is hardcoded. Consider making it a configuration parameter or at minimum a named constant for clarity.

+        # Initial delay to allow vLLM engine to stabilize before collecting metrics
+        METRICS_LOGGER_INITIAL_DELAY_S = 2.0
+
         def _logger_loop():
             # Delay a little to let engine settle
-            time.sleep(min(2.0, interval_s))
+            time.sleep(min(METRICS_LOGGER_INITIAL_DELAY_S, interval_s))

380-382: Use proper logging framework instead of print statements.

The code uses print() for logging, which is inconsistent with the rest of the codebase (e.g., line 165 uses init_logger). Print statements also lack log levels, timestamps, and integration with log aggregation systems.

Consider initializing a logger at the module level or class level:

from vllm.logger import init_logger
logger = init_logger(__name__)

Then replace print statements with appropriate log levels:

  • logger.warning() for the exception handlers
  • logger.info() for the startup messages

Note: The emoji decorators may not render well in all environments; consider removing them or making them optional.

Also applies to: 397-400, 404-407, 411-413, 420-422
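For illustration, a small sketch of what the suggested logger-based loop might look like. Only init_logger is taken from vLLM as cited above; the loop shape and the poll_once callable are illustrative, not the PR's actual code.

import threading
import time

from vllm.logger import init_logger

logger = init_logger(__name__)


def metrics_logger_loop(
    poll_once, interval_s: float, stop_event: threading.Event
) -> None:
    """Background polling loop using log levels instead of print()."""
    logger.info("vLLM metrics logger started (interval=%ss)", interval_s)
    while not stop_event.is_set():
        try:
            poll_once()
        except Exception as exc:  # keep the background thread alive on errors
            logger.warning("vLLM metrics logger poll failed: %s", exc)
        time.sleep(interval_s)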

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c32778d and e419d91.

📒 Files selected for processing (6)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (7 hunks)
  • nemo_rl/algorithms/utils.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_generation.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • nemo_rl/algorithms/grpo.py
🧬 Code graph analysis (4)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
nemo_rl/models/generation/vllm/vllm_worker.py (2)
  • get_vllm_logger_metrics (425-432)
  • clear_vllm_logger_metrics (434-438)
nemo_rl/distributed/worker_groups.py (4)
  • dp_size (627-629)
  • get_dp_leader_worker_idx (404-411)
  • run_single_worker_single_data (631-656)
  • run_all_workers_single_data (755-799)
nemo_rl/algorithms/utils.py (1)
tests/check_metrics.py (2)
  • min (25-27)
  • max (30-32)
nemo_rl/algorithms/grpo.py (2)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
  • clear_vllm_logger_metrics (858-865)
  • get_vllm_logger_metrics (822-856)
nemo_rl/models/generation/vllm/vllm_worker.py (2)
  • clear_vllm_logger_metrics (434-438)
  • get_vllm_logger_metrics (425-432)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
  • get_vllm_logger_metrics (822-856)
  • clear_vllm_logger_metrics (858-865)
🪛 GitHub Actions: CICD NeMo RL
nemo_rl/models/generation/vllm/vllm_worker_async.py

[error] 1-1: Pre-commit hook 'ruff' failed. No such file or directory (os error 2). 1 file was modified by this hook. Command: 'pre-commit run --all-files --show-diff-on-failure --color=always'.

🪛 Ruff (0.14.5)
nemo_rl/models/generation/vllm/vllm_generation.py

844-844: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_rl/algorithms/utils.py

459-459: Local variable dp_ranks is assigned to but never used

Remove assignment to unused variable dp_ranks

(F841)


473-473: Do not assign a lambda expression, use a def

Rewrite count_zeros as a def

(E731)


483-483: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

nemo_rl/models/generation/vllm/vllm_worker.py

378-378: Do not catch blind exception: Exception

(BLE001)


396-396: Do not catch blind exception: Exception

(BLE001)


402-402: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-container / main
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
🔇 Additional comments (4)
examples/configs/grpo_math_1B.yaml (1)

232-233: vLLM metrics logger config looks reasonable

The new enable_vllm_metrics_logger flag and vllm_metrics_logger_interval value are consistent with the runtime wiring in the vLLM generation/worker code and provide a sensible default (disabled, 0.5s interval).
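For illustration, a hedged sketch of reading these two keys with the same defaults as the example config; the nested-dict access is an assumption and the PR's actual plumbing may differ.

def read_vllm_metrics_logger_cfg(master_config: dict) -> tuple[bool, float]:
    """Sketch: read the new keys with the defaults shown in grpo_math_1B.yaml."""
    vllm_cfg = (
        master_config.get("policy", {}).get("generation", {}).get("vllm_cfg", {})
    )
    return (
        vllm_cfg.get("enable_vllm_metrics_logger", False),
        vllm_cfg.get("vllm_metrics_logger_interval", 0.5),
    )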

nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

129-148: AsyncLLM stat_loggers integration is correct and compatible

vLLM v1 exposes PrometheusStatLogger at vllm.v1.metrics.loggers.PrometheusStatLogger and AsyncLLM.from_engine_args accepts a stat_loggers parameter. The implementation correctly gates the logger on the enable_vllm_metrics_logger config flag and passes it to AsyncLLM.from_engine_args, keeping the async path clean when disabled.
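A minimal sketch of that conditional wiring is below. The AsyncLLM import path and passing the PrometheusStatLogger class directly as a stat-logger factory are assumptions; vLLM's v1 API may expect wrapped factories, and the PR's actual call site may differ.

from vllm.v1.engine.async_llm import AsyncLLM  # import path is an assumption
from vllm.v1.metrics.loggers import PrometheusStatLogger


def build_async_llm(engine_args, vllm_cfg: dict):
    stat_loggers = None
    if vllm_cfg.get("enable_vllm_metrics_logger", False):
        # Only attach the Prometheus logger when the feature is enabled so the
        # disabled path stays untouched.
        stat_loggers = [PrometheusStatLogger]
    return AsyncLLM.from_engine_args(engine_args, stat_loggers=stat_loggers)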

nemo_rl/models/generation/vllm/vllm_worker.py (2)

19-20: LGTM!

The threading and time imports are appropriate for the background metrics logger functionality.


332-338: LGTM!

The conditional initialization correctly gates the metrics logger feature on both the enable flag and async engine requirement.

@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 18, 2025
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
@youngeunkwon0405 youngeunkwon0405 force-pushed the youngeunk/vllm-info branch 2 times, most recently from 7b53c9a to 39947a2 on November 18, 2025 22:46
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 18, 2025
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 18, 2025
Signed-off-by: Youngeun Kwon <[email protected]>
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 19, 2025
Signed-off-by: Youngeun Kwon <[email protected]>
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 19, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review November 20, 2025 17:35
@youngeunkwon0405
Contributor Author

Hi @parthchadha and @terrykong, this PR is now ready for review and merge.
Can I ask for your review please?

This PR provides the most critical visibility feature for vLLM performance: the per-worker "inflight batch size" timeline and the pending-queue status.

@parthchadha
Contributor

Thanks for this feature @youngeunkwon0405, it's super cool to be able to see how the rollout workload is distributed across the cluster!

use_deep_gemm: False
num_last_layers_in_bf16: 0
num_first_layers_in_bf16: 0
enable_vllm_metrics_logger: true # Set to true to enable vLLM internal metrics logger, turn off for better performance
Contributor

Do we know how much of a performance hit we take with metrics enabled?

Contributor Author

It wasn't noticeable to me, but theoretically it can add some overhead. I will run some iso-config runs in our perf tracker, which exposes generation time, and see how much perf impact there is.

Contributor Author

I checked the perf difference on the Qwen3 30B model; there was no noticeable difference.

Signed-off-by: Youngeun Kwon <[email protected]>
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 20, 2025
@youngeunkwon0405
Contributor Author

Hi @parthchadha, all of your initial comments have been addressed. Could I ask for your final review, and a merge if everything looks good to you?
