
Add CloudXRLauncher for programmatic runtime lifecycle management#353

Merged
rwiltz merged 8 commits into main from rwiltz/launch-cxr-runtime
Apr 9, 2026

Conversation

@rwiltz
Contributor

@rwiltz rwiltz commented Apr 3, 2026

Extracts the CloudXR runtime + WSS proxy launch logic from main.py into a reusable CloudXRLauncher class that embedding applications (e.g. Isaac Lab Teleop) can drive without a separate terminal.

  • CloudXRLauncher provides start()/stop() and context-manager APIs to manage the full runtime lifecycle programmatically
  • Runtime is spawned via subprocess.Popen with start_new_session=True to isolate it in its own process group, avoiding CUDA context conflicts with the host application
  • WSS proxy runs in a background daemon thread with a dedicated event loop
  • Stale runtime cleanup detects and removes leftover IPC sockets before launching
  • Adds wait_for_runtime_ready_sync as a synchronous polling alternative to the existing async variant
  • Promotes runtime timeout constants to public module-level symbols
  • Refactors main.py to delegate to CloudXRLauncher, keeping the CLI as a thin entry point
  • Adds cloudxr_tests test package (27 tests) covering launcher lifecycle, sentinel polling, process termination escalation, and stale cleanup
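The lifecycle described above can be sketched as a minimal, self-contained launcher following the same pattern (process-group spawn, readiness-sentinel polling, group termination with escalation). The class and helper names below are illustrative stand-ins, not the PR's actual CloudXRLauncher implementation:

```python
# Minimal sketch of the pattern described above; MiniLauncher and its
# helpers are hypothetical names, not the PR's actual implementation.
import os
import signal
import subprocess
import sys
import tempfile
import time


class MiniLauncher:
    def __init__(self, sentinel_path, timeout_sec=10.0, poll_interval_sec=0.05):
        self._sentinel = sentinel_path
        self._timeout = timeout_sec
        self._poll = poll_interval_sec
        self._proc = None

    def start(self):
        # Stale-artifact cleanup before launching, mirroring the PR's
        # stale runtime cleanup step.
        if os.path.exists(self._sentinel):
            os.remove(self._sentinel)
        # start_new_session=True puts the child in its own process group,
        # so stop() can signal the whole group without touching the host.
        child_code = (
            "import sys, time\n"
            "open(sys.argv[1], 'w').close()\n"  # write readiness sentinel
            "time.sleep(60)\n"                  # then idle like a server
        )
        self._proc = subprocess.Popen(
            [sys.executable, "-c", child_code, self._sentinel],
            start_new_session=True,
        )
        self._wait_ready_sync()

    def _wait_ready_sync(self):
        # Synchronous polling, safe even when the embedding application
        # already runs an asyncio event loop in the main thread.
        deadline = time.monotonic() + self._timeout
        while time.monotonic() < deadline:
            if os.path.exists(self._sentinel):
                return
            if self._proc.poll() is not None:
                raise RuntimeError("runtime exited before becoming ready")
            time.sleep(self._poll)
        raise TimeoutError("runtime did not become ready in time")

    def stop(self):
        if self._proc is None:
            return
        try:
            # SIGTERM the whole process group; escalate to SIGKILL below.
            os.killpg(os.getpgid(self._proc.pid), signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
        try:
            self._proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(self._proc.pid), signal.SIGKILL)
            self._proc.wait()
        self._proc = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()
        return False


# Demo: start via context manager, confirm readiness, then stop.
sentinel = os.path.join(tempfile.gettempdir(), "mini_launcher_demo_sentinel")
launcher = MiniLauncher(sentinel)
with launcher:
    sentinel_seen = os.path.exists(sentinel)
stopped = launcher._proc is None
if os.path.exists(sentinel):
    os.remove(sentinel)
```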

Summary by CodeRabbit

  • New Features

    • Introduced CloudXRLauncher class for flexible, reusable runtime lifecycle management supporting both context-manager and explicit control patterns.
  • Improvements

    • Streamlined application startup and shutdown with improved signal handling and runtime readiness coordination.
  • Tests

    • Added comprehensive test suite validating launcher initialization, runtime startup/stop, and lifecycle edge cases.

@coderabbitai

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c32b822d-bf41-4f40-8ee7-5182734cfff1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request introduces a new CloudXRLauncher class that encapsulates the lifecycle management of CloudXR runtime and WSS TLS proxy components. The launcher replaces the prior async-based orchestration in __main__.py with a class-driven approach supporting both context-manager and explicit start()/stop() patterns. Supporting utilities are added to runtime.py for synchronous readiness polling. A comprehensive test suite is introduced under a new cloudxr_tests directory with pytest configuration and unit tests for launcher and runtime functionality.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Main as __main__.py
    participant Launcher as CloudXRLauncher
    participant Runtime as Runtime Process
    participant WSS as WSS Proxy
    participant FS as Filesystem

    User->>Main: Invoke program
    Main->>Launcher: start()
    Launcher->>Launcher: Build EnvConfig
    Launcher->>Launcher: Check/accept EULA
    Launcher->>FS: Ensure logs dir exists
    Launcher->>Launcher: Cleanup stale artifacts
    Launcher->>Runtime: subprocess.Popen (start_new_session)
    Launcher->>Runtime: wait_for_runtime_ready_sync()
    Runtime->>FS: Write runtime_started sentinel
    FS->>Launcher: Sentinel detected
    Launcher->>WSS: Start WSS in daemon thread
    Launcher->>Launcher: Register stop() via atexit
    Main->>Launcher: signal.pause() (await shutdown)
    User->>Main: SIGINT/SIGTERM
    Main->>Launcher: stop()
    Launcher->>WSS: Signal stop via stop_future
    Launcher->>Runtime: os.killpg(SIGTERM)
    Launcher->>Runtime: Wait for termination
    alt Timeout
        Launcher->>Runtime: os.killpg(SIGKILL)
    end
    Runtime->>Main: Terminated
    Main->>User: Exit
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.30%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Add CloudXRLauncher for programmatic runtime lifecycle management' accurately summarizes the main change: introducing a new CloudXRLauncher class that enables programmatic lifecycle management of the CloudXR runtime. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch 2 times, most recently from 6f39cf4 to 2f2bf70 Compare April 7, 2026 21:27
@rwiltz rwiltz changed the title from "Working launcher.py script" to "Add CloudXRLauncher for programmatic runtime lifecycle management" Apr 7, 2026
@rwiltz rwiltz marked this pull request as ready for review April 7, 2026 21:29
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 2f2bf70 to 2c6bca1 Compare April 7, 2026 21:30
@rwiltz rwiltz requested review from aristarkhovNV and nvddr April 7, 2026 21:30
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 2c6bca1 to 1de0c42 Compare April 7, 2026 21:33

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/cloudxr_tests/python/pyproject.toml`:
- Around line 10-16: The pyproject.toml test extras are missing the runtime
dependency required to import isaacteleop.cloudxr; update the
[project.optional-dependencies] entries for dev and test to include numpy with
the same constraint used elsewhere (e.g. "numpy>=1.22.2") so that creating the
isolated test environment installs numpy and allows imports of
isaacteleop.cloudxr to succeed.

In `@src/core/cloudxr/python/__main__.py`:
- Around line 43-49: Wrap the CloudXRLauncher lifetime in a context manager so
cleanup always runs: replace the plain construction plus
launcher.start()/status/log calls with a "with CloudXRLauncher(... ) as
launcher:" block, move launcher.start() and all subsequent status/log
interactions (the calls currently after start() through the shutdown/cleanup
section) inside that with-block, and rely on CloudXRLauncher.__exit__ to call
launcher.stop() so the runtime/WSS is guaranteed to be stopped on exceptions or
interrupts.
- Around line 71-82: The loop uses signal.pause() which blocks until a signal
and thus never notices when launcher.is_running becomes False; change the loop
to poll instead of blocking: in the while not stop and launcher.is_running loop
(symbols: stop, on_signal, signal.pause(), launcher.is_running) replace
signal.pause() with a short sleep or a nonblocking wait (e.g., time.sleep(0.1)
or calling a launcher.wait(timeout=...) / launcher.join with timeout if
available) so the loop wakes periodically to re-check launcher.is_running and
exit promptly when the subprocess dies.

In `@src/core/cloudxr/python/launcher.py`:
- Around line 124-125: In start(), avoid allowing check_eula() to call sys.exit
by wrapping the call to check_eula(accept_eula=...) in a try/except that catches
SystemExit and re-raises a regular exception (e.g., RuntimeError or a custom
EulaRejectedError) so the embedding process can handle the failure instead of
terminating; update the start() function where
EnvConfig.from_args(self._install_dir, self._env_config) is used and replace the
direct call to check_eula with the try/except re-raise behavior referencing
check_eula and start.
- Around line 197-219: The code only removes the "runtime_started" sentinel when
an IPC socket exists, which can leave a stale runtime_started file after a crash
and cause wait_for_runtime_ready_sync() and start() to mis-report readiness;
update launcher.py so that removal of the "runtime_started" sentinel (and
optionally "ipc_cloudxr") is performed unconditionally before launch: check for
os.path.exists(os.path.join(run_dir, "runtime_started")) and remove it (handling
FileNotFoundError), or move the existing for-loop that removes ("ipc_cloudxr",
"runtime_started") out of the ipc_socket conditional so runtime_started is
always cleared before calling wait_for_runtime_ready_sync() / start().
- Around line 163-169: The code nulls self._runtime_proc and logs "stopped" even
if _terminate_runtime() failed, losing the PID and causing is_running() to
incorrectly report False; change the flow in the block that calls
self._terminate_runtime() so you only set self._runtime_proc = None and log
"CloudXR runtime process stopped" after confirming the process is actually gone
(e.g., by checking the process object via poll()/wait()/is_running() or by
having _terminate_runtime() return success), and if _terminate_runtime() raises
keep the existing self._runtime_proc intact (and re-raise or return a failure)
so the launcher retains the handle to clean up later; update references to
_terminate_runtime, _runtime_proc, and is_running accordingly.
- Around line 274-283: The WSS thread can close its event loop and then
_stop_wss_proxy() calls loop.call_soon_threadsafe(), causing RuntimeError("Event
loop is closed"); update _stop_wss_proxy() to first check the loop state (use
loop.is_closed()) and avoid scheduling on a closed loop, and additionally wrap
call_soon_threadsafe() in a try/except RuntimeError to fall back to treating the
WSS runtime as already stopped (e.g., set/resolve stop_future or proceed to
runtime shutdown). Also mirror this defensive check and exception handling
wherever you schedule on that loop (see similar scheduling sites around lines
290-300) so that a dead WSS thread is treated as stopped rather than crashing
stop().
- Around line 146-154: The code can leave the runtime orphaned if
_start_wss_proxy() raises; modify start() so the atexit hook is registered (set
self._atexit_registered=True) before attempting to start the WSS proxy or wrap
the _start_wss_proxy(wss_log_path) call in a try/except that calls self.stop()
and ensures the atexit hook is registered, then re-raises the exception;
specifically update the block around _start_wss_proxy, logger.info, and the
atexit registration to either (a) register atexit/mark _atexit_registered before
calling _start_wss_proxy, or (b) catch exceptions from _start_wss_proxy, call
self.stop(), register atexit if not set, and re-raise so the runtime is cleaned
up on synchronous startup failures.

In `@src/core/CMakeLists.txt`:
- Around line 68-69: Wrap the CloudXR test directory inclusion behind the
BUILD_PYTHON_BINDINGS option so tests are only registered when bindings are
built: replace or surround the add_subdirectory(cloudxr_tests) call with a
conditional check on BUILD_PYTHON_BINDINGS (optionally combined with
BUILD_TESTING) in src/core/CMakeLists.txt, e.g. use if(BUILD_PYTHON_BINDINGS)
... add_subdirectory(cloudxr_tests) ... endif() to prevent CTest from adding
cloudxr_* cases when bindings are disabled.
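The signal.pause() comment above can be sketched as a polling loop. The `stop`, `on_signal`, and `launcher` names follow the symbols in the comment, but `_FakeLauncher` and the loop body here are illustrative stand-ins, not the PR's actual __main__.py code:

```python
# Sketch of the polling shutdown loop suggested above: wake periodically
# so a dead subprocess is noticed promptly, instead of blocking in
# signal.pause() until a signal happens to arrive. _FakeLauncher is a
# stand-in for the real launcher object.
import signal
import time


class _FakeLauncher:
    def __init__(self):
        self.is_running = True


launcher = _FakeLauncher()
stop = False


def on_signal(signum, frame):
    global stop
    stop = True


try:
    signal.signal(signal.SIGINT, on_signal)
    signal.signal(signal.SIGTERM, on_signal)
except ValueError:
    pass  # signal handlers can only be installed from the main thread

iterations = 0
while not stop and launcher.is_running:
    time.sleep(0.1)  # short sleep instead of signal.pause()
    iterations += 1
    if iterations >= 2:
        launcher.is_running = False  # simulate the subprocess dying
```

Because the loop re-checks launcher.is_running roughly ten times per second, it exits promptly when the subprocess dies, whereas signal.pause() would block until the next signal delivery.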

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd4178da-5c72-490b-bdfb-47e57be79dfc

📥 Commits

Reviewing files that changed from the base of the PR and between 52dd3f4 and 1de0c42.

📒 Files selected for processing (11)
  • src/core/CMakeLists.txt
  • src/core/cloudxr/python/CMakeLists.txt
  • src/core/cloudxr/python/__init__.py
  • src/core/cloudxr/python/__main__.py
  • src/core/cloudxr/python/launcher.py
  • src/core/cloudxr/python/runtime.py
  • src/core/cloudxr_tests/CMakeLists.txt
  • src/core/cloudxr_tests/python/CMakeLists.txt
  • src/core/cloudxr_tests/python/pyproject.toml
  • src/core/cloudxr_tests/python/test_launcher.py
  • src/core/cloudxr_tests/python/test_runtime.py

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 64835f7 to 4376693 Compare April 8, 2026 15:30
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 1c15134 to cbc717c Compare April 8, 2026 16:43
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from b188279 to 4b6fdb7 Compare April 8, 2026 20:19
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch 2 times, most recently from 87c0965 to 7d0a5f1 Compare April 8, 2026 22:02
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 7d0a5f1 to 389d45c Compare April 9, 2026 00:48
@jiwenc-nv
Collaborator

Code review

Found 4 issues:

  1. asyncio.run() in wait_for_runtime_ready_sync() will raise RuntimeError: This event loop is already running when called from Isaac Lab or Isaac Sim, which run persistent asyncio event loops in the main thread. The CloudXRLauncher docstring explicitly documents embedding in these frameworks as the primary use case, making this a guaranteed failure for the stated embedding API.

```python
    if the process exits early.
    """
    return asyncio.run(
        wait_for_runtime_ready(is_process_alive, timeout_sec, poll_interval_sec)
    )
```

  2. TOCTOU race in _terminate_runtime(): between proc.poll() returning None and os.getpgid(proc.pid) being called, the process can exit. os.getpgid() on a nonexistent PID raises ProcessLookupError, which is not caught at that call site — only the os.killpg() call below has a ProcessLookupError handler. This propagates uncaught out of stop(), which only catches RuntimeError.

```python
if proc is None or proc.poll() is not None:
    return
pgid = os.getpgid(proc.pid)
try:
    os.killpg(pgid, signal.SIGTERM)
```

  3. stderr=subprocess.PIPE without draining the pipe during the 30-second startup wait. wait_for_runtime_ready_sync() polls a sentinel file but never reads from the pipe. If the subprocess writes more than ~64 KB to stderr (plausible with verbose Vulkan/GPU init output), the OS pipe buffer fills, the subprocess deadlocks on its next write, and the sentinel file is never written — producing a misleading timeout failure whose root cause is hidden.

```python
self._runtime_proc = subprocess.Popen(
    [sys.executable, "-c", _RUNTIME_WORKER_CODE],
    env=os.environ.copy(),
    stderr=subprocess.PIPE,
    start_new_session=True,
)
logger.info("CloudXR runtime process started (pid=%s)", self._runtime_proc.pid)
```

  4. In stop(), if _terminate_runtime() raises RuntimeError, it is re-raised before self._runtime_proc = None is reached. The atexit handler registered during init will then call stop() again at process exit with _runtime_proc still set, attempting to re-terminate an already-dead process. Combined with issue 2 above, os.getpgid() on the stale PID raises an uncaught ProcessLookupError from the atexit handler.

```python
if self._runtime_proc is not None:
    try:
        self._terminate_runtime()
    except RuntimeError:
        logger.warning(
            "Failed to cleanly terminate CloudXR runtime process (pid=%s); "
            "handle retained for later cleanup",
            self._runtime_proc.pid,
        )
        raise
    self._runtime_proc = None
    logger.info("CloudXR runtime process stopped")
```

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
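Issues 2 and 4 above both stem from os.getpgid() racing with process exit. A race-tolerant variant can be sketched as follows; `terminate_group` is a hypothetical helper name, not the PR's actual fix:

```python
# Race-tolerant group termination: ProcessLookupError at either call
# site is treated as "the process already exited", so the TOCTOU window
# between poll() and getpgid()/killpg() cannot raise out of stop() or an
# atexit handler. terminate_group is a hypothetical helper name.
import os
import signal
import subprocess
import sys


def terminate_group(proc, sig=signal.SIGTERM):
    if proc is None or proc.poll() is not None:
        return
    try:
        pgid = os.getpgid(proc.pid)  # may raise if proc just exited
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # group already gone; treat as successfully terminated


# Demo: a sleeping child is terminated; a finished child is a no-op.
victim = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"],
    start_new_session=True,
)
terminate_group(victim)
victim.wait(timeout=5)

finished = subprocess.Popen([sys.executable, "-c", "pass"], start_new_session=True)
finished.wait()
terminate_group(finished)  # poll() is not None -> early return
```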

@rwiltz
Contributor Author

rwiltz commented Apr 9, 2026


Thanks for the review. Here's how I'm addressing each point:

  1. asyncio.run() in wait_for_runtime_ready_sync() — Good catch. Rewrote to a plain time.monotonic() / time.sleep() polling loop, so it's safe to call from contexts with a running event loop (Kit, Isaac Sim, etc.). The async version is preserved for async callers.

  2. TOCTOU race in _terminate_runtime() — Fair point. Wrapped os.getpgid() in try/except ProcessLookupError to handle the race between poll() and getpgid().

  3. stderr=subprocess.PIPE without draining — I believe this is a non-issue in practice. The runtime's run() function redirects stderr to a file via os.dup2() early in startup (when NV_CXR_FILE_LOGGING is enabled, which is the default). After the dup2, the pipe write-end is effectively closed in the child — no further output flows through it. The pre-redirect window only sees Python imports and env validation, nowhere near 64KB. The pipe exists intentionally so _collect_startup_failure_detail can capture startup tracebacks.

  4. atexit retry with stale PID — The handle retention is by design (documented in the log message). The atexit retry is desirable — it gives the process another chance to terminate. The real risk was the unfixed TOCTOU from #2, which is now addressed. Even in the worst case, an exception during atexit is a printed warning, not a crash.
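The rewrite described in point 1 can be sketched as a plain monotonic-clock polling loop; the function name and signature below are illustrative, not the PR's exact code:

```python
# Event-loop-safe readiness polling: no asyncio, so it can be called
# from a thread that already runs an event loop (Kit, Isaac Sim, ...).
# wait_for_ready_sync and its callable parameters are illustrative names.
import time


def wait_for_ready_sync(is_ready, is_process_alive,
                        timeout_sec=30.0, poll_interval_sec=0.1):
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if is_ready():
            return True
        if not is_process_alive():
            raise RuntimeError("runtime exited before becoming ready")
        time.sleep(poll_interval_sec)
    raise TimeoutError(f"runtime not ready after {timeout_sec:.1f}s")


# Demo: readiness appears on the third poll.
calls = {"n": 0}


def _ready():
    calls["n"] += 1
    return calls["n"] >= 3


result = wait_for_ready_sync(_ready, lambda: True,
                             timeout_sec=5.0, poll_interval_sec=0.01)
```

Because nothing here touches an event loop, calling it from inside a running asyncio application cannot raise "This event loop is already running".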

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 3fbcf39 to 020c003 Compare April 9, 2026 15:29
@rwiltz rwiltz merged commit 18c8e36 into main Apr 9, 2026
83 of 101 checks passed
@rwiltz rwiltz deleted the rwiltz/launch-cxr-runtime branch April 9, 2026 18:13