
Add CloudXRLauncher for programmatic runtime lifecycle management#353

Merged
rwiltz merged 8 commits into main from rwiltz/launch-cxr-runtime
Apr 9, 2026

Conversation

@rwiltz
Contributor

@rwiltz rwiltz commented Apr 3, 2026

Extracts the CloudXR runtime + WSS proxy launch logic from main.py into a reusable CloudXRLauncher class that embedding applications (e.g. Isaac Lab Teleop) can drive without a separate terminal.

  • CloudXRLauncher provides start()/stop() and context-manager APIs to manage the full runtime lifecycle programmatically
  • Runtime is spawned via subprocess.Popen with start_new_session=True to isolate it in its own process group, avoiding CUDA context conflicts with the host application
  • WSS proxy runs in a background daemon thread with a dedicated event loop
  • Stale runtime cleanup detects and removes leftover IPC sockets before launching
  • Adds wait_for_runtime_ready_sync as a synchronous polling alternative to the existing async variant
  • Promotes runtime timeout constants to public module-level symbols
  • Refactors main.py to delegate to CloudXRLauncher, keeping the CLI as a thin entry point
  • Adds cloudxr_tests test package (27 tests) covering launcher lifecycle, sentinel polling, process termination escalation, and stale cleanup
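The lifecycle described above can be sketched as a minimal, self-contained launcher following the same pattern (process-group spawn, readiness-sentinel polling, group termination with escalation). The class and helper names below are illustrative stand-ins, not the PR's actual CloudXRLauncher implementation:

```python
# Minimal sketch of the pattern described above; MiniLauncher and its
# helpers are hypothetical names, not the PR's actual implementation.
import os
import signal
import subprocess
import sys
import tempfile
import time


class MiniLauncher:
    def __init__(self, sentinel_path, timeout_sec=10.0, poll_interval_sec=0.05):
        self._sentinel = sentinel_path
        self._timeout = timeout_sec
        self._poll = poll_interval_sec
        self._proc = None

    def start(self):
        # Stale-artifact cleanup before launching, mirroring the PR's
        # stale runtime cleanup step.
        if os.path.exists(self._sentinel):
            os.remove(self._sentinel)
        # start_new_session=True puts the child in its own process group,
        # so stop() can signal the whole group without touching the host.
        child_code = (
            "import sys, time\n"
            "open(sys.argv[1], 'w').close()\n"  # write readiness sentinel
            "time.sleep(60)\n"                  # then idle like a server
        )
        self._proc = subprocess.Popen(
            [sys.executable, "-c", child_code, self._sentinel],
            start_new_session=True,
        )
        self._wait_ready_sync()

    def _wait_ready_sync(self):
        # Synchronous polling, safe even when the embedding application
        # already runs an asyncio event loop in the main thread.
        deadline = time.monotonic() + self._timeout
        while time.monotonic() < deadline:
            if os.path.exists(self._sentinel):
                return
            if self._proc.poll() is not None:
                raise RuntimeError("runtime exited before becoming ready")
            time.sleep(self._poll)
        raise TimeoutError("runtime did not become ready in time")

    def stop(self):
        if self._proc is None:
            return
        try:
            # SIGTERM the whole process group; escalate to SIGKILL below.
            os.killpg(os.getpgid(self._proc.pid), signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
        try:
            self._proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(self._proc.pid), signal.SIGKILL)
            self._proc.wait()
        self._proc = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()
        return False


# Demo: start via context manager, confirm readiness, then stop.
sentinel = os.path.join(tempfile.gettempdir(), "mini_launcher_demo_sentinel")
launcher = MiniLauncher(sentinel)
with launcher:
    sentinel_seen = os.path.exists(sentinel)
stopped = launcher._proc is None
if os.path.exists(sentinel):
    os.remove(sentinel)
```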

Summary by CodeRabbit

  • New Features

    • Introduced CloudXRLauncher class for flexible, reusable runtime lifecycle management supporting both context-manager and explicit control patterns.
  • Improvements

    • Streamlined application startup and shutdown with improved signal handling and runtime readiness coordination.
  • Tests

    • Added comprehensive test suite validating launcher initialization, runtime startup/stop, and lifecycle edge cases.

@coderabbitai

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c32b822d-bf41-4f40-8ee7-5182734cfff1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request introduces a new CloudXRLauncher class that encapsulates the lifecycle management of CloudXR runtime and WSS TLS proxy components. The launcher replaces the prior async-based orchestration in __main__.py with a class-driven approach supporting both context-manager and explicit start()/stop() patterns. Supporting utilities are added to runtime.py for synchronous readiness polling. A comprehensive test suite is introduced under a new cloudxr_tests directory with pytest configuration and unit tests for launcher and runtime functionality.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Main as __main__.py
    participant Launcher as CloudXRLauncher
    participant Runtime as Runtime Process
    participant WSS as WSS Proxy
    participant FS as Filesystem

    User->>Main: Invoke program
    Main->>Launcher: start()
    Launcher->>Launcher: Build EnvConfig
    Launcher->>Launcher: Check/accept EULA
    Launcher->>FS: Ensure logs dir exists
    Launcher->>Launcher: Cleanup stale artifacts
    Launcher->>Runtime: subprocess.Popen (start_new_session)
    Launcher->>Runtime: wait_for_runtime_ready_sync()
    Runtime->>FS: Write runtime_started sentinel
    FS->>Launcher: Sentinel detected
    Launcher->>WSS: Start WSS in daemon thread
    Launcher->>Launcher: Register stop() via atexit
    Main->>Launcher: signal.pause() (await shutdown)
    User->>Main: SIGINT/SIGTERM
    Main->>Launcher: stop()
    Launcher->>WSS: Signal stop via stop_future
    Launcher->>Runtime: os.killpg(SIGTERM)
    Launcher->>Runtime: Wait for termination
    alt Timeout
        Launcher->>Runtime: os.killpg(SIGKILL)
    end
    Runtime->>Main: Terminated
    Main->>User: Exit
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.30%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Add CloudXRLauncher for programmatic runtime lifecycle management' accurately summarizes the main change: introducing a new CloudXRLauncher class that enables programmatic lifecycle management of the CloudXR runtime. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch 2 times, most recently from 6f39cf4 to 2f2bf70 Compare April 7, 2026 21:27
@rwiltz rwiltz changed the title from "Working launcher.py script" to "Add CloudXRLauncher for programmatic runtime lifecycle management" Apr 7, 2026
@rwiltz rwiltz marked this pull request as ready for review April 7, 2026 21:29
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 2f2bf70 to 2c6bca1 Compare April 7, 2026 21:30
@rwiltz rwiltz requested review from aristarkhovNV and nvddr April 7, 2026 21:30
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 2c6bca1 to 1de0c42 Compare April 7, 2026 21:33

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/cloudxr_tests/python/pyproject.toml`:
- Around line 10-16: The pyproject.toml test extras are missing the runtime
dependency required to import isaacteleop.cloudxr; update the
[project.optional-dependencies] entries for dev and test to include numpy with
the same constraint used elsewhere (e.g. "numpy>=1.22.2") so that creating the
isolated test environment installs numpy and allows imports of
isaacteleop.cloudxr to succeed.

In `@src/core/cloudxr/python/__main__.py`:
- Around line 43-49: Wrap the CloudXRLauncher lifetime in a context manager so
cleanup always runs: replace the plain construction plus
launcher.start()/status/log calls with a "with CloudXRLauncher(... ) as
launcher:" block, move launcher.start() and all subsequent status/log
interactions (the calls currently after start() through the shutdown/cleanup
section) inside that with-block, and rely on CloudXRLauncher.__exit__ to call
launcher.stop() so the runtime/WSS is guaranteed to be stopped on exceptions or
interrupts.
- Around line 71-82: The loop uses signal.pause() which blocks until a signal
and thus never notices when launcher.is_running becomes False; change the loop
to poll instead of blocking: in the while not stop and launcher.is_running loop
(symbols: stop, on_signal, signal.pause(), launcher.is_running) replace
signal.pause() with a short sleep or a nonblocking wait (e.g., time.sleep(0.1)
or calling a launcher.wait(timeout=...) / launcher.join with timeout if
available) so the loop wakes periodically to re-check launcher.is_running and
exit promptly when the subprocess dies.

In `@src/core/cloudxr/python/launcher.py`:
- Around line 124-125: In start(), avoid allowing check_eula() to call sys.exit
by wrapping the call to check_eula(accept_eula=...) in a try/except that catches
SystemExit and re-raises a regular exception (e.g., RuntimeError or a custom
EulaRejectedError) so the embedding process can handle the failure instead of
terminating; update the start() function where
EnvConfig.from_args(self._install_dir, self._env_config) is used and replace the
direct call to check_eula with the try/except re-raise behavior referencing
check_eula and start.
- Around line 197-219: The code only removes the "runtime_started" sentinel when
an IPC socket exists, which can leave a stale runtime_started file after a crash
and cause wait_for_runtime_ready_sync() and start() to mis-report readiness;
update launcher.py so that removal of the "runtime_started" sentinel (and
optionally "ipc_cloudxr") is performed unconditionally before launch: check for
os.path.exists(os.path.join(run_dir, "runtime_started")) and remove it (handling
FileNotFoundError), or move the existing for-loop that removes ("ipc_cloudxr",
"runtime_started") out of the ipc_socket conditional so runtime_started is
always cleared before calling wait_for_runtime_ready_sync() / start().
- Around line 163-169: The code nulls self._runtime_proc and logs "stopped" even
if _terminate_runtime() failed, losing the PID and causing is_running() to
incorrectly report False; change the flow in the block that calls
self._terminate_runtime() so you only set self._runtime_proc = None and log
"CloudXR runtime process stopped" after confirming the process is actually gone
(e.g., by checking the process object via poll()/wait()/is_running() or by
having _terminate_runtime() return success), and if _terminate_runtime() raises
keep the existing self._runtime_proc intact (and re-raise or return a failure)
so the launcher retains the handle to clean up later; update references to
_terminate_runtime, _runtime_proc, and is_running accordingly.
- Around line 274-283: The WSS thread can close its event loop and then
_stop_wss_proxy() calls loop.call_soon_threadsafe(), causing RuntimeError("Event
loop is closed"); update _stop_wss_proxy() to first check the loop state (use
loop.is_closed()) and avoid scheduling on a closed loop, and additionally wrap
call_soon_threadsafe() in a try/except RuntimeError to fall back to treating the
WSS runtime as already stopped (e.g., set/resolve stop_future or proceed to
runtime shutdown). Also mirror this defensive check and exception handling
wherever you schedule on that loop (see similar scheduling sites around lines
290-300) so that a dead WSS thread is treated as stopped rather than crashing
stop().
- Around line 146-154: The code can leave the runtime orphaned if
_start_wss_proxy() raises; modify start() so the atexit hook is registered (set
self._atexit_registered=True) before attempting to start the WSS proxy or wrap
the _start_wss_proxy(wss_log_path) call in a try/except that calls self.stop()
and ensures the atexit hook is registered, then re-raises the exception;
specifically update the block around _start_wss_proxy, logger.info, and the
atexit registration to either (a) register atexit/mark _atexit_registered before
calling _start_wss_proxy, or (b) catch exceptions from _start_wss_proxy, call
self.stop(), register atexit if not set, and re-raise so the runtime is cleaned
up on synchronous startup failures.

In `@src/core/CMakeLists.txt`:
- Around line 68-69: Wrap the CloudXR test directory inclusion behind the
BUILD_PYTHON_BINDINGS option so tests are only registered when bindings are
built: replace or surround the add_subdirectory(cloudxr_tests) call with a
conditional check on BUILD_PYTHON_BINDINGS (optionally combined with
BUILD_TESTING) in src/core/CMakeLists.txt, e.g. use if(BUILD_PYTHON_BINDINGS)
... add_subdirectory(cloudxr_tests) ... endif() to prevent CTest from adding
cloudxr_* cases when bindings are disabled.
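The signal.pause() comment above can be sketched as a polling loop. The `stop`, `on_signal`, and `launcher` names follow the symbols in the comment, but `_FakeLauncher` and the loop body here are illustrative stand-ins, not the PR's actual __main__.py code:

```python
# Sketch of the polling shutdown loop suggested above: wake periodically
# so a dead subprocess is noticed promptly, instead of blocking in
# signal.pause() until a signal happens to arrive. _FakeLauncher is a
# stand-in for the real launcher object.
import signal
import time


class _FakeLauncher:
    def __init__(self):
        self.is_running = True


launcher = _FakeLauncher()
stop = False


def on_signal(signum, frame):
    global stop
    stop = True


try:
    signal.signal(signal.SIGINT, on_signal)
    signal.signal(signal.SIGTERM, on_signal)
except ValueError:
    pass  # signal handlers can only be installed from the main thread

iterations = 0
while not stop and launcher.is_running:
    time.sleep(0.1)  # short sleep instead of signal.pause()
    iterations += 1
    if iterations >= 2:
        launcher.is_running = False  # simulate the subprocess dying
```

Because the loop re-checks launcher.is_running roughly ten times per second, it exits promptly when the subprocess dies, whereas signal.pause() would block until the next signal delivery.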

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd4178da-5c72-490b-bdfb-47e57be79dfc

📥 Commits

Reviewing files that changed from the base of the PR and between 52dd3f4 and 1de0c42.

📒 Files selected for processing (11)
  • src/core/CMakeLists.txt
  • src/core/cloudxr/python/CMakeLists.txt
  • src/core/cloudxr/python/__init__.py
  • src/core/cloudxr/python/__main__.py
  • src/core/cloudxr/python/launcher.py
  • src/core/cloudxr/python/runtime.py
  • src/core/cloudxr_tests/CMakeLists.txt
  • src/core/cloudxr_tests/python/CMakeLists.txt
  • src/core/cloudxr_tests/python/pyproject.toml
  • src/core/cloudxr_tests/python/test_launcher.py
  • src/core/cloudxr_tests/python/test_runtime.py

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 64835f7 to 4376693 Compare April 8, 2026 15:30
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 1c15134 to cbc717c Compare April 8, 2026 16:43
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from b188279 to 4b6fdb7 Compare April 8, 2026 20:19
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch 2 times, most recently from 87c0965 to 7d0a5f1 Compare April 8, 2026 22:02
@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 7d0a5f1 to 389d45c Compare April 9, 2026 00:48
@jiwenc-nv
Collaborator

Code review

Found 4 issues:

  1. asyncio.run() in wait_for_runtime_ready_sync() will raise RuntimeError: This event loop is already running when called from Isaac Lab or Isaac Sim, which run persistent asyncio event loops in the main thread. The CloudXRLauncher docstring explicitly documents embedding in these frameworks as the primary use case, making this a guaranteed failure for the stated embedding API.

```python
    if the process exits early.
    """
    return asyncio.run(
        wait_for_runtime_ready(is_process_alive, timeout_sec, poll_interval_sec)
    )
```

  2. TOCTOU race in _terminate_runtime(): between proc.poll() returning None and os.getpgid(proc.pid) being called, the process can exit. os.getpgid() on a nonexistent PID raises ProcessLookupError, which is not caught at that call site — only the os.killpg() call below has a ProcessLookupError handler. This propagates uncaught out of stop(), which only catches RuntimeError.

```python
if proc is None or proc.poll() is not None:
    return
pgid = os.getpgid(proc.pid)
try:
    os.killpg(pgid, signal.SIGTERM)
```

  3. stderr=subprocess.PIPE without draining the pipe during the 30-second startup wait. wait_for_runtime_ready_sync() polls a sentinel file but never reads from the pipe. If the subprocess writes more than ~64 KB to stderr (plausible with verbose Vulkan/GPU init output), the OS pipe buffer fills, the subprocess deadlocks on its next write, and the sentinel file is never written — producing a misleading timeout failure whose root cause is hidden.

```python
self._runtime_proc = subprocess.Popen(
    [sys.executable, "-c", _RUNTIME_WORKER_CODE],
    env=os.environ.copy(),
    stderr=subprocess.PIPE,
    start_new_session=True,
)
logger.info("CloudXR runtime process started (pid=%s)", self._runtime_proc.pid)
```

  4. In stop(), if _terminate_runtime() raises RuntimeError, it is re-raised before self._runtime_proc = None is reached. The atexit handler registered during init will then call stop() again at process exit with _runtime_proc still set, attempting to re-terminate an already-dead process. Combined with issue 2 above, os.getpgid() on the stale PID raises an uncaught ProcessLookupError from the atexit handler.

```python
if self._runtime_proc is not None:
    try:
        self._terminate_runtime()
    except RuntimeError:
        logger.warning(
            "Failed to cleanly terminate CloudXR runtime process (pid=%s); "
            "handle retained for later cleanup",
            self._runtime_proc.pid,
        )
        raise
    self._runtime_proc = None
    logger.info("CloudXR runtime process stopped")
```

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
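Issues 2 and 4 above both stem from os.getpgid() racing with process exit. A race-tolerant variant can be sketched as follows; `terminate_group` is a hypothetical helper name, not the PR's actual fix:

```python
# Race-tolerant group termination: ProcessLookupError at either call
# site is treated as "the process already exited", so the TOCTOU window
# between poll() and getpgid()/killpg() cannot raise out of stop() or an
# atexit handler. terminate_group is a hypothetical helper name.
import os
import signal
import subprocess
import sys


def terminate_group(proc, sig=signal.SIGTERM):
    if proc is None or proc.poll() is not None:
        return
    try:
        pgid = os.getpgid(proc.pid)  # may raise if proc just exited
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # group already gone; treat as successfully terminated


# Demo: a sleeping child is terminated; a finished child is a no-op.
victim = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"],
    start_new_session=True,
)
terminate_group(victim)
victim.wait(timeout=5)

finished = subprocess.Popen([sys.executable, "-c", "pass"], start_new_session=True)
finished.wait()
terminate_group(finished)  # poll() is not None -> early return
```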

@rwiltz
Contributor Author

rwiltz commented Apr 9, 2026


Thanks for the review. Here's how I'm addressing each point:

  1. asyncio.run() in wait_for_runtime_ready_sync() — Good catch. Rewrote to a plain time.monotonic() / time.sleep() polling loop, so it's safe to call from contexts with a running event loop (Kit, Isaac Sim, etc.). The async version is preserved for async callers.

  2. TOCTOU race in _terminate_runtime() — Fair point. Wrapped os.getpgid() in try/except ProcessLookupError to handle the race between poll() and getpgid().

  3. stderr=subprocess.PIPE without draining — I believe this is a non-issue in practice. The runtime's run() function redirects stderr to a file via os.dup2() early in startup (when NV_CXR_FILE_LOGGING is enabled, which is the default). After the dup2, the pipe write-end is effectively closed in the child — no further output flows through it. The pre-redirect window only sees Python imports and env validation, nowhere near 64KB. The pipe exists intentionally so _collect_startup_failure_detail can capture startup tracebacks.

  4. atexit retry with stale PID — The handle retention is by design (documented in the log message). The atexit retry is desirable — it gives the process another chance to terminate. The real risk was the unfixed TOCTOU from #2, which is now addressed. Even in the worst case, an exception during atexit is a printed warning, not a crash.
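The rewrite described in point 1 can be sketched as a plain monotonic-clock polling loop; the function name and signature below are illustrative, not the PR's exact code:

```python
# Event-loop-safe readiness polling: no asyncio, so it can be called
# from a thread that already runs an event loop (Kit, Isaac Sim, ...).
# wait_for_ready_sync and its callable parameters are illustrative names.
import time


def wait_for_ready_sync(is_ready, is_process_alive,
                        timeout_sec=30.0, poll_interval_sec=0.1):
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if is_ready():
            return True
        if not is_process_alive():
            raise RuntimeError("runtime exited before becoming ready")
        time.sleep(poll_interval_sec)
    raise TimeoutError(f"runtime not ready after {timeout_sec:.1f}s")


# Demo: readiness appears on the third poll.
calls = {"n": 0}


def _ready():
    calls["n"] += 1
    return calls["n"] >= 3


result = wait_for_ready_sync(_ready, lambda: True,
                             timeout_sec=5.0, poll_interval_sec=0.01)
```

Because nothing here touches an event loop, calling it from inside a running asyncio application cannot raise "This event loop is already running".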

@rwiltz rwiltz force-pushed the rwiltz/launch-cxr-runtime branch from 3fbcf39 to 020c003 Compare April 9, 2026 15:29
@rwiltz rwiltz merged commit 18c8e36 into main Apr 9, 2026
83 of 101 checks passed
@rwiltz rwiltz deleted the rwiltz/launch-cxr-runtime branch April 9, 2026 18:13