Workstation Freezes During Training Sessions in OmniIsaacGymEnvs 2023.1.1 #172

Open
Wangshengyang2004 opened this issue Jun 25, 2024 · 0 comments

Issue Description

I experience frequent freezes on my workstation during training sessions with OmniIsaacGymEnvs, specifically when training the Crazyflie task with a modified reward function in headless mode with multi-GPU training. Each freeze requires a full reboot. Notably, there is noticeable input lag (mouse and keyboard) before the freezes, and the GPUs emit continuous impulse sounds, indicating high activity.
Attachment: nvidia-bug-report.log.gz

Environment

  • Operating System: Ubuntu 22.04 (latest update)
  • OmniIsaacGymEnvs Version: 2023.1.1
  • Python Version: Python 3.10 (Isaac Sim's Interpreter)
  • Hardware: Intel i9-14900K, 128GB DDR5 RAM, dual NVIDIA RTX 3090 GPUs, NVIDIA driver 555 (CUDA 12.5)

Steps to Reproduce

  1. Run the Crazyflie task with the modified reward function in headless mode on a multi-GPU setup (a launch-command sketch is given after this list).
  2. Observe continuous GPU activity and eventual system freeze, requiring a reboot.
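
For reference, the run is launched roughly as follows (a sketch based on the standard OIGE multi-GPU invocation; the task name and script path reflect my setup and may need adjusting):

  # launch two workers, one per GPU, without the viewer
  PYTHON_PATH -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
      omniisaacgymenvs/scripts/rlgames_train.py task=Crazyflie headless=True multi_gpu=True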

Expected Behavior

The system should handle training without significant performance degradation or freezing. On another workstation with lower specifications (Intel Xeon W-2150B, 128GB DDR4 RAM, single RTX A6000 GPU), the same training produces only minor lag and no system freezes.

Actual Behavior

The system freezes during training and I am unable to interact with it at all; the display cannot be recovered even after reconnecting the HDMI cable. I attempted to reduce system load by closing applications such as the Edge browser, VPN, and VS Code, but the issue persists.

Additional Information

Attempting to update Isaac Sim to version 4.0.0 and use the latest OIGE repository resulted in errors related to unloading CUDA module data, indicating potential compatibility or stability issues with the newer versions:

ed 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,314ms] [Warning] [carb] Recursive unloadAllPlugins() detected!
There was an error running python
(simple_eureka) simonwsy@simonwsy-Z790-UD:~/.local/share/ov/pkg/isaac-sim-4.0.0/OmniIsaacGy

Possible Solutions

  • It seems unlikely that the issue is hardware incompatibility with Ubuntu 22.04 since simultaneous stress tests on CPU and GPU (cpu-burner and gpu-burner) did not replicate the freezing or lagging.
  • Possible issues with process handling in multi-GPU setups, as evidenced by occasional errors from torch.distributed about port allocation, suggesting that worker processes might not be terminating correctly (see the cleanup sketch after this list).
  • Reinstalling Isaac Sim could be a potential fix, though it does not address the errors becoming more frequent over time.
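
As a workaround for the port-allocation errors, I clean up stale training processes between runs; a rough sketch of what I run (the process-name pattern and the alternative port number are assumptions based on my setup):

  # check whether previous training workers are still alive and holding the GPUs
  ps aux | grep -i rlgames_train | grep -v grep
  nvidia-smi

  # kill any stale workers before relaunching
  pkill -f rlgames_train.py

  # or pick a different rendezvous port for torch.distributed.run
  PYTHON_PATH -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=29501 \
      omniisaacgymenvs/scripts/rlgames_train.py task=Crazyflie headless=True multi_gpu=True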

I am looking for guidance on whether this issue is known and if there are recommended settings or configurations that could mitigate these problems.
