Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Segmentation fault #1133

Open
niehen6174 opened this issue Nov 5, 2024 · 1 comment
Open

[Bug] Segmentation fault #1133

niehen6174 opened this issue Nov 5, 2024 · 1 comment
Labels
Request-bug Something isn't working

Comments

@niehen6174
Copy link

Your current environment information

ibibverbs not available, ibv_fork_init skipped
Collecting environment information...
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OneFlow version: path: ['/opt/conda/lib/python3.10/site-packages/oneflow'], version: 0.9.1.dev20241019+cu118, git_commit: d23c061, cmake_build_type: Release, rdma: True, mlir: True, enterprise: False
Nexfort version: none
OneDiff version: 1.2.1.dev15+g241fe57d
OneDiffX version: none

GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.28

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40
Nvidia driver version: 525.125.06
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.7
/usr/lib64/libcudnn_adv_infer.so.8.9.7
/usr/lib64/libcudnn_adv_train.so.8.9.7
/usr/lib64/libcudnn_cnn_infer.so.8.9.7
/usr/lib64/libcudnn_cnn_train.so.8.9.7
/usr/lib64/libcudnn_ops_infer.so.8.9.7
/usr/lib64/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9K84 96-Core Processor
Stepping: 0
CPU MHz: 2600.034
BogoMIPS: 5200.06
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-191
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm

Versions of relevant libraries:
[pip3] diffusers==0.30.3
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.19.2
[pip3] onnxruntime-gpu==1.18.0
[pip3] open-clip-torch==2.20.0
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchsde==0.2.6
[pip3] torchvision==0.16.1
[pip3] transformers==4.44.2
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.10 py310h5eee18b_0
[conda] mkl_random 1.2.7 py310h1128e8f_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] open-clip-torch 2.20.0 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.1.1 pypi_0 pypi
[conda] torchaudio 2.1.1 py310_cu121 pytorch
[conda] torchsde 0.2.6 pypi_0 pypi
[conda] torchvision 0.16.1 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi

🐛 Describe the bug

I encountered a Segmentation fault issue while using OneDiff in ComfyUI, with no additional error information. I am seeking some assistance.

I encountered two types of Segmentation fault errors:

Stack trace (most recent call last) in thread 256218: 
Segmentation fault (Signal sent by the kernel [(nil)]) 

Stack trace (most recent call last) in thread 417043: 
Segmentation fault (Address not mapped to object [(nil)])

After testing, this error is unrelated to a single workflow and is also unrelated to previously executed workflows. I currently have no leads and have not been able to reproduce it again; the occurrence seems to be quite random.

@niehen6174 niehen6174 added the Request-bug Something isn't working label Nov 5, 2024
@niehen6174
Copy link
Author

This error is likely originating from OneFlow, and I have also seen similar issues in other issues(#393 #1080 ), but none of them have provided a solution.

@strint Could you please take a look at this, or do you have any suggestions for a solution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Request-bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant