
Running 'example.py' in a Docker container can cause the system to freeze or be forced to shut down #22

Open
ziyaxuanyi opened this issue Nov 13, 2024 · 13 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@ziyaxuanyi

I am running it in a Docker container, and nunchaku compiled successfully.

I replaced the schnell model in example.py with the dev model and ran it.
When the program reaches the point of loading svdq-int4-flux.1-dev.safetensors and prints "Done",
the system freezes or is forced to shut down.

A warning about very high system memory usage appears when it freezes or shuts down.

Does the current version of the code require an extremely large amount of memory, and roughly how much memory is needed?
Or could there be a memory-management or memory-leak issue?
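
(For reference, a minimal sketch of how to measure the peak host RAM the load actually uses; it assumes psutil is installed, and the load call itself is a placeholder for whatever example.py does.)

import threading
import time

import psutil  # assumption: `pip install psutil`

def watch_peak_rss(stop_event, interval=0.5):
    """Poll this process's resident set size so the peak during model load is visible."""
    proc = psutil.Process()
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    print(f"Peak RSS during load: {peak / 1024 ** 3:.1f} GiB")

stop_event = threading.Event()
watcher = threading.Thread(target=watch_peak_rss, args=(stop_event,), daemon=True)
watcher.start()

# ... load the pipeline here, as in example.py ...

stop_event.set()
watcher.join()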

@nitinmukesh

nitinmukesh commented Nov 13, 2024

Are you facing this issue all the time, or does it happen only occasionally?

I am running on 8 GB VRAM / Windows 11 and it happens randomly, during model loading; if the model loads, inference is smooth.

@ziyaxuanyi (Author)

Are you facing this issue all the time, or does it happen only occasionally?

I am running on 8 GB VRAM / Windows 11 and it happens randomly, during model loading; if the model loads, inference is smooth.

Always.

@nitinmukesh

nitinmukesh commented Nov 13, 2024

Try this:
After this code, add CPU offload.

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

pipeline.enable_sequential_cpu_offload()
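
(For context, a complete, runnable version of that snippet might look roughly like this. The import path follows the repo's example.py but may differ between versions, so treat it as a sketch rather than the exact API.)

import torch
from nunchaku.pipelines import flux as nunchaku_flux  # assumption: import path as in example.py

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

# Stream layers between CPU and GPU on demand; slower, but keeps VRAM usage low.
pipeline.enable_sequential_cpu_offload()

image = pipeline(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux.1-dev.png")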

@ziyaxuanyi (Author)

Try this: After this code, add CPU offload.

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

pipeline.enable_sequential_cpu_offload()

But I don't think it's a VRAM problem. I have 24 GB of VRAM, which should be sufficient. It's more likely a problem on the CPU side.

@lmxyy (Collaborator)

lmxyy commented Dec 8, 2024

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.
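
(To illustrate the general technique rather than the repo's exact change: constructing modules under PyTorch's meta device avoids allocating real 16-bit weights in host RAM. The module below is just a stand-in for a DiT block.)

import torch
import torch.nn as nn

# Building under the 'meta' device allocates no real storage, so a full
# 16-bit copy of the model never has to live in CPU RAM.
with torch.device("meta"):
    block = nn.Linear(4096, 4096, dtype=torch.bfloat16)  # stand-in for a DiT layer

print(block.weight.device)  # meta: parameters exist only as shapes and dtypes
# Real (e.g. quantized) weights can then be materialized directly on the GPU,
# for example via block.to_empty(device="cuda") followed by loading tensors.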

@ziyaxuanyi (Author)

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.

Great! That solved my problem. Thank you for your great work!

@ziyaxuanyi (Author)

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.

Great! That solved my problem. Thank you for your great work!

@lmxyy
Unfortunately, I found that the problem still exists.
When I switch to a different machine, the error still occurs with the same environment and code.
So far, I have successfully deployed and tested example.py on only one machine; it has failed on several other machines, and all of them show this same error.
I think there may be serious bugs in your CUDA-level implementation code.

@sxtyzhangzk (Collaborator)

Hi,

Are you using Docker on Windows or on native Linux?
Does rebooting solve the problem temporarily?

@ziyaxuanyi (Author)

Hi,

Are you using Docker on Windows or on native Linux? Does rebooting solve the problem temporarily?

Using Docker on native Linux.
No, it doesn't.

@ziyaxuanyi (Author)

I have discovered another issue.
Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

@sxtyzhangzk (Collaborator)

I have discovered another issue. Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

Good finding. Could you look into kern.log (or use dmesg) and see what is happening in the kernel?
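
(If it helps, a small sketch like the following can show how fast the log is growing and which kernel messages are flooding it. The path and dmesg flags are assumptions about a typical Linux host; adjust as needed.)

import os
import subprocess
import time

LOG = "/var/log/kern.log"  # assumption: same path reported above

# Sample the file size a few times to estimate the growth rate.
prev = os.path.getsize(LOG)
for _ in range(6):
    time.sleep(5)
    size = os.path.getsize(LOG)
    print(f"kern.log grew by {(size - prev) / 1024:.0f} KiB in the last 5 s")
    prev = size

# Print the most recent warning/error kernel messages with timestamps.
out = subprocess.run(["dmesg", "--level=warn,err", "-T"],
                     capture_output=True, text=True).stdout
print(out[-2000:])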

@ziyaxuanyi (Author)

ziyaxuanyi commented Dec 16, 2024

I have discovered another issue. Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

Good finding. Could you look into kern.log (or use dmesg) and see what is happening in the kernel?

For work reasons, I can't post the full log.
I can only summarize it briefly as follows:
node-10 kernel:[581963.134759] WARNING: CPU: 95 PID: 74299 at fs/fuse/file.c:1626 fuse_write_file_get.part.24+0x11/0x14 [fuse]
node-10 kernel:[581963.134760] Modules linked in: nvidia_uvm(OE) nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) ipt_rpfilter ipttable_raw ..............................(A lot of information like this, which I will omit later)
node-10 kernel:[581963.134782] i2c_i801 mfd_core usb_common [last unloaded: vfio]
node-10 kernel:[581963.134784] CPU: .......................................
node-10 kernel:[581963.134784] Hardware name: .......................................
node-10 kernel:[581963.134786] RIP: 0010:fuse_write_file_get.part.24+0x11/0x14
node-10 kernel:[581963.134786] Code: c7 48 b3 59 .......................................
node-10 kernel:[581963.134787] RSP: .......................................
node-10 kernel:[581963.134787] RAX: .......................................
node-10 kernel:[581963.134788] RDX: .......................................
node-10 kernel:[581963.134789] RBP: .......................................
node-10 kernel:[581963.134789] R10: .......................................
node-10 kernel:[581963.134789] R13: .......................................
.......................................

@weilu4606

I'm hitting this problem too. If I run the program, SSH on port 22 gets shut down.

@lmxyy added the bug and help wanted labels on Jan 14, 2025