
Running 'example.py' in a Docker container can cause the system to freeze or be forced to shut down #22

Open
ziyaxuanyi opened this issue Nov 13, 2024 · 13 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@ziyaxuanyi

I am running it in a Docker container, and nunchaku compiled successfully.

I replaced the schnell model in example.py with the dev model and ran it.
When the program reaches the point of loading svdq-int4-flux.1-dev.safetensors and prints "Done",
the system freezes or is forced to shut down.

A warning about very high system memory usage appears when it freezes or shuts down.

Does the current version of the code require an extremely large amount of memory, and roughly how much memory is needed?
Or could there be a memory-management or memory-leak issue?
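
(For reference, a minimal sketch of how to measure the peak host RAM the load actually uses; it assumes psutil is installed, and the load call itself is a placeholder for whatever example.py does.)

import threading
import time

import psutil  # assumption: `pip install psutil`

def watch_peak_rss(stop_event, interval=0.5):
    """Poll this process's resident set size so the peak during model load is visible."""
    proc = psutil.Process()
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    print(f"Peak RSS during load: {peak / 1024 ** 3:.1f} GiB")

stop_event = threading.Event()
watcher = threading.Thread(target=watch_peak_rss, args=(stop_event,), daemon=True)
watcher.start()

# ... load the pipeline here, as in example.py ...

stop_event.set()
watcher.join()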

@nitinmukesh

nitinmukesh commented Nov 13, 2024

Are you facing this issue all the time, or does it happen only occasionally?

I am running on 8 GB VRAM / Windows 11 and it happens randomly, during model loading; if the model loads, inference is smooth.

@ziyaxuanyi (Author)

Are you facing this issue all the time, or does it happen only occasionally?

I am running on 8 GB VRAM / Windows 11 and it happens randomly, during model loading; if the model loads, inference is smooth.

Always.

@nitinmukesh

nitinmukesh commented Nov 13, 2024

Try this:
After this code, add CPU offload.

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

pipeline.enable_sequential_cpu_offload()
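
(For context, a complete, runnable version of that snippet might look roughly like this. The import path follows the repo's example.py but may differ between versions, so treat it as a sketch rather than the exact API.)

import torch
from nunchaku.pipelines import flux as nunchaku_flux  # assumption: import path as in example.py

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

# Stream layers between CPU and GPU on demand; slower, but keeps VRAM usage low.
pipeline.enable_sequential_cpu_offload()

image = pipeline(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux.1-dev.png")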

@ziyaxuanyi (Author)

Try this: After this code, add CPU offload.

pipeline = nunchaku_flux.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    qmodel_path="mit-han-lab/svdquant-models/svdq-int4-flux.1-dev.safetensors",
)

pipeline.enable_sequential_cpu_offload()

But I don't think it's a VRAM problem. I have 24 GB of VRAM, which should be sufficient. It's more likely a problem on the CPU side.

@lmxyy (Collaborator)

lmxyy commented Dec 8, 2024

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.
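
(To illustrate the general technique rather than the repo's exact change: constructing modules under PyTorch's meta device avoids allocating real 16-bit weights in host RAM. The module below is just a stand-in for a DiT block.)

import torch
import torch.nn as nn

# Building under the 'meta' device allocates no real storage, so a full
# 16-bit copy of the model never has to live in CPU RAM.
with torch.device("meta"):
    block = nn.Linear(4096, 4096, dtype=torch.bfloat16)  # stand-in for a DiT layer

print(block.weight.device)  # meta: parameters exist only as shapes and dtypes
# Real (e.g. quantized) weights can then be materialized directly on the GPU,
# for example via block.to_empty(device="cuda") followed by loading tensors.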

@ziyaxuanyi (Author)

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.

Great! That solved my problem. Thank you for your great work!

@ziyaxuanyi (Author)

I see. This issue seems similar to #40. Previously, we instantiated the 16-bit DiT on the CPU, so it consumed about 24 GB of RAM. We have optimized the process, and it now consumes only ~10 GB of RAM. You can pull the latest code and try again.

Great! That solved my problem. Thank you for your great work!

@lmxyy
Unfortunately, I found that the problem still exists.
When I switch to a different machine, the error still occurs with the same environment and code.
So far, I have successfully deployed and tested example.py on only one machine; it has failed on several other machines, and all of them show this same error.
I think there may be serious bugs in your CUDA-level implementation code.

@sxtyzhangzk (Collaborator)

Hi,

Are you using Docker on Windows or on native Linux?
Does rebooting solve the problem temporarily?

@ziyaxuanyi (Author)

Hi,

Are you using Docker on Windows or on native Linux? Does rebooting solve the problem temporarily?

Using Docker on native Linux.
No, it doesn't.

@ziyaxuanyi (Author)

I have discovered another issue.
Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

@sxtyzhangzk (Collaborator)

I have discovered another issue. Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

Good finding. Could you look into kern.log (or use dmesg) and see what is happening in the kernel?
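
(If it helps, a small sketch like the following can show how fast the log is growing and which kernel messages are flooding it. The path and dmesg flags are assumptions about a typical Linux host; adjust as needed.)

import os
import subprocess
import time

LOG = "/var/log/kern.log"  # assumption: same path reported above

# Sample the file size a few times to estimate the growth rate.
prev = os.path.getsize(LOG)
for _ in range(6):
    time.sleep(5)
    size = os.path.getsize(LOG)
    print(f"kern.log grew by {(size - prev) / 1024:.0f} KiB in the last 5 s")
    prev = size

# Print the most recent warning/error kernel messages with timestamps.
out = subprocess.run(["dmesg", "--level=warn,err", "-T"],
                     capture_output=True, text=True).stdout
print(out[-2000:])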

@ziyaxuanyi (Author)

ziyaxuanyi commented Dec 16, 2024

I have discovered another issue. Running example.py causes the system's log files, /var/log/kern.log and /var/log/messages, to grow rapidly, eventually exhausting disk space and crashing the container.

Good finding. Could you look into kern.log (or use dmesg) and see what is happening in the kernel?

For work reasons, I can't post the full log.
I can only summarize it briefly as follows:
node-10 kernel:[581963.134759] WARNING: CPU: 95 PID: 74299 at fs/fuse/file.c:1626 fuse_write_file_get.part.24+0x11/0x14 [fuse]
node-10 kernel:[581963.134760] Modules linked in: nvidia_uvm(OE) nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) ipt_rpfilter ipttable_raw ..............................(A lot of information like this, which I will omit later)
node-10 kernel:[581963.134782] i2c_i801 mfd_core usb_common [last unloaded: vfio]
node-10 kernel:[581963.134784] CPU: .......................................
node-10 kernel:[581963.134784] Hardware name: .......................................
node-10 kernel:[581963.134786] RIP: 0010:fuse_write_file_get.part.24+0x11/0x14
node-10 kernel:[581963.134786] Code: c7 48 b3 59 .......................................
node-10 kernel:[581963.134787] RSP: .......................................
node-10 kernel:[581963.134787] RAX: .......................................
node-10 kernel:[581963.134788] RDX: .......................................
node-10 kernel:[581963.134789] RBP: .......................................
node-10 kernel:[581963.134789] R10: .......................................
node-10 kernel:[581963.134789] R13: .......................................
.......................................

@weilu4606

I'm hitting this problem too. If I run the program, SSH on port 22 gets shut down.

@lmxyy added the bug and help wanted labels on Jan 14, 2025