[ERROR] Worker (pid:25134) was sent SIGKILL! Perhaps out of memory? #556

Open
UTSAV-44 opened this issue Jul 18, 2024 · 12 comments

@UTSAV-44

UTSAV-44 commented Jul 18, 2024

Hi, turboderp!

I am using an A10 GPU with 24 GB of VRAM for inference with Llama 3. I am running gunicorn with a worker count of 2, but it fails with "Perhaps out of memory?". It is only using 13 GB out of 24 GB, yet it still reports running out of VRAM.

@remichu-ai

I think setting workers to 2 will double the GPU memory requirement: 13 * 2 > 24.
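
A minimal sketch of the setting in question, assuming the worker count lives in a gunicorn.conf.py file (the file name and values here are hypothetical for this deployment):

# gunicorn.conf.py -- hypothetical config illustrating the point above.
# Each gunicorn worker is a separate process that loads its own copy of the model,
# so VRAM usage scales with the worker count: 2 workers x ~13 GB > 24 GB.
workers = 2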

@UTSAV-44
Author

I think setting workers to 2 will double the GPU memory requirement: 13 * 2 > 24.

I have observed that 13 GB is used when the worker count is 2.

@turboderp
Owner

turboderp commented Jul 18, 2024

There is a known issue with safetensors that only shows up on some systems. Windows especially suffers from it, but I've seen it reported on some Linux systems as well. I think it has to do with memory mapping not working properly when you have too many files open at once, or something like that.

There is an option to bypass safetensors when loading models. It can be enabled with -fst on the command line, by setting the EXLLAMA_FASTTENSORS env variable, or by setting config.fasttensors = True in Python.
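
A minimal sketch of the Python-side routes, assuming the config API mentioned above (the model path and the env-variable value are placeholders/assumptions):

import os

# Env-variable route: set before the model is loaded. The variable name is as given
# above; the exact value checked is an assumption here.
os.environ["EXLLAMA_FASTTENSORS"] = "1"

# Python route: set the flag on the config object before loading.
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")  # placeholder path
config.fasttensors = True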

@UTSAV-44
Author

UTSAV-44 commented Jul 18, 2024

Does it depend on the NVIDIA driver version and CUDA version? At present, Driver Version: 535.183.01 and CUDA Version: 12.2. We are running it on Ubuntu 22.04.

@turboderp
Owner

No, it's an issue with safetensors and/or possibly the OS kernel. Try using one of the options above to see if it helps.

@UTSAV-44
Author

I tried setting config.fasttensors = True, but it did not work. I also tried this on a g4dn.xlarge instance, but the model does not load.

@turboderp
Owner

Can you share the code that fails? The config option has to be set after config.prepare() is called but before model.load().
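
A minimal sketch of that ordering, assuming the explicit-prepare() style of building the config (the path is a placeholder, and load() stands in for whichever load call is used):

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder
config.prepare()                     # read the model's config first
config.fasttensors = True            # set AFTER prepare() ...

model = ExLlamaV2(config)
model.load()                         # ... and BEFORE load()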

@UTSAV-44
Author

from exllamav2 import ExLlamaV2, ExLlamaV2Cache_Q4, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config(model_dir)
config.fasttensors = True
self.model = ExLlamaV2(config)

# Q4 cache, allocated lazily so the model can be loaded with autosplit
self.cache = ExLlamaV2Cache_Q4(self.model, max_seq_len=256*96, lazy=True)
self.model.load_autosplit(self.cache, progress=True)

print("Loading tokenizer...")
self.tokenizer = ExLlamaV2Tokenizer(config)

self.generator = ExLlamaV2DynamicGenerator(
    model=self.model,
    cache=self.cache,
    tokenizer=self.tokenizer,
)

self.generator.warmup()

I am running it on Kubernetes with a g5.xlarge GPU instance.

@turboderp
Owner

I'm not sure there's any way to prevent PyTorch from using a lot of virtual memory. But just out of interest, what do you get from the following?

cat /proc/sys/vm/overcommit_memory
ulimit -v

@UTSAV-44
Author

For cat /proc/sys/vm/overcommit_memory I got 1, and for ulimit -v I got unlimited.

@turboderp
Owner

I'm not sure about the implications actually, but I think you might want to try changing the overcommit mode.

sudo sysctl vm.overcommit_memory=0

or

sudo sysctl vm.overcommit_memory=2

🤷

@brthor

brthor commented Aug 17, 2024

@turboderp
EDIT: The SIGKILL issue I detailed here was caused by serialization of exllama state by the Hugging Face datasets.map() function when the exllama model is pre-initialized, and is unrelated to exllama.

Reducing the cache size appeared to help because the cache state was being serialized.

If anyone else hits this issue, passing new_fingerprint='some_rnd_str' to datasets.map() will prevent the serialization.
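
A minimal sketch of that workaround, assuming a Hugging Face datasets.Dataset; the dataset name and mapping function are hypothetical placeholders:

from datasets import load_dataset

def process_batch(batch):
    # Hypothetical stand-in for a function that closes over the pre-initialized
    # exllama model/cache.
    return batch

ds = load_dataset("some/dataset", split="train")  # placeholder dataset

# Passing new_fingerprint lets datasets skip hashing/serializing the function's
# closure (which would otherwise pull in the exllama state), per the note above.
ds = ds.map(process_batch, batched=True, new_fingerprint="some_rnd_str")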
