
bug: Excessive RAM overhead in Cortex when loading a model #4727

Open
1 of 3 tasks
mtomas7 opened this issue Feb 24, 2025 · 2 comments
Labels
type: bug Something isn't working

Comments


mtomas7 commented Feb 24, 2025

Jan version

0.5.15 Win11

Describe the Bug

At first it looked to me like Cortex loads the model twice. Jan's memory usage got out of control, approaching double the size of the model. I loaded Mistral-Small-24B-Instruct-2501-Q8_0, which is 23.33 GB, and after loading, memory went up by 38 GB. That's 14.67 GB of overhead!

Perhaps it loads the model with a very big context window that balloons the memory usage?

I used to run some models on my laptop with LM Studio and they worked without issues, but today I tried them with Jan and they failed due to lack of memory. I then tried to load a small 3B model that usually takes 3 GB of RAM with LM Studio and noticed that after loading the model in Jan, my RAM usage increased by ~6 GB. So a laptop with 16 GB of RAM now cannot load a 7 GB model :(

Many users with budget systems at home are clearly at a disadvantage because of this memory overhead.
It also causes a "Model failed to load" error for users who think they have enough RAM to run the model. In my experience, the industry rule of thumb for memory (VRAM + RAM) requirements is model size + 1 GB.
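
For a back-of-the-envelope check, the gap would make sense if llama.cpp is allocating an fp16 KV cache for a very large default context window. A rough sketch, assuming the published Mistral-Small-24B architecture (40 layers, 8 KV heads, head dim 128) and ignoring compute buffers and other per-backend overhead:

# Rough estimate of llama.cpp KV cache size:
# 2 (K and V) * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_element.
# The architecture numbers are assumptions taken from the Mistral-Small-24B model card.

GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2.0):
    """KV cache size in bytes (fp16 cache = 2 bytes per element by default)."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

model_file_gib = 23.33  # Q8_0 GGUF file size reported above

for ctx in (8192, 32768, 65536):
    cache_gib = kv_cache_bytes(40, 8, 128, ctx) / GIB
    print(f"ctx={ctx:>6}: KV cache ≈ {cache_gib:5.2f} GiB, "
          f"file + cache ≈ {model_file_gib + cache_gib:5.2f} GiB")

At an 8192 context that is only ~1.25 GiB of cache, but at a 64K default it is ~10 GiB before compute buffers, which goes a long way toward explaining an overhead far beyond "model size + 1 GB".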

Steps to Reproduce

No response

Screenshots / Logs

No response

What is your OS?

  • MacOS
  • Windows ✓
  • Linux
@mtomas7 mtomas7 added the type: bug Something isn't working label Feb 24, 2025
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Feb 24, 2025
@mtomas7 mtomas7 changed the title bug: Cortex loads the model twice? bug: excessive RAM overhead in Cortex when loading a model Feb 26, 2025
@mtomas7 mtomas7 changed the title bug: excessive RAM overhead in Cortex when loading a model bug: Excessive RAM overhead in Cortex when loading a model Feb 26, 2025
louis-menlo (Contributor) commented

Here is a rough estimate. Could you share the current model and llama.cpp settings? We are planning to integrate a more accurate estimation tool, rearrange the settings, and provide better guidance on which settings have which side effects and what the benefit of each is. E.g. disabling the cache or lowering the KV cache quantization level reduces memory consumption but is slower.

[attached image]
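
To illustrate the KV cache trade-off, a minimal sketch, assuming the same Mistral-Small-24B layer/head counts as above and approximate effective densities for llama.cpp's f16, q8_0, and q4_0 cache types (the per-block scales make the quantized figures approximate):

# Sketch: how the KV cache type affects memory at a fixed context length.
GIB = 1024 ** 3
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128  # assumed Mistral-Small-24B values
CTX_LEN = 8192

# Approximate effective bytes per cached element for each cache type.
cache_types = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

for name, bytes_per_elem in cache_types.items():
    size = 2 * N_LAYERS * CTX_LEN * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    print(f"{name:>5} KV cache at ctx={CTX_LEN}: ~{size / GIB:.2f} GiB")

Quantizing the cache to q8_0 roughly halves its footprint and q4_0 roughly quarters it, at some cost in speed and quality, which is the trade-off described above.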


mtomas7 commented Feb 27, 2025

64K is a pretty high context length; I was using 8192. I will check how LM Studio and AnythingLLM deal with memory to have a comparison.
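
One way to make that comparison apples-to-apples is to read the resident set size (RSS) of each app's inference process rather than the overall system number. A sketch using psutil (the process-name fragments below are placeholders, not the apps' actual binary names):

# Sketch: report resident memory (RSS) of running inference backends for comparison.
import psutil

CANDIDATES = ("llama", "cortex", "lmstudio")  # placeholder name fragments

for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    if any(frag in name for frag in CANDIDATES):
        mem = proc.info["memory_info"]
        if mem is None:  # access denied on some system processes
            continue
        print(f"{proc.info['name']}: RSS ≈ {mem.rss / 1024 ** 3:.2f} GiB")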

My settings (although I see this memory issue with any model):

[attached images: model settings]

# BEGIN GENERAL GGUF METADATA
id: Mistral-Small-24B-Instruct-2501 # Model ID unique between models (author / quantization)
model: Mistral-Small-24B-Instruct-2501-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Mistral-Small-24B-Instruct-2501-Q8_0 # metadata.general.name
version: 2
files:             # Can be relative OR absolute local file path
  - E:\AI\models\bartowski\Mistral-Small-24B-Instruct-2501-GGUF\Mistral-Small-24B-Instruct-2501-Q8_0.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - </s>
# END REQUIRED

# BEGIN OPTIONAL
size: 25054779072
stream: true # Default true?
top_p: 0.95 # Ranges: 0 to 1
temperature: 0.15 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Should be default to context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
min_keep: 0
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
engine: llama-cpp # engine to run model
prompt_template: "[INST] {system_message}\n[INST] {prompt} [/INST]"
# END REQUIRED

# BEGIN OPTIONAL
ctx_len: 8192 # llama.context_length | 0 or undefined = loaded from model
n_parallel: 1
ngl: 41 # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS
