
bug: Excessive RAM overhead in Cortex when loading a model #4727

Open
1 of 3 tasks
mtomas7 opened this issue Feb 24, 2025 · 2 comments
Labels
type: bug Something isn't working

Comments


mtomas7 commented Feb 24, 2025

Jan version

0.5.15 Win11

Describe the Bug

At first it looked to me like Cortex loads the model twice. Jan's memory usage got out of control, approaching double the size of the model. I loaded Mistral-Small-24B-Instruct-2501-Q8_0, which is 23.33 GB, and after loading, memory went up by 38 GB. That's 14.67 GB of overhead!

Perhaps it loads the model with a very big context window that balloons the memory usage?

I used to run some models on my laptop with LM Studio and they worked without issues, but today I tried them with Jan and they failed due to lack of memory. I then tried to load a small 3B model that usually takes 3 GB of RAM with LM Studio and noticed that after loading the model in Jan, my RAM usage increased by ~6 GB. So a laptop with 16 GB of RAM now cannot load a 7 GB model :(

Many users with budget systems at home are clearly at a disadvantage because of this memory overhead.
It also causes a "Model failed to load" error for users who think they have enough RAM to run the model. In my experience, the industry rule of thumb for memory (VRAM + RAM) requirements is model size + 1 GB.
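
For a back-of-the-envelope check, the gap would make sense if llama.cpp is allocating an fp16 KV cache for a very large default context window. A rough sketch, assuming the published Mistral-Small-24B architecture (40 layers, 8 KV heads, head dim 128) and ignoring compute buffers and other per-backend overhead:

# Rough estimate of llama.cpp KV cache size:
# 2 (K and V) * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_element.
# The architecture numbers are assumptions taken from the Mistral-Small-24B model card.

GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2.0):
    """KV cache size in bytes (fp16 cache = 2 bytes per element by default)."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

model_file_gib = 23.33  # Q8_0 GGUF file size reported above

for ctx in (8192, 32768, 65536):
    cache_gib = kv_cache_bytes(40, 8, 128, ctx) / GIB
    print(f"ctx={ctx:>6}: KV cache ≈ {cache_gib:5.2f} GiB, "
          f"file + cache ≈ {model_file_gib + cache_gib:5.2f} GiB")

At an 8192 context that is only ~1.25 GiB of cache, but at a 64K default it is ~10 GiB before compute buffers, which goes a long way toward explaining an overhead far beyond "model size + 1 GB".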

Steps to Reproduce

No response

Screenshots / Logs

No response

What is your OS?

  • MacOS
  • Windows ✓
  • Linux
@mtomas7 mtomas7 added the type: bug Something isn't working label Feb 24, 2025
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Feb 24, 2025
@mtomas7 mtomas7 changed the title bug: Cortex loads the model twice? bug: excessive RAM overhead in Cortex when loading a model Feb 26, 2025
@mtomas7 mtomas7 changed the title bug: excessive RAM overhead in Cortex when loading a model bug: Excessive RAM overhead in Cortex when loading a model Feb 26, 2025
louis-menlo (Contributor) commented

Here is a rough estimate. Could you share the current model and llama.cpp settings? We are planning to integrate a more accurate estimation tool, rearrange the settings, and provide better guidance on which settings have which side effects and what the benefit of each is. E.g. disabling the cache or lowering the KV cache quantization level reduces memory consumption but is slower.

[attached image]
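
To illustrate the KV cache trade-off, a minimal sketch, assuming the same Mistral-Small-24B layer/head counts as above and approximate effective densities for llama.cpp's f16, q8_0, and q4_0 cache types (the per-block scales make the quantized figures approximate):

# Sketch: how the KV cache type affects memory at a fixed context length.
GIB = 1024 ** 3
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128  # assumed Mistral-Small-24B values
CTX_LEN = 8192

# Approximate effective bytes per cached element for each cache type.
cache_types = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

for name, bytes_per_elem in cache_types.items():
    size = 2 * N_LAYERS * CTX_LEN * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    print(f"{name:>5} KV cache at ctx={CTX_LEN}: ~{size / GIB:.2f} GiB")

Quantizing the cache to q8_0 roughly halves its footprint and q4_0 roughly quarters it, at some cost in speed and quality, which is the trade-off described above.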


mtomas7 commented Feb 27, 2025

64K is a pretty high context length; I was using 8192. I will check how LM Studio and AnythingLLM deal with memory to have a comparison.
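
One way to make that comparison apples-to-apples is to read the resident set size (RSS) of each app's inference process rather than the overall system number. A sketch using psutil (the process-name fragments below are placeholders, not the apps' actual binary names):

# Sketch: report resident memory (RSS) of running inference backends for comparison.
import psutil

CANDIDATES = ("llama", "cortex", "lmstudio")  # placeholder name fragments

for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    if any(frag in name for frag in CANDIDATES):
        mem = proc.info["memory_info"]
        if mem is None:  # access denied on some system processes
            continue
        print(f"{proc.info['name']}: RSS ≈ {mem.rss / 1024 ** 3:.2f} GiB")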

My settings (although I see this memory issue with any model):

[attached images: model settings]

# BEGIN GENERAL GGUF METADATA
id: Mistral-Small-24B-Instruct-2501 # Model ID unique between models (author / quantization)
model: Mistral-Small-24B-Instruct-2501-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Mistral-Small-24B-Instruct-2501-Q8_0 # metadata.general.name
version: 2
files:             # Can be relative OR absolute local file path
  - E:\AI\models\bartowski\Mistral-Small-24B-Instruct-2501-GGUF\Mistral-Small-24B-Instruct-2501-Q8_0.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - </s>
# END REQUIRED

# BEGIN OPTIONAL
size: 25054779072
stream: true # Default true?
top_p: 0.95 # Ranges: 0 to 1
temperature: 0.15 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Should be default to context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
min_keep: 0
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
engine: llama-cpp # engine to run model
prompt_template: "[INST] {system_message}\n[INST] {prompt} [/INST]"
# END REQUIRED

# BEGIN OPTIONAL
ctx_len: 8192 # llama.context_length | 0 or undefined = loaded from model
n_parallel: 1
ngl: 41 # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS
