about the memory problem #64
You mean that when you chat with the model, the memory keeps increasing and doesn't decrease after the chat finishes? Could you provide more details, e.g. which model you are using, and any specific code we can use to reproduce this?
The details I can provide: I do not put the embedding on the CPU, and I use the Baichuan2 model. The main problem is that the memory is not released. Chat code:
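(A sketch of the alternative load call implied above, not the snippet referenced by the poster: keeping the embedding in host memory would look roughly like this, assuming bigdl.llm's `from_pretrained` accepts a `cpu_embedding` flag — treat that flag as an assumption.)

```python
# Sketch only: assumes bigdl.llm supports a cpu_embedding flag that keeps
# the embedding layer in host memory instead of on the XPU (the opposite
# of the setup described above).
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Baichuan2-13B-Chat",   # hypothetical local path
    load_in_4bit=True,
    trust_remote_code=True,
    cpu_embedding=True,     # assumption: embedding stays on the CPU
)
model = model.to('xpu')
```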
I cannot reproduce your problem on a Windows 11 system. The memory used by the CPU stays quite stable as the chat stream runs. Here are my steps:

**Test codes**

I verified the issue based on the code provided in the Baichuan2-13B-Chat repo:

```python
from bigdl.llm.transformers import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex
import os
import platform
import subprocess
from colorama import Fore, Style
from tempfile import NamedTemporaryFile

# Raw string so the backslashes in the Windows path are not read as escapes.
model_path = r"D:\llm-models\Baichuan2-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True).bfloat16().eval()

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
model = model.to('xpu')

messages = []
while True:
    prompt = input(Fore.GREEN + Style.BRIGHT + "\nUser: " + Style.NORMAL)
    if prompt.strip() == "exit":
        break
    print(Fore.CYAN + Style.BRIGHT + "\nBaichuan 2:" + Style.NORMAL, end='')
    messages.append({"role": "user", "content": prompt})
    position = 0
    response = ""  # avoid a NameError if the stream is interrupted before the first chunk
    try:
        for response in model.chat(tokenizer, messages, stream=True):
            # Print only the newly generated tail of the streamed response.
            print(response[position:], end='', flush=True)
            position = len(response)
            # Release cached XPU memory as the stream progresses.
            torch.xpu.empty_cache()
    except KeyboardInterrupt:
        pass
    print()
    messages.append({"role": "assistant", "content": response})
```

**Test results**

I chatted ten rounds with the model, appending the history to `messages`. Below are my PowerShell script for memory capture and my Python script for plotting the memory usage.
**PowerShell script for memory capture**

```powershell
# Sample the total working set of all processes roughly every 10 ms.
while ($true) {
    Get-Process | Measure-Object -Property WS -Sum | ForEach-Object { "Total Memory Usage: $($_.Sum / 1MB) MB" } | Out-File test.log -Append
    Start-Sleep -Milliseconds 10
}
```

**Python script for plotting the results**

```python
import matplotlib.pyplot as plt

data = []
# Out-File writes UTF-16 by default on Windows PowerShell.
with open('./test.log', 'r', encoding='utf-16') as file:
    for line in file.readlines()[:-2]:
        # "Total Memory Usage: <value> MB" -> the value is the fourth token.
        mem = line.split()[3]
        data.append(float(mem))

x = list(range(len(data)))
plt.plot(x, data, linestyle='-', label='Total memory')
plt.xlabel('Time')
plt.ylabel('Used/MB')
plt.title('Used Memory Over Time')
plt.legend()
plt.grid(True)
plt.ylim(min(data) - 100, max(data) + 2000)
plt.savefig('memory_usage_plot_load.png')
```

And GPU memory is at a stable level too.
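For completeness, device-side memory can also be sampled from the Python side. A small helper along these lines could be called once per chat round to see whether XPU memory grows with the history — a sketch assuming IPEX exposes `torch.xpu.memory_allocated()`, mirroring the CUDA API:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the torch.xpu backend)

def log_xpu_memory(tag: str) -> None:
    # Assumption: torch.xpu.memory_allocated() reports bytes currently
    # allocated on the XPU, analogous to torch.cuda.memory_allocated().
    allocated_mb = torch.xpu.memory_allocated() / (1024 ** 2)
    print(f"[{tag}] XPU memory allocated: {allocated_mb:.1f} MB")
```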
Each time I interact with the model, the memory it occupies increases and is never released. As a result, after many conversations the model crashes very easily. How can I solve this problem?
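One workaround while this is investigated is to cap the conversation history and free caches between rounds — a sketch reusing the `messages` list from the reproduction above; the cap value is arbitrary:

```python
import gc
import torch

MAX_HISTORY = 10  # hypothetical cap; tune to your memory budget

def end_of_round(messages: list) -> list:
    """Trim the chat history and release cached memory after a round."""
    messages = messages[-MAX_HISTORY:]  # keep only the most recent turns
    gc.collect()                        # drop unreachable Python objects
    torch.xpu.empty_cache()             # return cached XPU blocks to the driver
    return messages
```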