about the memory problem #64
You mean that when you chat with the model, the memory keeps increasing and doesn't decrease after the chat finishes? Could you provide more details, e.g. which model you are using, and any specific code we can use to reproduce this?
The details I can provide: I do not put the embedding on the CPU, and I use the Baichuan2 model. The main problem is that the memory is not released. Chat code:
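(A sketch of the alternative load call implied above, not the snippet referenced by the poster: keeping the embedding in host memory would look roughly like this, assuming bigdl.llm's `from_pretrained` accepts a `cpu_embedding` flag — treat that flag as an assumption.)

```python
# Sketch only: assumes bigdl.llm supports a cpu_embedding flag that keeps
# the embedding layer in host memory instead of on the XPU (the opposite
# of the setup described above).
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Baichuan2-13B-Chat",   # hypothetical local path
    load_in_4bit=True,
    trust_remote_code=True,
    cpu_embedding=True,     # assumption: embedding stays on the CPU
)
model = model.to('xpu')
```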
I cannot reproduce your problem on a Windows 11 system. The memory used by the CPU stays quite stable as the chat stream runs. Here are my steps:

**Test codes**

I verified the issue based on the code provided in the Baichuan2-13B-Chat repo:

```python
from bigdl.llm.transformers import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex
import os
import platform
import subprocess
from colorama import Fore, Style
from tempfile import NamedTemporaryFile

# Raw string so the backslashes in the Windows path are not read as escapes.
model_path = r"D:\llm-models\Baichuan2-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True).bfloat16().eval()

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
model = model.to('xpu')

messages = []
while True:
    prompt = input(Fore.GREEN + Style.BRIGHT + "\nUser: " + Style.NORMAL)
    if prompt.strip() == "exit":
        break
    print(Fore.CYAN + Style.BRIGHT + "\nBaichuan 2:" + Style.NORMAL, end='')
    messages.append({"role": "user", "content": prompt})
    position = 0
    response = ""  # avoid a NameError if the stream is interrupted before the first chunk
    try:
        for response in model.chat(tokenizer, messages, stream=True):
            # Print only the newly generated tail of the streamed response.
            print(response[position:], end='', flush=True)
            position = len(response)
            # Release cached XPU memory as the stream progresses.
            torch.xpu.empty_cache()
    except KeyboardInterrupt:
        pass
    print()
    messages.append({"role": "assistant", "content": response})
```

**Test results**

I chatted ten rounds with the model, appending the history to `messages`. Below are my PowerShell script for memory capture and my Python script for plotting the memory usage.
**PowerShell script for memory capture**

```powershell
# Sample the total working set of all processes roughly every 10 ms.
while ($true) {
    Get-Process | Measure-Object -Property WS -Sum | ForEach-Object { "Total Memory Usage: $($_.Sum / 1MB) MB" } | Out-File test.log -Append
    Start-Sleep -Milliseconds 10
}
```

**Python script for plotting the results**

```python
import matplotlib.pyplot as plt

data = []
# Out-File writes UTF-16 by default on Windows PowerShell.
with open('./test.log', 'r', encoding='utf-16') as file:
    for line in file.readlines()[:-2]:
        # "Total Memory Usage: <value> MB" -> the value is the fourth token.
        mem = line.split()[3]
        data.append(float(mem))

x = list(range(len(data)))
plt.plot(x, data, linestyle='-', label='Total memory')
plt.xlabel('Time')
plt.ylabel('Used/MB')
plt.title('Used Memory Over Time')
plt.legend()
plt.grid(True)
plt.ylim(min(data) - 100, max(data) + 2000)
plt.savefig('memory_usage_plot_load.png')
```

And GPU memory is at a stable level too.
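For completeness, device-side memory can also be sampled from the Python side. A small helper along these lines could be called once per chat round to see whether XPU memory grows with the history — a sketch assuming IPEX exposes `torch.xpu.memory_allocated()`, mirroring the CUDA API:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the torch.xpu backend)

def log_xpu_memory(tag: str) -> None:
    # Assumption: torch.xpu.memory_allocated() reports bytes currently
    # allocated on the XPU, analogous to torch.cuda.memory_allocated().
    allocated_mb = torch.xpu.memory_allocated() / (1024 ** 2)
    print(f"[{tag}] XPU memory allocated: {allocated_mb:.1f} MB")
```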
Each time I interact with the model, the memory it occupies increases and is never released. As a result, after many conversations the model crashes very easily. How can I solve this problem?
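One workaround while this is investigated is to cap the conversation history and free caches between rounds — a sketch reusing the `messages` list from the reproduction above; the cap value is arbitrary:

```python
import gc
import torch

MAX_HISTORY = 10  # hypothetical cap; tune to your memory budget

def end_of_round(messages: list) -> list:
    """Trim the chat history and release cached memory after a round."""
    messages = messages[-MAX_HISTORY:]  # keep only the most recent turns
    gc.collect()                        # drop unreachable Python objects
    torch.xpu.empty_cache()             # return cached XPU blocks to the driver
    return messages
```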