
Model loaded with PreTrainedModel.from_pretrained under the with torch.device("cuda"): context manager leads to unexpected errors compared to .to("cuda") #35371

fxmarty-amd opened this issue Dec 20, 2024 · 2 comments

fxmarty-amd commented Dec 20, 2024

System Info

- `transformers` version: 4.48.0.dev0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- Python version: 3.10.14
- Huggingface_hub version: 0.26.3
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+rocm6.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: AMD Instinct MI250X/MI250

transformers commit 4567ee8

Who can help?

@mht-sharma maybe you know

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, AutoConfig
import torch

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

cfg = AutoConfig.from_pretrained(model_id)
# cfg.num_hidden_layers = 4

# Works: randomly initialized model created directly on the GPU.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_config(cfg, torch_dtype=torch.bfloat16)

# Fails at generation time on this ROCm setup: pretrained weights loaded under the context manager.
# with torch.device("cuda"):
#     model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

param_size = 0
for name, param in model.named_parameters():
    param_size += param.nelement() * param.element_size()
    print(name, param.dtype)
buffer_size = 0
for name, buffer in model.named_buffers():
    buffer_size += buffer.nelement() * buffer.element_size()
    print(name, buffer.dtype)

size_all_gb = (param_size + buffer_size) * 1e-9
print('model size: {:.3f} GB'.format(size_all_gb))

tokenizer = AutoTokenizer.from_pretrained(model_id)

inp = tokenizer("Hello my friends, how are you?", return_tensors="pt").to("cuda")

gen_config = GenerationConfig(
    max_new_tokens=100,
    min_new_tokens=100,
    use_cache=True,
    num_beams=1,
    do_sample=False,
)

print("generating")
res = model.generate(**inp, generation_config=gen_config)

print(tokenizer.batch_decode(res))

When using with torch.device("cuda"): to load a pretrained model on the device (the from_pretrained path), I am getting various unexpected errors such as HIPBLAS_STATUS_INTERNAL_ERROR when calling hipblasLtMatmul, or RuntimeError: HIP error: no kernel image is available for execution on the device.

However, when loading a (dummy) model with

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_config(cfg, torch_dtype=torch.bfloat16)

everything is fine at runtime, no error.

Similarly, when loading with

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

there is no error at runtime.
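For context, a general PyTorch note rather than anything Transformers-specific: when torch.device("cuda") is used as a context manager, factory functions default to creating tensors on the GPU, which presumably is what makes from_pretrained materialize weights directly on the device here. A minimal sketch of that behavior:

import torch

# Inside the context manager, factory functions default to the GPU device.
with torch.device("cuda"):
    x = torch.empty(2, 2)
print(x.device)  # expected: cuda:0

# Outside of it, tensors are created on CPU unless a device is passed explicitly.
y = torch.empty(2, 2)
print(y.device)  # expected: cpu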

I do not have access to an NVIDIA GPU at the moment, so I could not check whether this issue also exists with the CUDA (NVIDIA) PyTorch distribution. I am therefore not sure whether this is a ROCm, PyTorch, or Transformers bug; it might need some investigation.

I could reproduce the issue on a few previous Transformers versions (4.45, 4.46, 4.47).

Filing for awareness; this might need some more investigation and/or extended testing in the Transformers CI.

Interestingly, I could not reproduce this issue with peft-internal-testing/tiny-random-qwen-1.5-MoE, but only with Qwen/Qwen1.5-MoE-A2.7B-Chat.

Expected behavior

No error


fxmarty-amd commented Dec 20, 2024

The issue cannot be reproduced with smaller models (e.g. this Qwen MoE with only 20 layers instead of 24).

Using the original failing Qwen/Qwen1.5-MoE-A2.7B-Chat with 24 layers and inspecting torch.cuda.memory_reserved/torch.cuda.memory_allocated, it appears that when using the with torch.device("cuda"): context manager, PyTorch/Transformers reserves much more memory than the model size (numbers taken right after loading):

model size: 28.632 GB
memory reserved GB: 68.176314368
memory allocated GB: 28.651504128000003

The reserved memory is very close to the ~64 GiB of memory of one MI250 GCD, so some AMD libraries or the ROCm PyTorch implementation might be misbehaving, although I would rather have expected the classic HIP out of memory. Tried to allocate xxxx message.

Compare this to using model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda"):

model size: 28.632 GB
memory reserved GB: 34.120663040000004
memory allocated GB: 28.63262976

Allocated memory is similar, though.
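For reference, a minimal sketch of how these numbers can be collected right after loading, using the same torch.cuda calls mentioned above (the 1e-9 factor matches the decimal-GB convention of the reproduction script):

import torch

# Run right after the model has been loaded / moved to the GPU.
print("memory reserved GB: ", torch.cuda.memory_reserved() * 1e-9)
print("memory allocated GB:", torch.cuda.memory_allocated() * 1e-9)

# Total memory of the device (one MI250 GCD exposes ~64 GiB), for comparison.
print("total device memory GB:", torch.cuda.get_device_properties(0).total_memory * 1e-9)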

Rocketknight1 (Member) commented:

Hmmn - there is no direct memory allocation in transformers that I'm aware of that doesn't go through Torch. Is the reserved memory because of a spike in usage during loading? If so, we might be able to mitigate it, but if it's just Torch reserving/fragmenting memory I'm not sure what we can do!
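In case it helps narrow this down, a minimal sketch (my own suggestion, not something from the report) of how one could check whether the extra reservation comes from a transient spike during loading, using PyTorch's peak-memory statistics:

import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

# Reset the peak counters before loading so the peak reflects the loading phase only.
torch.cuda.reset_peak_memory_stats()

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# If the peak is much higher than the steady-state values, the extra reservation
# comes from a spike while loading rather than from fragmentation afterwards.
print("peak reserved GB:   ", torch.cuda.max_memory_reserved() * 1e-9)
print("peak allocated GB:  ", torch.cuda.max_memory_allocated() * 1e-9)
print("current reserved GB:", torch.cuda.memory_reserved() * 1e-9)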
