
Model loaded with PreTrainedModel.from_pretrained under the with torch.device("cuda"): context manager leads to unexpected errors compared to .to("cuda") #35371

fxmarty-amd opened this issue Dec 20, 2024 · 2 comments

fxmarty-amd commented Dec 20, 2024

System Info

- `transformers` version: 4.48.0.dev0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- Python version: 3.10.14
- Huggingface_hub version: 0.26.3
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+rocm6.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: AMD Instinct MI250X/MI250

transformers commit 4567ee8

Who can help?

@mht-sharma maybe you know

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, AutoConfig
import torch

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

cfg = AutoConfig.from_pretrained(model_id)
# cfg.num_hidden_layers = 4

# Works: randomly initialized model created directly on the GPU.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_config(cfg, torch_dtype=torch.bfloat16)

# Fails at generation time on this ROCm setup: pretrained weights loaded under the context manager.
# with torch.device("cuda"):
#     model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

param_size = 0
for name, param in model.named_parameters():
    param_size += param.nelement() * param.element_size()
    print(name, param.dtype)
buffer_size = 0
for name, buffer in model.named_buffers():
    buffer_size += buffer.nelement() * buffer.element_size()
    print(name, buffer.dtype)

size_all_gb = (param_size + buffer_size) * 1e-9
print('model size: {:.3f} GB'.format(size_all_gb))

tokenizer = AutoTokenizer.from_pretrained(model_id)

inp = tokenizer("Hello my friends, how are you?", return_tensors="pt").to("cuda")

gen_config = GenerationConfig(
    max_new_tokens=100,
    min_new_tokens=100,
    use_cache=True,
    num_beams=1,
    do_sample=False,
)

print("generating")
res = model.generate(**inp, generation_config=gen_config)

print(tokenizer.batch_decode(res))

When using with torch.device("cuda"): to load a pretrained model on the device (the from_pretrained path), I am getting various unexpected errors such as HIPBLAS_STATUS_INTERNAL_ERROR when calling hipblasLtMatmul, or RuntimeError: HIP error: no kernel image is available for execution on the device.

However, when loading a (dummy) model with

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_config(cfg, torch_dtype=torch.bfloat16)

everything is fine at runtime, no error.

Similarly, when loading with

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

there is no error at runtime.
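For context, a general PyTorch note rather than anything Transformers-specific: when torch.device("cuda") is used as a context manager, factory functions default to creating tensors on the GPU, which presumably is what makes from_pretrained materialize weights directly on the device here. A minimal sketch of that behavior:

import torch

# Inside the context manager, factory functions default to the GPU device.
with torch.device("cuda"):
    x = torch.empty(2, 2)
print(x.device)  # expected: cuda:0

# Outside of it, tensors are created on CPU unless a device is passed explicitly.
y = torch.empty(2, 2)
print(y.device)  # expected: cpu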

I do not have access to an NVIDIA GPU at the moment, so I could not check whether this issue also exists with the CUDA (NVIDIA) PyTorch distribution. I am therefore not sure whether this is a ROCm, PyTorch, or Transformers bug; it might need some investigation.

I could reproduce the issue on a few previous Transformers versions (4.45, 4.46, 4.47).

Filing for awareness; this might need some more investigation and/or extended testing in the Transformers CI.

Interestingly, I could not reproduce this issue with peft-internal-testing/tiny-random-qwen-1.5-MoE, but only with Qwen/Qwen1.5-MoE-A2.7B-Chat.

Expected behavior

No error


fxmarty-amd commented Dec 20, 2024

The issue cannot be reproduced with smaller models (e.g. this Qwen MoE with only 20 layers instead of 24).

Using the original failing Qwen/Qwen1.5-MoE-A2.7B-Chat with 24 layers and inspecting torch.cuda.memory_reserved/torch.cuda.memory_allocated, it appears that when using the with torch.device("cuda"): context manager, PyTorch/Transformers reserves much more memory than the model size (numbers taken right after loading):

model size: 28.632 GB
memory reserved GB: 68.176314368
memory allocated GB: 28.651504128000003

The reserved memory is very close to the ~64 GiB of memory of one MI250 GCD, so some AMD libraries or the ROCm PyTorch implementation might be misbehaving, although I would rather have expected the classic HIP out of memory. Tried to allocate xxxx message.

Compare this to using model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda"):

model size: 28.632 GB
memory reserved GB: 34.120663040000004
memory allocated GB: 28.63262976

Allocated memory is similar, though.
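For reference, a minimal sketch of how these numbers can be collected right after loading, using the same torch.cuda calls mentioned above (the 1e-9 factor matches the decimal-GB convention of the reproduction script):

import torch

# Run right after the model has been loaded / moved to the GPU.
print("memory reserved GB: ", torch.cuda.memory_reserved() * 1e-9)
print("memory allocated GB:", torch.cuda.memory_allocated() * 1e-9)

# Total memory of the device (one MI250 GCD exposes ~64 GiB), for comparison.
print("total device memory GB:", torch.cuda.get_device_properties(0).total_memory * 1e-9)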

Rocketknight1 (Member) commented:

Hmmn - there is no direct memory allocation in transformers that I'm aware of that doesn't go through Torch. Is the reserved memory because of a spike in usage during loading? If so, we might be able to mitigate it, but if it's just Torch reserving/fragmenting memory I'm not sure what we can do!
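In case it helps narrow this down, a minimal sketch (my own suggestion, not something from the report) of how one could check whether the extra reservation comes from a transient spike during loading, using PyTorch's peak-memory statistics:

import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

# Reset the peak counters before loading so the peak reflects the loading phase only.
torch.cuda.reset_peak_memory_stats()

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# If the peak is much higher than the steady-state values, the extra reservation
# comes from a spike while loading rather than from fragmentation afterwards.
print("peak reserved GB:   ", torch.cuda.max_memory_reserved() * 1e-9)
print("peak allocated GB:  ", torch.cuda.max_memory_allocated() * 1e-9)
print("current reserved GB:", torch.cuda.memory_reserved() * 1e-9)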
