
Can MPS use FP16 when training? Why can't I? #32648

Open
2 of 4 tasks
AimoneAndex opened this issue Aug 13, 2024 · 3 comments

Comments


AimoneAndex commented Aug 13, 2024

System Info

Device: Apple M3 Pro
OS: macOS Sonoma 14.1
Packages:
datasets 2.20.1.dev0
evaluate 0.4.2
huggingface-hub 0.23.5
tokenizers 0.19.1
torch 2.5.0.dev20240717
torchaudio 2.4.0.dev20240717
torchvision 0.20.0.dev20240717

Who can help?

@ArthurZucker @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    GenerationConfig,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset, load_dataset
from peft import LoraConfig, TaskType, get_peft_model, PeftModel, PeftConfig
import torch

ds_name = input('Enter the name of the dataset (CSV file) to train on, without the extension: ')
model_name = input('Enter the name of the model to train (subfolder name): ')
save_name = input('Enter the name under which to save the LoRA adapter: ')

current_dir = os.getcwd()
save_dir = os.path.join(current_dir, 'model_saved', save_name)
os.makedirs(save_dir, exist_ok=True)

target_file_path = os.path.join(current_dir, 'datasets', ds_name + '.csv')
model_dir = os.path.join(current_dir, 'model', model_name)

dataset = load_dataset("csv", data_files=target_file_path, split="train")

tokenizer = AutoTokenizer.from_pretrained(model_dir)
tokenizer.padding_side = "right"
tokenizer.pad_token_id = 2

def process_func(example):
    MAX_LENGTH = 384
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    prompt = f"Human: {instruction}\n{input_text}".strip() if input_text else f"Human: {instruction}".strip()
    instruction_tokenized = tokenizer(prompt + "\n\nAssistant: ", add_special_tokens=False)
    response_tokenized = tokenizer(example["output"], add_special_tokens=False)
    input_ids = instruction_tokenized["input_ids"] + response_tokenized["input_ids"] + [tokenizer.eos_token_id]
    attention_mask = instruction_tokenized["attention_mask"] + response_tokenized["attention_mask"] + [1]
    labels = [-100] * len(instruction_tokenized["input_ids"]) + response_tokenized["input_ids"] + [tokenizer.eos_token_id]
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

tokenized_dataset = dataset.map(process_func, remove_columns=dataset.column_names)
print(tokenized_dataset)

device = torch.device("mps")
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,
)
model = model.to(device)

config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
model.print_trainable_parameters()
model = model.half()

args = TrainingArguments(
    output_dir=save_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)

trainer.train()

Expected behavior

Training should run on MPS in FP16 without transformers raising the error below. Thanks, everyone!
ValueError Traceback (most recent call last)
Cell In[16], line 1
----> 1 trainer = Trainer(
2 model=model,
3 args=args,
4 train_dataset=tokenized_dataset,
5 data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
6 )

File ~/Data/AIHub/Trans-Penv/transformers/src/transformers/trainer.py:409, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
406 self.deepspeed = None
407 self.is_in_train = False
--> 409 self.create_accelerator_and_postprocess()
411 # memory metrics - must set up as early as possible
412 self._memory_tracker = TrainerMemoryTracker(self.args.skip_memory_metrics)

File ~/Data/AIHub/Trans-Penv/transformers/src/transformers/trainer.py:4648, in Trainer.create_accelerator_and_postprocess(self)
4645 args.update(accelerator_config)
4647 # create accelerator object
-> 4648 self.accelerator = Accelerator(**args)
4649 # some Trainer classes need to use gather instead of gather_for_metrics, thus we store a flag
4650 self.gather_function = self.accelerator.gather_for_metrics

File /opt/anaconda3/envs/tfs/lib/python3.12/site-packages/accelerate/accelerator.py:467, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, dispatch_batches, even_batches, use_seedable_sampler, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend)
...
--> 467 raise ValueError(f"fp16 mixed precision requires a GPU (not {self.device.type!r}).")
468 kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
469 if self.distributed_type == DistributedType.FSDP:

ValueError: fp16 mixed precision requires a GPU (not 'mps').
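
For context (my addition, not part of the original report): this check lives in Accelerate and fires whenever fp16 mixed precision is requested on a non-CUDA device; the MPS backend itself can still hold half-precision weights. A minimal sketch, assuming a recent PyTorch build, for confirming that MPS is usable in a given environment:

import torch

# Sketch only: check that the MPS backend is compiled into this PyTorch build
# and that it is actually usable on this machine.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

# Half-precision weights on MPS work; it is the fp16 mixed-precision path
# (autocast + GradScaler via Accelerate) that was still gated at the time.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.ones(2, 2, dtype=torch.half, device=device)
print(x.dtype, x.device)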

muellerzr (Contributor) commented Aug 13, 2024

Keeping the other issue closed and commenting over here: #32035 (comment)

TL;DR it's in the torch nightlies, PyTorch only merged support like last week. Once it's on a stable release we'll enable it
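
Until that stable release, a possible interim workaround (my sketch, not an official recommendation from this thread) is to request fp16 mixed precision only when a CUDA GPU is present, so the Accelerator check is never triggered on MPS:

import torch
from transformers import TrainingArguments

# Assumption: gate the fp16 flag on CUDA availability; on Apple Silicon (MPS)
# this leaves mixed precision off, so Accelerator does not raise the ValueError.
use_fp16 = torch.cuda.is_available()

args = TrainingArguments(
    output_dir="model_saved/example_lora",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=2,
    fp16=use_fp16,
)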

andimarafioti linked a pull request Aug 13, 2024 that will close this issue
andimarafioti removed a link to a pull request Aug 13, 2024
@AimoneAndex (Author)

> Keeping the other issue closed and commenting over here: #32035 (comment)
>
> TL;DR it's in the torch nightlies, PyTorch only merged support like last week. Once it's on a stable release we'll enable it

OK! Thanks a lot!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
