sft has a bug while lora runs successfully #6405

Open · 1 task done
TimeFlysLeo opened this issue Dec 20, 2024 · 0 comments
Labels: pending (This problem is yet to be addressed)

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090 D
  • DeepSpeed version: 0.15.4
  • Bitsandbytes version: 0.45.0

Reproduction

0/1000 [00:00<?, ?it/s]/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
(the same UserWarning is printed once per rank)
[WARNING|logging.py:168] 2024-12-20 18:40:09,732 >> use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 163, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank1]: return inner_training_loop(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank1]: self.engine.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank1]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank1]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]: scaled_loss.backward(retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
[rank1]: return user_fn(self, *args)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 317, in backward
[rank1]: raise RuntimeError(
[rank1]: RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary
[rank2], [rank3], [rank0]: identical traceback, ending in the same "RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary"
0%| | 0/1000 [00:11<?, ?it/s]
W1220 18:40:21.147000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883573 closing signal SIGTERM
W1220 18:40:21.147000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883574 closing signal SIGTERM
W1220 18:40:21.148000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883576 closing signal SIGTERM
E1220 18:40:21.376000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 2883575) of binary: /usr/local/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/llama_factory/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-20_18:40:21
host : gp-SYS-4029GP-TRT2-EC028B
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2883575)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected behavior

I freeze most of the parameters inside the workflow with the code below:

for lay_name, param in model.named_parameters():
    if lay_name in llama3_vl_train_layers:
        param.requires_grad = True
    else:
        param.requires_grad = False
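
Right after this loop, a quick sanity check (my own debugging sketch, not part of LLaMA-Factory) confirms which tensors actually stay trainable:

# Debugging sketch: verify that the layers listed in llama3_vl_train_layers
# really end up with requires_grad=True before training starts.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"trainable tensors: {len(trainable)}")
print(trainable[:20])  # spot-check the first few names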

My code trains fine on a single 3090, and LoRA fine-tuning also works on the 6x 4090D server, but full SFT with the frozen parameters fails with the error above.

I have already set batch_size=1 and cutoff_len=256, yet GPU memory still blows up, so the FAQ you posted does not solve my problem.

I would like to know what causes this and how to fix it. Could it be that my frozen parameters are not evenly partitioned across the GPUs when using DeepSpeed? Many thanks!
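
Based on the error message, one thing I plan to try is making the embedding outputs require gradients before training, so that each checkpointed block sees at least one input with requires_grad=True. This is only a sketch under the assumption that the failure comes from gradient checkpointing over fully frozen inputs; I have not verified it together with DeepSpeed ZeRO-3:

# Possible workaround (assumption: checkpoint() fails because every input to
# the checkpointed blocks has requires_grad=False after freezing).
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()  # transformers PreTrainedModel helper
else:
    # Fallback: hook the input embeddings so their output requires grad.
    def make_inputs_require_grad(module, inputs, output):
        output.requires_grad_(True)

    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)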

Others

My yaml configuration is as follows:

### model
model_name_or_path: /home/Llama-3.2-11B-Vision-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: tenk
dataset: identity,alpaca_en_demo
template: mllama
cutoff_len: 256
max_samples: 800
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-11B-Vision-Instruct/tenk/DCT_4_on_2k_5/vision_32/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
optim: paged_adamw_8bit

### eval
val_size: 0.001
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
github-actions bot added the pending label on Dec 20, 2024