resume_from_checkpoint oom killed #6486

Open

sunrise224 opened this issue Dec 30, 2024 · 2 comments
Labels
pending This problem is yet to be addressed

Comments

@sunrise224

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.44.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.3
  • vLLM version: 0.6.0

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft3.yaml

llama3_lora_sft3.yaml:

### model

model_name_or_path: meta-llama/Meta-Llama-3-70B-Instruct
resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800

### method

stage: sft
do_train: true
finetuning_type: full
flash_attn: fa2

### dataset

dataset: sft_training_data_2w
template: llama3
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 4

### output

output_dir: saves/taught/sft/3-70b-2w-1
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: false
max_length: 8192

### train

per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 100
bf16: true
ddp_timeout: 180000000
seed: 2

### eval

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
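This FutureWarning is emitted by torch.load while the trainer restores the checkpoint; it concerns unpickling safety, not memory. For reference, a minimal sketch of the stricter loading pattern the warning recommends (the file path below is an illustrative example, not a file named in this issue):

```python
import torch

# Illustrative only: load a pickled checkpoint file with the setting the
# warning recommends. The path is a hypothetical example.
state = torch.load(
    "/sft/3-70b-2w-1/checkpoint-800/optimizer.pt",
    map_location="cpu",   # keep the load in host memory instead of on a GPU
    weights_only=True,    # restrict unpickling to tensors and plain containers
)
```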

oom killed

Expected behavior

Training should resume from the checkpoint. The only change to the original config is adding resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800, but with it the run keeps getting OOM-killed and training never starts. Without resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800, training from scratch starts without an OOM (although it may still OOM after running for a while). The situation is essentially the same as https://github.com/hiyouga/LLaMA-Factory/issues/5771.

Others

No response

@github-actions github-actions bot added the pending (This problem is yet to be addressed) label on Dec 30, 2024
@sunrise224
Author

Update

I found the cause: the checkpoint doubled in size because it was saved in fp32, which leads to the OOM.
See #6244.
Strangely, I never configured fp32 saving, yet the checkpoint was still written in fp32. Is that the default?
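A quick way to confirm the dtype a checkpoint was actually written in is to open one of its shards with safetensors. A minimal sketch, assuming a shard filename like the one below (the exact name will differ):

```python
from safetensors import safe_open

# Hypothetical shard path; substitute one of the actual files in the checkpoint dir.
shard = "/sft/3-70b-2w-1/checkpoint-800/model-00001-of-00030.safetensors"

with safe_open(shard, framework="pt") as f:
    name = next(iter(f.keys()))
    tensor = f.get_tensor(name)
    # torch.float32 here would confirm the checkpoint was saved in fp32.
    print(name, tensor.dtype, tensor.shape)
```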

@sunrise224
Author

sunrise224 commented Jan 3, 2025

Update

Even after I converted the checkpoint to bf16 via export, so that the total safetensors size matches the seed model, resuming from the checkpoint still gets OOM-killed, while training from the seed model from scratch does not. My guess is that loading the optimizer state / gradients is what pushes memory over the limit (even during the seed-model run, several GPUs were already close to running out of memory).
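A rough back-of-the-envelope estimate supports that guess: resuming reloads the AdamW optimizer state on top of the weights, and for a 70B full fine-tune that state dwarfs the bf16 weights themselves. A sketch under simplifying assumptions (fp32 gradients and fp32 AdamW moments, evenly sharded across 8 GPUs; the real footprint depends on the FSDP/mixed-precision settings and any offloading):

```python
# Back-of-the-envelope memory estimate for resuming a full fine-tune of a 70B model.
# Illustrative only: assumes fp32 gradients and fp32 AdamW moments, even sharding
# across 8 devices, and ignores activations, buffers, and any CPU offloading.
params = 70e9

bf16_weights = params * 2        # ~140 GB
fp32_grads   = params * 4        # ~280 GB (bf16 gradients would halve this)
adamw_state  = params * 4 * 2    # exp_avg + exp_avg_sq in fp32, ~560 GB

total_gb   = (bf16_weights + fp32_grads + adamw_state) / 1e9
per_gpu_gb = total_gb / 8
print(f"total ≈ {total_gb:.0f} GB, per GPU ≈ {per_gpu_gb:.0f} GB vs 80 GB of HBM")
# total ≈ 980 GB, per GPU ≈ 122 GB vs 80 GB of HBM
```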
