resume_from_checkpoint oom killed #6486

Open

sunrise224 opened this issue Dec 30, 2024 · 2 comments
Labels
pending This problem is yet to be addressed

Comments

@sunrise224

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.44.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.3
  • vLLM version: 0.6.0

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft3.yaml

llama3_lora_sft3.yaml:

### model

model_name_or_path: meta-llama/Meta-Llama-3-70B-Instruct
resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800

### method

stage: sft
do_train: true
finetuning_type: full
flash_attn: fa2

### dataset

dataset: sft_training_data_2w
template: llama3
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 4

### output

output_dir: saves/taught/sft/3-70b-2w-1
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: false
max_length: 8192

### train

per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 100
bf16: true
ddp_timeout: 180000000
seed: 2

### eval

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
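This FutureWarning is emitted by torch.load while the trainer restores the checkpoint; it concerns unpickling safety, not memory. For reference, a minimal sketch of the stricter loading pattern the warning recommends (the file path below is an illustrative example, not a file named in this issue):

```python
import torch

# Illustrative only: load a pickled checkpoint file with the setting the
# warning recommends. The path is a hypothetical example.
state = torch.load(
    "/sft/3-70b-2w-1/checkpoint-800/optimizer.pt",
    map_location="cpu",   # keep the load in host memory instead of on a GPU
    weights_only=True,    # restrict unpickling to tensors and plain containers
)
```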

oom killed

Expected behavior

Training should resume from the checkpoint. The only change to the original config is adding resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800, but with it the run keeps getting OOM-killed and training never starts. Without resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800, training from scratch starts without an OOM (although it may still OOM after running for a while). The situation is essentially the same as https://github.com/hiyouga/LLaMA-Factory/issues/5771.

Others

No response

@github-actions github-actions bot added the pending (This problem is yet to be addressed) label on Dec 30, 2024
@sunrise224
Author

Update

I found the cause: the checkpoint doubled in size because it was saved in fp32, which leads to the OOM.
See #6244.
Strangely, I never configured fp32 saving, yet the checkpoint was still written in fp32. Is that the default?
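A quick way to confirm the dtype a checkpoint was actually written in is to open one of its shards with safetensors. A minimal sketch, assuming a shard filename like the one below (the exact name will differ):

```python
from safetensors import safe_open

# Hypothetical shard path; substitute one of the actual files in the checkpoint dir.
shard = "/sft/3-70b-2w-1/checkpoint-800/model-00001-of-00030.safetensors"

with safe_open(shard, framework="pt") as f:
    name = next(iter(f.keys()))
    tensor = f.get_tensor(name)
    # torch.float32 here would confirm the checkpoint was saved in fp32.
    print(name, tensor.dtype, tensor.shape)
```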

@sunrise224
Author

sunrise224 commented Jan 3, 2025

Update

Even after I converted the checkpoint to bf16 via export, so that the total safetensors size matches the seed model, resuming from the checkpoint still gets OOM-killed, while training from the seed model from scratch does not. My guess is that loading the optimizer state / gradients is what pushes memory over the limit (even during the seed-model run, several GPUs were already close to running out of memory).
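A rough back-of-the-envelope estimate supports that guess: resuming reloads the AdamW optimizer state on top of the weights, and for a 70B full fine-tune that state dwarfs the bf16 weights themselves. A sketch under simplifying assumptions (fp32 gradients and fp32 AdamW moments, evenly sharded across 8 GPUs; the real footprint depends on the FSDP/mixed-precision settings and any offloading):

```python
# Back-of-the-envelope memory estimate for resuming a full fine-tune of a 70B model.
# Illustrative only: assumes fp32 gradients and fp32 AdamW moments, even sharding
# across 8 devices, and ignores activations, buffers, and any CPU offloading.
params = 70e9

bf16_weights = params * 2        # ~140 GB
fp32_grads   = params * 4        # ~280 GB (bf16 gradients would halve this)
adamw_state  = params * 4 * 2    # exp_avg + exp_avg_sq in fp32, ~560 GB

total_gb   = (bf16_weights + fp32_grads + adamw_state) / 1e9
per_gpu_gb = total_gb / 8
print(f"total ≈ {total_gb:.0f} GB, per GPU ≈ {per_gpu_gb:.0f} GB vs 80 GB of HBM")
# total ≈ 980 GB, per GPU ≈ 122 GB vs 80 GB of HBM
```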
