Error when training on a single machine with 2 GPUs:
Traceback (most recent call last):
File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 139, in <module>
main()
File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 135, in main
trainer.train_and_validate(args)
File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 147, in train_and_validate
worker(args.local_rank, None, args)
File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 732, in worker
trainer.train(args, local_rank, global_rank, train_loader, model_for_training, optimizer, scheduler)
File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 193, in train
batch = list(next(loader_iter))
File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/utils/dataloader.py", line 187, in __iter__
yield torch.LongTensor(src),
TypeError: an integer is required (got type NoneType)
The training command is as follows:
CUDA_VISIBLE_DEVICES=6,7 deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path models/llama2-7b.bin --dataset_path llama_support.pt --spm_model_path models/llama/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/llama_support_7b_dpw.bin --world_size 2 --gpu_ranks 0 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 1
Does this error mean there is a problem with the data, or with how the model was loaded?
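One way to narrow this down: the traceback fails inside `torch.LongTensor(src)`, which is consistent with `src` containing a `None` where an integer token id is expected. That would point at the data or tokenization side rather than at model loading. The sketch below is only an illustration under that assumption. `find_non_int_ids` is a hypothetical helper, not part of TencentPretrain, and it assumes `src` is a list of lists of token ids.

```python
from typing import List, Sequence, Tuple


def find_non_int_ids(src: Sequence[Sequence[object]]) -> List[Tuple[int, int, object]]:
    """Return (row, column, value) for every entry that is not a plain int.

    Hypothetical debugging helper: `src` is assumed to be a batch of
    token-id sequences, as passed to torch.LongTensor in the traceback.
    """
    bad = []
    for row_idx, row in enumerate(src):
        for col_idx, value in enumerate(row):
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                bad.append((row_idx, col_idx, value))
    return bad


if __name__ == "__main__":
    # Made-up batch: the second sequence contains a None entry
    batch = [[1, 5, 9], [1, 7, None]]
    print(find_non_int_ids(batch))
```

Calling such a check on `src` just before the failing line would show whether a `None` is present and at which position, for example a padding slot.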