Full parameter fine-tuning bugs #864
Additionally, it works when I use LLaMA3-8B. (16-bit train: False)
From your log, it seems that you're using an old CUDA version (11.2). We recommend using 11.8 or 12.0 to avoid other potential issues.
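To see quickly which CUDA versions a log like the one in this issue is complaining about, here is a minimal sketch that scans log text for the "Installed CUDA version ... does not match" message. The function name `find_cuda_mismatches` and the sample string are made up for illustration; only the message wording is taken from the log below.

```python
import re

# Pattern for the mismatch message that appears in the log of this issue.
MISMATCH_RE = re.compile(
    r"Installed CUDA version (\d+\.\d+) does not match the version "
    r"torch was compiled with (\d+\.\d+)"
)


def find_cuda_mismatches(log_text):
    """Return the distinct (installed, compiled_with) version pairs found in a log."""
    return sorted(set(MISMATCH_RE.findall(log_text)))


# Illustrative sample: the same message repeated, as it is once per rank.
sample = (
    "Installed CUDA version 11.2 does not match the version torch was "
    "compiled with 11.8 but since the APIs are compatible, accepting this combination\n"
) * 3

for installed, compiled in find_cuda_mismatches(sample):
    print(f"system CUDA {installed} vs torch build CUDA {compiled}")
```

Running it on the full training log should collapse the per-rank repeats into a single pair, which makes the mismatch easier to spot.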
For 2, I also noticed that you're using dsz3 offload config:
Maybe try
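The config snippets referred to in this comment are not preserved here. For anyone who wants to check whether a DeepSpeed config has ZeRO offload turned on, the following is a rough sketch, not the maintainers' suggestion. It assumes the standard `zero_optimization` / `offload_optimizer` / `offload_param` keys; the helper name `offload_targets` and the example config contents are hypothetical. The `cpu_adam` extension being loaded in the log below would be consistent with optimizer offload being active, but that is an inference.

```python
import json


def offload_targets(ds_config):
    """Return the ZeRO offload sections of a DeepSpeed config dict that are active."""
    zero = ds_config.get("zero_optimization", {})
    active = {}
    for key in ("offload_optimizer", "offload_param"):
        device = zero.get(key, {}).get("device", "none")
        if device != "none":
            active[key] = device
    return active


# Hypothetical config contents, for illustration only.
example = json.loads("""
{
  "bf16": {"enabled": "auto"},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true}
  }
}
""")

print(offload_targets(example))
```

In practice the dict would come from `json.load` on the file passed to `--deepspeed`; an empty result means no offload section is enabled.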
I got this bug.
bash run.sh
[2024-06-20 23:47:57,121] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:00,497] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2024-06-20 23:48:00,497] [INFO] [runner.py:555:main] cmd = usr/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None usr/LMFlow/examples/finetune.py --model_name_or_path usr/huggingface/hub/LLM-Research/Meta-Llama-3-70B-Instruct --trust_remote_code True --dataset_path usr/LMFlow/data/mbpp/train_conversation --output_dir output_models/mbpp_full --overwrite_output_dir --conversation_template llama3 --num_train_epochs 3 --learning_rate 2e-5 --disable_group_texts 1 --block_size 1024 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name mbpp_full --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 32
[2024-06-20 23:48:01,792] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:03,210] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth2
[2024-06-20 23:48:03,210] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-06-20 23:48:03,210] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-06-20 23:48:03,210] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-06-20 23:48:03,210] [INFO] [launch.py:163:main] dist_world_size=8
[2024-06-20 23:48:03,210] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-06-20 23:48:09,949] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,954] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,954] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 23:48:09,960] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,325] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,325] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[2024-06-20 23:48:17,326] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: False
06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: False
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
[WARNING|logging.py:329] 2024-06-20 23:48:18,533 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,533 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,535 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,535 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,537 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,540 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,544 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:329] 2024-06-20 23:48:18,588 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[WARNING|logging.py:314] 2024-06-20 23:48:18,808 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,811 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,818 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,819 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,825 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,833 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,833 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-06-20 23:48:18,860 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-20 23:48:22,245] [INFO] [partition_parameters.py:326:exit] finished initializing model with 70.55B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [04:24<00:00, 8.80s/it]
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3')
06/20/2024 23:52:48 - WARNING - accelerate.utils.other - Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja...
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8547587394714355 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.794377326965332 seconds
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja...
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8901524543762207 seconds
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Using usr/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja...
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.5710041522979736 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.6291885375976562 seconds
Time to load cpu_adam op: 3.6292471885681152 seconds
Time to load cpu_adam op: 3.62994384765625 seconds
Time to load cpu_adam op: 3.630335807800293 seconds
Parameter Offload: Total persistent parameters: 1318912 in 161 params
wandb: Currently logged in as: · (·). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.17.2 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.14.0
wandb: Run data is saved locally in usr/LMFlow/wandb/run-20240620_235342-btcnpc4h
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run mbpp_full
wandb: ⭐️ View project at https://wandb.ai/·/huggingface
wandb: 🚀 View run at https://wandb.ai/·/huggingface/runs/btcnpc4h
0%| | 0/120 [00:00<?, ?it/s][2024-06-20 23:53:53,238] [WARNING] [parameter_offload.py:86:apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable.execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/psutil/__init__.py:2008: RuntimeWarning: available memory stats couldn't be determined and was set to 0
ret = _psplatform.virtual_memory()
[2024-06-20 23:57:07,420] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29901
[2024-06-20 23:57:19,311] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29902
[2024-06-20 23:57:38,331] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29903
[2024-06-20 23:57:52,665] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29904
[2024-06-20 23:57:52,665] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29905
[2024-06-20 23:58:06,209] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29906
[2024-06-20 23:58:19,260] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29907
[2024-06-20 23:58:33,505] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29971
[2024-06-20 23:58:49,679] [ERROR] [launch.py:321:sigkill_handler] ['usr/anaconda3/envs/lmflow/bin/python', '-u', 'usr/LMFlow/examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'usr/huggingface/hub/LLM-Research/Meta-Llama-3-70B-Instruct', '--trust_remote_code', 'True', '--dataset_path', 'usr/LMFlow/data/mbpp/train_conversation', '--output_dir', 'output_models/mbpp_full', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '3', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '1024', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'mbpp_full', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '32'] exits with return code = -9
`
And my run_finetune.sh is:
I have two questions: