Hello guys, I am trying to fine-tune a 7B language model with Hugging Face Transformers and DeepSpeed (ZeRO stage 3, 4 GPUs).
By setting log_level to "info", I can see the DeepSpeed logging output.
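For context, the launch command looks roughly like this (the script name, config path, and output directory are placeholders, not my exact values):

```shell
# Sketch of the launch: DeepSpeed launcher over 4 GPUs, HF Trainer flags.
deepspeed --num_gpus 4 train.py \
  --deepspeed ds_config_zero3.json \
  --fp16 True \
  --log_level info \
  --output_dir ./out
```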
When I pass --fp16 True \ in the Transformers arguments and do normal mixed-precision training, the DeepSpeed log is:
[2024-02-02 08:55:15,983] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.3, git-hash=unknown, git-branch=unknown
[2024-02-02 08:55:15,988] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
[2024-02-02 08:55:15,989] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-02-02 08:55:15,989] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-02-02 08:55:15,996] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-02-02 08:55:15,996] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-02-02 08:55:15,996] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-02-02 08:55:15,996] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2024-02-02 08:55:16,069] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-02-02 08:55:16,069] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 6.26 GB CA 5.45 GB Max_CA 7 GB
[2024-02-02 08:55:16,070] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.05 GB, percent = 10.6%
[2024-02-02 08:55:16,071] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
[2024-02-02 08:55:16,071] [INFO] [stage3.py:128:__init__] Prefetch bucket size 50,000,000
[2024-02-02 08:55:16,135] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-02-02 08:55:16,135] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 3.63 GB CA 5.45 GB Max_CA 5 GB
[2024-02-02 08:55:16,136] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.06 GB, percent = 10.6%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-02-02 08:55:16,215] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-02-02 08:55:16,216] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 3.63 GB CA 5.45 GB Max_CA 5 GB
[2024-02-02 08:55:16,216] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.13 GB, percent = 10.6%
[2024-02-02 08:55:16,289] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-02-02 08:55:16,290] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 3.63 GB CA 5.45 GB Max_CA 5 GB
[2024-02-02 08:55:16,290] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.1 GB, percent = 10.6%
[2024-02-02 08:55:20,338] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2024-02-02 08:55:20,339] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 3.63 GB CA 6.75 GB Max_CA 7 GB
[2024-02-02 08:55:20,339] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.03 GB, percent = 10.6%
[2024-02-02 08:55:20,401] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-02-02 08:55:20,402] [INFO] [utils.py:803:see_memory_usage] MA 3.63 GB Max_MA 3.63 GB CA 6.75 GB Max_CA 7 GB
[2024-02-02 08:55:20,402] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.11 GB, percent = 10.6%
[2024-02-02 08:55:20,508] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-02-02 08:55:20,508] [INFO] [utils.py:803:see_memory_usage] MA 10.62 GB Max_MA 12.25 GB CA 15.61 GB Max_CA 16 GB
[2024-02-02 08:55:20,509] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.09 GB, percent = 10.6%
[2024-02-02 08:55:20,571] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-02-02 08:55:20,572] [INFO] [utils.py:803:see_memory_usage] MA 10.62 GB Max_MA 10.62 GB CA 15.61 GB Max_CA 16 GB
[2024-02-02 08:55:20,572] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.16 GB, percent = 10.6%
[2024-02-02 08:55:20,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 156.79
[2024-02-02 08:55:20,816] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-02-02 08:55:20,817] [INFO] [utils.py:803:see_memory_usage] MA 24.6 GB Max_MA 31.59 GB CA 37.06 GB Max_CA 37 GB
[2024-02-02 08:55:20,817] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 102.98 GB, percent = 10.6%
[2024-02-02 08:55:20,818] [INFO] [stage3.py:482:_setup_for_real_optimizer] optimizer state initialized
[2024-02-02 08:55:21,164] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-02-02 08:55:21,164] [INFO] [utils.py:803:see_memory_usage] MA 29.02 GB Max_MA 30.94 GB CA 37.06 GB Max_CA 37 GB
[2024-02-02 08:55:21,165] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.1 GB, percent = 10.6%
[2024-02-02 08:55:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-02-02 08:55:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-02-02 08:55:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-02-02 08:55:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.98)]
and everything looks fine.
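As a sanity check, these numbers line up with simple ZeRO-3 arithmetic. I am assuming roughly 7.5e9 parameters here (my guess for a "7B-class" model, not a measured count), sharded across 4 GPUs:

```python
# Per-GPU memory accounting for the mixed-precision run, under the
# assumption of ~7.5e9 trainable parameters sharded across 4 GPUs.
GiB = 2**30
n_params = 7.5e9   # assumed; "7B-class" models are often slightly above 7e9
n_gpus = 4

fp16_shard = n_params * 2 / n_gpus / GiB           # fp16 param shard per GPU
fp32_master = n_params * 4 / n_gpus / GiB          # fp32 master-weight shard
adamw_states = 2 * n_params * 4 / n_gpus / GiB     # exp_avg + exp_avg_sq, fp32

print(f"fp16 shard   ~ {fp16_shard:.2f} GiB   (log: MA 3.63 GB at init)")
print(f"fp32 master  ~ {fp32_master:.2f} GiB  (log: 10.62 - 3.63 = 6.99 GB)")
print(f"AdamW states ~ {adamw_states:.2f} GiB (log: 24.6 - 10.62 = 13.98 GB)")
```

So the 3.63 GB at init is the fp16 shard plus some buffers, and the two later jumps match the fp32 master weights and the two AdamW state buffers, exactly as expected for mixed precision.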
Then, when I test with a full fp32 model by setting --fp16 False \ , the log is:
[2024-02-02 08:24:53,133] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-02-02 08:24:53,133] [INFO] [utils.py:803:see_memory_usage] MA 7.18 GB Max_MA 11.01 GB CA 7.2 GB Max_CA 11 GB
[2024-02-02 08:24:53,134] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.15 GB, percent = 10.6%
[2024-02-02 08:24:53,135] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
[2024-02-02 08:24:53,135] [INFO] [stage3.py:128:__init__] Prefetch bucket size 50,000,000
[2024-02-02 08:24:53,212] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-02-02 08:24:53,212] [INFO] [utils.py:803:see_memory_usage] MA 7.18 GB Max_MA 7.18 GB CA 7.2 GB Max_CA 7 GB
[2024-02-02 08:24:53,212] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.16 GB, percent = 10.6%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-02-02 08:24:53,302] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-02-02 08:24:53,303] [INFO] [utils.py:803:see_memory_usage] MA 7.18 GB Max_MA 7.18 GB CA 7.2 GB Max_CA 7 GB
[2024-02-02 08:24:53,303] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.26 GB, percent = 10.6%
[2024-02-02 08:24:53,385] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-02-02 08:24:53,386] [INFO] [utils.py:803:see_memory_usage] MA 7.18 GB Max_MA 7.18 GB CA 7.2 GB Max_CA 7 GB
[2024-02-02 08:24:53,386] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.23 GB, percent = 10.6%
[2024-02-02 08:24:58,336] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2024-02-02 08:24:58,337] [INFO] [utils.py:803:see_memory_usage] MA 7.12 GB Max_MA 7.18 GB CA 10.15 GB Max_CA 10 GB
[2024-02-02 08:24:58,337] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 110.13 GB, percent = 11.3%
[2024-02-02 08:24:58,409] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-02-02 08:24:58,410] [INFO] [utils.py:803:see_memory_usage] MA 7.12 GB Max_MA 7.12 GB CA 10.15 GB Max_CA 10 GB
[2024-02-02 08:24:58,410] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 110.12 GB, percent = 11.3%
[2024-02-02 08:24:58,515] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-02-02 08:24:58,516] [INFO] [utils.py:803:see_memory_usage] MA 14.11 GB Max_MA 14.11 GB CA 17.14 GB Max_CA 17 GB
[2024-02-02 08:24:58,516] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 110.15 GB, percent = 11.3%
[2024-02-02 08:25:02,335] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-02-02 08:25:02,336] [INFO] [utils.py:803:see_memory_usage] MA 14.11 GB Max_MA 14.11 GB CA 17.14 GB Max_CA 17 GB
[2024-02-02 08:25:02,336] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.19 GB, percent = 10.6%
[2024-02-02 08:25:02,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 149.28
[2024-02-02 08:25:02,579] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-02-02 08:25:02,579] [INFO] [utils.py:803:see_memory_usage] MA 28.09 GB Max_MA 35.09 GB CA 38.59 GB Max_CA 39 GB
[2024-02-02 08:25:02,580] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.26 GB, percent = 10.6%
[2024-02-02 08:25:02,580] [INFO] [stage3.py:482:_setup_for_real_optimizer] optimizer state initialized
[2024-02-02 08:25:03,116] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-02-02 08:25:03,117] [INFO] [utils.py:803:see_memory_usage] MA 36.95 GB Max_MA 40.78 GB CA 47.5 GB Max_CA 48 GB
[2024-02-02 08:25:03,117] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 103.16 GB, percent = 10.6%
[2024-02-02 08:25:03,117] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-02-02 08:25:03,117] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-02-02 08:25:03,117] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-02-02 08:25:03,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.98)]
At "Stage 3 initialize beginning" each GPU already holds its fp32 parameters, yet during "creating fp32 partitions" the memory cost doubles. Why is an fp32 copy created again here?
I see no such problem in the mixed-precision run above: each GPU holds its fp16 parameters at initialization, and GPU memory does not change after creating the fp16 partitions.
So is this a bug, or have I misunderstood something?
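For what it's worth, the doubling is numerically consistent with ZeRO-3 allocating a separate flat fp32 master partition in addition to the already-fp32 sharded model weights. A rough check under the same assumption of ~7.5e9 parameters as above (my guess, not a measured value):

```python
# Rough per-GPU accounting for the full-fp32 run, assuming ~7.5e9
# parameters sharded across 4 GPUs. The second fp32 allocation would be
# a flat master-weight partition kept by the ZeRO stage 3 optimizer
# separately from the model weights, regardless of model dtype.
GiB = 2**30
n_params = 7.5e9   # assumed parameter count
n_gpus = 4

fp32_model_shard = n_params * 4 / n_gpus / GiB       # ~6.98 GiB; log: MA 7.18 GB at init
fp32_master_shard = n_params * 4 / n_gpus / GiB      # ~6.98 GiB; log: 14.11 - 7.12 = 6.99 GB
adamw_state_shard = 2 * n_params * 4 / n_gpus / GiB  # ~13.97 GiB; log: 28.09 - 14.11 = 13.98 GB

print(fp32_model_shard, fp32_master_shard, adamw_state_shard)
```

If that reading is right, the "extra" fp32 copy is the optimizer's flat master partition rather than a second copy of the model itself, but I would still appreciate confirmation.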
Thanks in advance!