Describe the bug
When ZeRO++ is used and zero_hpz_partition_size is set, the loss at the first training step, or at the first step after loading a checkpoint, is abnormally high.
To Reproduce
Qwen3 SFT training with ZeRO stage 3 on one node with 8 GPUs, with zero_hpz_partition_size set to 4.
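For reference, the ZeRO settings described above correspond roughly to a config like the one below. This is a minimal sketch, not the exact config from the failing run: only stage 3 and zero_hpz_partition_size = 4 come from this report, while the helper name make_ds_config, the micro-batch size, and the bf16 setting are placeholder assumptions.

```python
import json


def make_ds_config(hpz_partition_size=None):
    """Build a minimal DeepSpeed config dict.

    Sketch only: everything except `stage` and `zero_hpz_partition_size`
    is a placeholder assumption, not taken from the failing run.
    """
    zero = {"stage": 3}
    if hpz_partition_size is not None:
        # ZeRO++ hierarchical partitioning (hpZ): size of the secondary
        # partition group, 4 in the failing run.
        zero["zero_hpz_partition_size"] = hpz_partition_size
    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder value
        "bf16": {"enabled": True},            # placeholder value
        "zero_optimization": zero,
    }


base_config = make_ds_config()                     # key left unset
bug_config = make_ds_config(hpz_partition_size=4)  # key set to 4
print(json.dumps(bug_config, indent=2))
```

The two dicts differ only in the zero_hpz_partition_size key, which is the single variable changed between the runs compared in this report.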
Screenshots
Two experiments:
base: zero_hpz_partition_size is not set
bug: zero_hpz_partition_size is set to 4; a checkpoint is saved at step 50, and training is then resumed from that checkpoint
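The save-then-resume check in the bug experiment can be sketched as below. This is an illustration under assumptions, not the script used for the runs: the helper name loss_jump_after_resume and the run_one_step callback are hypothetical, and the engine is assumed to be the object returned by deepspeed.initialize, of which only save_checkpoint and load_checkpoint are used.

```python
def loss_jump_after_resume(engine, ckpt_dir, tag, loss_before, run_one_step):
    """Save a checkpoint, reload it, and report how much the loss moved.

    `engine` is assumed to be a DeepSpeed engine; `run_one_step` is a
    hypothetical callback that runs one training step and returns the loss.
    """
    # Must be called on every rank, since ZeRO stage 3 shards state.
    engine.save_checkpoint(ckpt_dir, tag=tag)
    engine.load_checkpoint(ckpt_dir, tag=tag)
    loss_after = run_one_step(engine)
    # Near zero when the resume is healthy; large and positive when the
    # first step after loading shows the spike described in this report.
    return loss_after - loss_before
```

Calling it at step 50 in both experiments, with tag set to something like "step50", would give one number per run to compare.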
Docker context
based on image: ghcr.io/pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
python: 3.11.13
deepspeed: 0.16.9