
It's an AlignModel or Deepspeed Zero3 bug. #28808

Open · 2 of 4 tasks · Tracked by #33345 · May be fixed by #33632
necrophagists opened this issue Feb 1, 2024 · 6 comments
Comments

necrophagists commented Feb 1, 2024

System Info

When I try to load the AlignModel weights from a local path and train with DeepSpeed ZeRO-3, I get the following error:

 File "/opt/licy/MyVLM/model/builder.py", line 152, in load_model
    model =AlignModel.from_pretrained(self.args.vm_path)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3559, in _load_pretrained_model
    model.apply(model._initialize_weights)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 885, in apply
    fn(self)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1388, in _initialize_weights
    self._init_weights(module)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/align/modeling_align.py", line 1189, in _init_weights
    nn.init.xavier_uniform_(module.text_projection.weight)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/init.py", line 323, in xavier_uniform_
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/init.py", line 287, in _calculate_fan_in_and_fan_out
    raise ValueError("Fan in and fan out can not be computed for tensor with fewer than 2 dimensions")

Switching to ZeRO-2 doesn't produce the error; ConvNextModel and CLIPVisionModel also don't report an error when trained under ZeRO-3, so I'm thinking there may be a bug in AlignModel?
@amyeroberts @pacman100 @muellerz

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. model = AlignModel.from_pretrained(path)
2. Use ZeRO-3 to train the model.
3. Get the error about xavier_init (a minimal sketch of these steps follows).
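A minimal sketch of these steps (not the reporter's actual code; the checkpoint name and config file path are placeholders, and the script would normally be run under the deepspeed or torchrun launcher):

from transformers import AlignModel, TrainingArguments

# Hypothetical repro: creating TrainingArguments with a ZeRO-3 JSON before
# from_pretrained() activates is_deepspeed_zero3_enabled(), so model
# construction is wrapped in deepspeed.zero.Init().
training_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_zero3_config.json",  # placeholder path to the ZeRO-3 config shown below
)

# Under ZeRO-3, AlignModel._init_weights then calls nn.init.xavier_uniform_ on
# text_projection.weight, whose locally visible shape is 0, and the
# "fan in and fan out" ValueError from the traceback above is raised.
model = AlignModel.from_pretrained("kakaobrain/align-base")  # placeholder checkpoint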

Expected behavior

The expected behavior is to be able to load and train the model without errors.

ArthurZucker (Collaborator) commented:

Hey! Could you give a reproducible snippet? 🤗

necrophagists (Author) commented:

> Hey! Could you give a reproducible snippet? 🤗

Sorry, this is a company project so I can't share the relevant code. I recently traced the problem to this code in modeling_utils.py:

        if is_deepspeed_zero3_enabled():
            import deepspeed
            logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
            init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts

This should be the place the error comes from. In addition, I've noticed that the shape of module.text_projection.weight is 0 when the error is raised (normally it's 480×680). Can you give me some clues?
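For reference on the zero-sized shape: under deepspeed.zero.Init() each parameter is partitioned across ranks at construction time, so the locally visible tensor is empty until it is gathered, which is why xavier_uniform_ cannot compute fan-in/fan-out. The pattern DeepSpeed provides for touching full weights during initialization looks roughly like the sketch below (an illustration only, not the fix that went into transformers; assumes deepspeed is installed):

import deepspeed
import torch.nn as nn
from transformers.integrations import is_deepspeed_zero3_enabled

def init_align_text_projection(module):
    # Sketch: gather the ZeRO-3-partitioned weight so it has its full 2-D shape;
    # with modifier_rank=0, rank 0's in-place init is broadcast and the weight is
    # re-partitioned when the context exits.
    if is_deepspeed_zero3_enabled():
        with deepspeed.zero.GatheredParameters(module.text_projection.weight, modifier_rank=0):
            nn.init.xavier_uniform_(module.text_projection.weight)
    else:
        nn.init.xavier_uniform_(module.text_projection.weight)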

Here's my zero3 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
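For context, the "auto" entries in a config like this are normally resolved from TrainingArguments when the JSON is passed via the deepspeed argument; a minimal sketch (file name and values are placeholders):

from transformers import TrainingArguments

# The "auto" fields (batch sizes, fp16/bf16, bucket sizes, ...) are filled in
# from these values by the transformers/DeepSpeed integration when the Trainer
# is set up.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    bf16=True,
    deepspeed="ds_zero3_config.json",  # placeholder path to the JSON above
)

The training script is then typically launched with the DeepSpeed launcher, e.g. deepspeed --num_gpus=2 train.py (shown only as an example command).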

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Apr 6, 2024
amyeroberts reopened this Apr 8, 2024
amyeroberts reopened this Apr 17, 2024
amyeroberts (Collaborator) commented:

Gentle ping @pacman100 as this looks possibly deepspeed related

amyeroberts (Collaborator) commented:

cc @SunMarc @muellerzr

amyeroberts (Collaborator) commented:

Gentle ping @SunMarc @muellerzr
