
It's an AlignModel or Deepspeed Zero3 bug. #28808

Open · 2 of 4 tasks · Tracked by #33345 · May be fixed by #33632
necrophagists opened this issue Feb 1, 2024 · 6 comments
Comments

necrophagists commented Feb 1, 2024

System Info

When I try to load the AlignModel weights from a local path and train with DeepSpeed ZeRO-3, I get the following error:

 File "/opt/licy/MyVLM/model/builder.py", line 152, in load_model
    model =AlignModel.from_pretrained(self.args.vm_path)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3559, in _load_pretrained_model
    model.apply(model._initialize_weights)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 885, in apply
    fn(self)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1388, in _initialize_weights
    self._init_weights(module)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/align/modeling_align.py", line 1189, in _init_weights
    nn.init.xavier_uniform_(module.text_projection.weight)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/init.py", line 323, in xavier_uniform_
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/init.py", line 287, in _calculate_fan_in_and_fan_out
    raise ValueError("Fan in and fan out can not be computed for tensor with fewer than 2 dimensions")

Switching to ZeRO-2 doesn't produce the error; ConvNextModel and CLIPVisionModel also don't report an error when trained under ZeRO-3, so I'm thinking there may be a bug in AlignModel?
@amyeroberts @pacman100 @muellerz

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. model = AlignModel.from_pretrained(path)
2. Use ZeRO-3 to train the model.
3. Get the error about xavier_init (a minimal sketch of these steps follows).
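A minimal sketch of these steps (not the reporter's actual code; the checkpoint name and config file path are placeholders, and the script would normally be run under the deepspeed or torchrun launcher):

from transformers import AlignModel, TrainingArguments

# Hypothetical repro: creating TrainingArguments with a ZeRO-3 JSON before
# from_pretrained() activates is_deepspeed_zero3_enabled(), so model
# construction is wrapped in deepspeed.zero.Init().
training_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_zero3_config.json",  # placeholder path to the ZeRO-3 config shown below
)

# Under ZeRO-3, AlignModel._init_weights then calls nn.init.xavier_uniform_ on
# text_projection.weight, whose locally visible shape is 0, and the
# "fan in and fan out" ValueError from the traceback above is raised.
model = AlignModel.from_pretrained("kakaobrain/align-base")  # placeholder checkpoint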

Expected behavior

The expected behavior is to be able to load and train the model without errors.

ArthurZucker (Collaborator) commented:

Hey! Could you give a reproducible snippet? 🤗

necrophagists (Author) commented:

> Hey! Could you give a reproducible snippet? 🤗

Sorry, this is a company project so I can't share the relevant code. I recently traced the problem to this code in modeling_utils.py:

        if is_deepspeed_zero3_enabled():
            import deepspeed
            logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
            init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts

This should be the place the error comes from. In addition, I've noticed that the shape of module.text_projection.weight is 0 when the error is raised (normally it's 480×680). Can you give me some clues?
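For reference on the zero-sized shape: under deepspeed.zero.Init() each parameter is partitioned across ranks at construction time, so the locally visible tensor is empty until it is gathered, which is why xavier_uniform_ cannot compute fan-in/fan-out. The pattern DeepSpeed provides for touching full weights during initialization looks roughly like the sketch below (an illustration only, not the fix that went into transformers; assumes deepspeed is installed):

import deepspeed
import torch.nn as nn
from transformers.integrations import is_deepspeed_zero3_enabled

def init_align_text_projection(module):
    # Sketch: gather the ZeRO-3-partitioned weight so it has its full 2-D shape;
    # with modifier_rank=0, rank 0's in-place init is broadcast and the weight is
    # re-partitioned when the context exits.
    if is_deepspeed_zero3_enabled():
        with deepspeed.zero.GatheredParameters(module.text_projection.weight, modifier_rank=0):
            nn.init.xavier_uniform_(module.text_projection.weight)
    else:
        nn.init.xavier_uniform_(module.text_projection.weight)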

Here's my zero3 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
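For context, the "auto" entries in a config like this are normally resolved from TrainingArguments when the JSON is passed via the deepspeed argument; a minimal sketch (file name and values are placeholders):

from transformers import TrainingArguments

# The "auto" fields (batch sizes, fp16/bf16, bucket sizes, ...) are filled in
# from these values by the transformers/DeepSpeed integration when the Trainer
# is set up.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    bf16=True,
    deepspeed="ds_zero3_config.json",  # placeholder path to the JSON above
)

The training script is then typically launched with the DeepSpeed launcher, e.g. deepspeed --num_gpus=2 train.py (shown only as an example command).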

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Apr 6, 2024
amyeroberts reopened this Apr 8, 2024
amyeroberts reopened this Apr 17, 2024
amyeroberts (Collaborator) commented:

Gentle ping @pacman100 as this looks possibly deepspeed related

amyeroberts (Collaborator) commented:

cc @SunMarc @muellerzr

amyeroberts (Collaborator) commented:

Gentle ping @SunMarc @muellerzr
