RuntimeError: mat1 and mat2 must have the same dtype #207

jiang-gy · 2024-10-17T03:34:25Z

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug
使用V100训练，关闭Flash Attention，使用fp32的类型训练，出现以上问题
Reproduction

What command or script did you run?

 PYTHONPATH=./ sh hydit/train.sh --index-file dataset/porcelain/jsons/porcelain.json --infer-mode torch

Environment

sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.128
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.14.1+cu117

Error traceback
If applicable, paste the error trackback here.

Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 529, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 461, in main
    loss_dict = diffusion.training_losses(model=model, x_start=latents, model_kwargs=model_kwargs)
  File "/workspace/HunyuanDiT/hydit/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/workspace/HunyuanDiT/hydit/diffusion/gaussian_diffusion.py", line 551, in training_losses
    out_dict = model(x_t, t, **model_kwargs)
  File "/workspace/HunyuanDiT/hydit/diffusion/respace.py", line 144, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1568, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/HunyuanDiT/hydit/modules/models.py", line 341, in forward
    text_states_t5 = self.mlp_t5(text_states_t5.view(-1, c_t5)).float()
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda/ls/envs/HunyuanDiT/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: mat1 and mat2 must have the same dtype #207

RuntimeError: mat1 and mat2 must have the same dtype #207

jiang-gy commented Oct 17, 2024

RuntimeError: mat1 and mat2 must have the same dtype #207

RuntimeError: mat1 and mat2 must have the same dtype #207

Comments

jiang-gy commented Oct 17, 2024