
How should I use an HF model? For any batch_size it gives a CUDA out of memory error. #991

Open
puraminy opened this issue Mar 6, 2022 · 1 comment

Comments


puraminy commented Mar 6, 2022

I am trying to load a HuggingFace model, but whatever batch_size I give it, it throws a CUDA out-of-memory error.
I even used t5-small, and I didn't have this problem with much larger models.

The error occurs right after saving the first checkpoint.
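
For reference, the call path in the traceback goes through t5's HfPyTorchModel. Below is a minimal sketch of that usage, modelled on the GPU example in the t5 README; the argument values (model dir, sequence lengths, learning rate) are placeholders rather than my exact config, and some argument names may differ slightly:

```python
import functools

import torch
import transformers
import t5.models

# Minimal sketch of the HfPyTorchModel usage (values are placeholders).
device = torch.device("cuda")
model = t5.models.HfPyTorchModel("t5-small", "/tmp/hf_model/", device)

# finetune() in t5/models/hf_model.py forwards these kwargs to train()
# (see the traceback below); the OOM happens even for small batch_size.
model.train(
    mixture_or_task_name="_sup",
    steps=1000,
    save_steps=100,
    sequence_length={"inputs": 128, "targets": 128},
    split="train",
    batch_size=8,
    optimizer=functools.partial(transformers.AdamW, lr=1e-3),
)
```

The full log and traceback are below.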

2022-03-06 20:15:01,951:INFO:absl: Loading from /home/pouramini/t5logs/last-large/sup/HF810_lr_0.001_0/model-0.checkpoint
============================ Training =========================
2022-03-06 20:15:02,050:INFO:absl: Loading from /home/pouramini/pret/t5-small/model-0.checkpoint
2022-03-06 20:15:02,111:WARNING:absl: _sup is both a Task and a Mixture, returning Mixture
2022-03-06 20:15:02.112207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.114921: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.115217: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.115631: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-06 20:15:02.115902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.116195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.116464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.300687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301026: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6211 MB memory:  -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
2022-03-06 20:15:02,368:INFO:absl: Automatically caching small dataset in memory: '_sup:train'
2022-03-06 20:15:02,785:WARNING:absl: _sup is both a Task and a Mixture, returning Mixture
2022-03-06 20:15:03,057:INFO:absl: Saving checkpoint for step 0
Traceback (most recent call last):
  File "/home/pouramini/rainbow/bin/fine-tune.py", line 553, in <module>
    fine_tune()
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/pouramini/rainbow/bin/fine-tune.py", line 481, in fine_tune
    model.finetune(
  File "/home/pouramini/text-to-text-transfer-transformer/t5/models/hf_model.py", line 572, in finetune
    self.train(mixture_or_task_name, finetune_steps, **train_kwargs)
  File "/home/pouramini/text-to-text-transfer-transformer/t5/models/hf_model.py", line 341, in train
    outputs = self._model(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1612, in forward
    decoder_outputs = self.decoder(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1005, in forward
    layer_outputs = layer_module(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 666, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 581, in forward
    attention_output = self.EncDecAttention(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 518, in forward
    attn_output = unshape(torch.matmul(attn_weights, value_states))  # (batch_size, seq_length, dim)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.93 GiB total capacity; 936.32 MiB already allocated; 13.44 MiB free; 950.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2022-03-06 20:15:04.744503: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
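
One detail that stands out in the log: TensorFlow, which t5 uses here for the tf.data input pipeline, created a GPU device with 6211 MB (the gpu_device.cc line above), while PyTorch itself has reserved under 1 GiB and finds only 13.44 MiB free on the 7.93 GiB card. So the OOM may come from TensorFlow pre-allocating the GPU rather than from the batch size. A minimal sketch of two things worth trying before the model is constructed; these are standumented TensorFlow/PyTorch settings, but whether they fix this particular run is an assumption:

```python
import os

# Allocator hint quoted in the error message above; a documented
# PYTORCH_CUDA_ALLOC_CONF option (takes effect if set before the first
# CUDA allocation in PyTorch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import tensorflow as tf

# Keep the tf.data input pipeline off the GPU so TensorFlow does not
# reserve the ~6 GB reported in the log; must run before TF touches the GPU.
tf.config.set_visible_devices([], "GPU")

import torch

# Quick check of what PyTorch itself is holding around the failing step.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```

Setting `TF_FORCE_GPU_ALLOW_GROWTH=true` in the environment is an alternative way to stop TensorFlow from grabbing most of the card up front.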

yfqiu98 commented Apr 16, 2022

Any update on this issue? Thanks.
