How should I use an HF model? For any batch_size it gives a CUDA out-of-memory error.
I am trying to load a HuggingFace model, but whatever batch_size I give it, it throws a CUDA out-of-memory error!
I even used t5-small, and I didn't have this problem with much larger models.
It occurs right after saving the first checkpoint!
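For context, my script (rainbow's bin/fine-tune.py) calls model.finetune(), which wraps HfPyTorchModel.train(), as you can see in the traceback below. A minimal sketch of the equivalent direct call, following the HfPyTorchModel example from the t5 README (the task name, step counts, and sequence lengths here are placeholders, not my exact config):

```python
import functools

import t5.models
import torch
import transformers

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Wrap a Hugging Face T5 checkpoint so the t5 library can train it.
model = t5.models.HfPyTorchModel("t5-small", "/home/pouramini/t5logs/", device)

# Fine-tune; the OOM shown below is raised inside train(), right after
# the step-0 checkpoint is written.
model.train(
    mixture_or_task_name="_sup",  # my registered task/mixture (see the log)
    steps=1000,                   # placeholder
    save_steps=100,               # placeholder
    sequence_length={"inputs": 128, "targets": 128},  # placeholder lengths
    split="train",
    batch_size=2,                 # fails even with values this small
    optimizer=functools.partial(transformers.AdamW, lr=0.001),
)
```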
2022-03-06 20:15:01,951:INFO:absl: Loading from /home/pouramini/t5logs/last-large/sup/HF810_lr_0.001_0/model-0.checkpoint
============================ Training =========================
2022-03-06 20:15:02,050:INFO:absl: Loading from /home/pouramini/pret/t5-small/model-0.checkpoint
2022-03-06 20:15:02,111:WARNING:absl: _sup is both a Task and a Mixture, returning Mixture
2022-03-06 20:15:02.112207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.114921: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.115217: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.115631: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-06 20:15:02.115902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.116195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.116464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.300687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301026: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-06 20:15:02.301572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6211 MB memory: -> device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
2022-03-06 20:15:02,368:INFO:absl: Automatically caching small dataset in memory: '_sup:train'
2022-03-06 20:15:02,785:WARNING:absl: _sup is both a Task and a Mixture, returning Mixture
2022-03-06 20:15:03,057:INFO:absl: Saving checkpoint for step 0
Traceback (most recent call last):
  File "/home/pouramini/rainbow/bin/fine-tune.py", line 553, in <module>
    fine_tune()
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/pouramini/rainbow/bin/fine-tune.py", line 481, in fine_tune
    model.finetune(
  File "/home/pouramini/text-to-text-transfer-transformer/t5/models/hf_model.py", line 572, in finetune
    self.train(mixture_or_task_name, finetune_steps, **train_kwargs)
  File "/home/pouramini/text-to-text-transfer-transformer/t5/models/hf_model.py", line 341, in train
    outputs = self._model(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1612, in forward
    decoder_outputs = self.decoder(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1005, in forward
    layer_outputs = layer_module(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 666, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 581, in forward
    attention_output = self.EncDecAttention(
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pouramini/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 518, in forward
    attn_output = unshape(torch.matmul(attn_weights, value_states)) # (batch_size, seq_length, dim)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.93 GiB total capacity; 936.32 MiB already allocated; 13.44 MiB free; 950.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2022-03-06 20:15:04.744503: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
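Per the hint at the end of the RuntimeError, the caching allocator can be tuned through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of trying that (the 128 MB split size is an arbitrary value to experiment with, not a recommendation from the error message):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before torch initializes CUDA,
# so set it before importing torch (or export it in the shell before
# launching fine-tune.py). 128 MB is an arbitrary starting point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var on purpose)
```

That said, fragmentation may not be the real problem here: the gpu_device.cc line above shows TensorFlow creating a GPU device with 6211 MB on the same card, which may explain why PyTorch reports only "950.00 MiB reserved in total" out of 7.93 GiB.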