Hi, has the HF model been tested for training on CUDA? I'm getting OOM errors no matter how small the batch size is. I'm using a V100 32GB. A code snippet is attached below. I've profiled the individual steps of hf_model.train using nvidia-smi and narrowed the issue down to the point where the dataset is loaded: GPU memory spikes to fill the entire 32GB. Is all of the data being loaded onto the GPU? Is this supposed to happen? Is there a way to disable this behavior? The error message also supports this, as PyTorch only reserved 2.8GB for the model itself.
import t5.data.mixtures
import functools
import t5.models
import seqio
import torch
import tensorflow_datasets as tfds
from transformers import Adafactor
model = t5.models.HfPyTorchModel("google/t5-v1_1-base", "/tmp", torch.device("cuda"))
TaskRegistry = seqio.TaskRegistry
# Point each GLUE task at the TFDS 2.0.0 release of the dataset.
for b in tfds.text.glue.Glue.builder_configs.values():
    task = TaskRegistry.get("glue_%s_v002" % b.name)
    task.source._tfds_dataset._name = task.source._tfds_dataset._name.replace("1.0.0", "2.0.0")
model.train(
    "glue_v002_proportional",
    262144,  # total training steps
    5000,    # save a checkpoint every 5000 steps
    {"inputs": 512, "targets": 512},
    "train",
    16,      # batch size
    functools.partial(Adafactor, lr=1e-3, relative_step=False),
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 31.75 GiB total capacity;
2.71 GiB already allocated; 45.75 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
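For reference, here is a minimal sketch of the kind of check that shows the discrepancy, assuming the setup above; the log_cuda_memory helper and the points where it is called are just for illustration, not part of the t5/HF API:

import subprocess
import torch

def log_cuda_memory(tag):
    # PyTorch's own view of GPU memory (what it allocated/reserved itself).
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    # Total usage on the device as reported by the driver (includes non-PyTorch allocations).
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True).stdout.strip()
    print(f"[{tag}] torch allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB, nvidia-smi used={used}")

log_cuda_memory("after model init")
# model.train(...) as above; nvidia-smi shows the card filling up right after the
# dataset is loaded, while PyTorch's reserved figure stays around 2.8 GiB.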