Hi @ClementRomac,
This is not a real issue but a question; if you enable the Discussions section of the repo, it would fit better there.

I'm wondering what the optimal configurations are for running the experiments. I tried running `train_language_agent.py` with almost the same configuration as in `experiments/configs/multi-node_slurm_cluster_config.yaml` and `experiments/campaign/Mixed_training/GFlan-T5_large.slurm` on 8x A100 80GB GPUs, but it was slow (about 2 frames per second). When I try to adjust the configuration to improve speed, for example by increasing the mini-batch size, I run into CUDA out-of-memory errors or errors about all-NaN tensors (vanishing gradients, I guess?). So I would appreciate any hints on which configuration, on which hardware, yields results similar to the paper, and at what speed (frames per second).