We are going to use The Pile as our pre-training dataset. We need to get The Pile into SeqIO so it can be used properly with T5X, and pre-process/cache it for pre-training.
Action items
- Download The Pile somewhere accessible to our TPUs/pre-processing machine;
- Create a SeqIO dataset for The Pile;
- Create two options for pre-processing: for causal language modelling and for masked language modelling with denoising (like in T5);
- Run the pre-processing and caching (Adam Roberts can help on this).
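The two pre-processing objectives can be sketched in plain Python. This is a simplified stand-in for SeqIO's actual preprocessors: token lists stand in for tokenized `tf.data` examples, and the span sampler here is a greedy one rather than T5's segment-based sampler, so treat it as an illustration of the objectives, not the real pipeline.

```python
import random

# T5-style sentinel tokens used to mark masked spans.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def causal_lm_example(tokens):
    """Causal LM objective: no inputs; the model predicts the whole
    sequence left to right."""
    return {"inputs": [], "targets": list(tokens)}

def span_corruption_example(tokens, noise_density=0.15,
                            mean_span_len=3.0, seed=0):
    """T5-style denoising: replace random non-overlapping spans with
    sentinels; targets list each masked span after its sentinel."""
    rng = random.Random(seed)
    n = len(tokens)
    num_noise = max(1, round(n * noise_density))
    noise = [False] * n
    placed, attempts = 0, 0
    while placed < num_noise and attempts < 10 * n:
        attempts += 1
        length = max(1, min(num_noise - placed,
                            int(rng.expovariate(1.0 / mean_span_len)) + 1))
        start = rng.randrange(0, n - length + 1)
        if any(noise[start:start + length]):
            continue  # keep spans non-overlapping
        for i in range(start, start + length):
            noise[i] = True
        placed += length
    inputs, targets, sid, i = [], [], 0, 0
    while i < n:
        if noise[i]:
            sentinel = SENTINELS[sid]
            sid += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < n and noise[i]:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(SENTINELS[sid])  # closing sentinel, as in T5
    return {"inputs": inputs, "targets": targets}
```

A quick sanity check: interleaving the sentinel-marked spans from `targets` back into `inputs` should reproduce the original sequence, which is exactly the property the denoising objective trains the model to exploit.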
Note: we may not need The Pile in its entirety. We are targeting runs of ~30GT, so ~50-100GT is probably fine if that is easier/faster to prepare.
> Download The Pile somewhere accessible to our TPUs/pre-processing machine;
It should probably be in the bigscience GCS bucket.
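The copy into the bucket could look roughly like this. The `gs://bigscience/pile` path is a placeholder, not a confirmed bucket layout, and the local paths are illustrative:

```shell
# The Pile train split ships as 30 zstd-compressed jsonl shards
# (00.jsonl.zst through 29.jsonl.zst), plus val/test files.
gsutil -m cp -r ./pile/train gs://bigscience/pile/train
gsutil -m cp ./pile/val.jsonl.zst ./pile/test.jsonl.zst gs://bigscience/pile/
```

`-m` parallelizes the transfer across shards, which matters at this data size.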