
Prepare The Pile for use with T5X #1

Closed

slippylolo opened this issue Nov 4, 2021 · 3 comments
slippylolo commented Nov 4, 2021

Description

We are going to use The Pile as our pre-training dataset. We need to load The Pile into SeqIO so it can be used properly with T5X, and pre-process/cache it for pre-training.

Action items

  • Download The Pile somewhere accessible to our TPUs/pre-processing machine;
  • Create a SeqIO dataset for The Pile;
  • Create two options for pre-processing: for causal language modelling and for masked language modelling with denoising (like in T5);
  • Run the pre-processing and caching (Adam Roberts can help on this).

Note: we may not need The Pile in its entirety. We are targeting runs of ~30GT, so ~50-100GT is probably enough if that is easier/faster to prepare.
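The two pre-processing options above can be sketched in plain Python. This is a toy illustration of the input/target formats, not the T5/SeqIO implementation: real span corruption operates on token IDs and samples span positions and lengths randomly, while here the spans are passed in explicitly. The `<extra_id_N>` sentinel names follow T5's convention.

```python
def causal_lm_example(tokens):
    """Causal LM: the model predicts every token from its prefix,
    so inputs are empty and the full sequence is the target."""
    return {"inputs": [], "targets": list(tokens)}


def span_corruption_example(tokens, spans):
    """Masked LM with denoising (T5-style): each corrupted span is
    replaced by a sentinel in the inputs, and the targets list each
    sentinel followed by the tokens it replaced.

    `spans` is a list of (start, end) index pairs, assumed sorted and
    non-overlapping (a simplification for illustration)."""
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted prefix
        inputs.append(sentinel)              # mask the span with a sentinel
        targets.append(sentinel)             # targets echo the sentinel...
        targets.extend(tokens[start:end])    # ...then the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the trailing tokens
    return {"inputs": inputs, "targets": targets}


toks = ["The", "Pile", "is", "a", "large", "corpus"]
ex = span_corruption_example(toks, [(1, 2), (4, 5)])
# ex["inputs"]  -> ['The', '<extra_id_0>', 'is', 'a', '<extra_id_1>', 'corpus']
# ex["targets"] -> ['<extra_id_0>', 'Pile', '<extra_id_1>', 'large']
```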


craffel commented Nov 4, 2021

> Download The Pile somewhere accessible to our TPUs/pre-processing machine;

It should probably be in the bigscience GCS bucket.

> Create a SeqIO dataset for The Pile;
> Create two options for pre-processing: for causal language modelling and for masked language modelling with denoising (like in T5);

See https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L45 and https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L66 - you should only need to alter the source.
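Following the pattern of those linked c4 tasks, a registration might look like the sketch below. This is a hedged sketch, not a working pipeline: the task name, bucket path, and vocabulary location are placeholders, and it assumes The Pile has been flattened to one document per plain-text line (the released Pile is jsonlines, which would need an extra parsing preprocessor).

```python
import functools

import seqio
from t5.data import preprocessors

# Placeholder locations -- not real paths.
PILE_FILES = {"train": "gs://some-bucket/pile/train-*.txt"}
VOCAB_PATH = "gs://some-bucket/vocabs/sentencepiece.model"

vocab = seqio.SentencePieceVocabulary(VOCAB_PATH)
output_features = {
    "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
    "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
}

seqio.TaskRegistry.add(
    "the_pile_span_corruption",
    # One line per document; TextLineDataSource exposes each line
    # under the "text" key.
    source=seqio.TextLineDataSource(split_to_filepattern=PILE_FILES),
    preprocessors=[
        # Map the raw "text" field to "targets", as the c4 tasks do.
        functools.partial(
            preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"},
        ),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),  # marks where offline caching applies
        preprocessors.span_corruption,    # swap for a LM preprocessor for causal LM
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=output_features,
    metric_fns=[],
)
```

A causal-LM variant would register a second task that replaces `span_corruption` with a language-modelling preprocessor, which matches the "two options" in the action items.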


adarob commented Nov 4, 2021

I haven't run SeqIO caching on public Beam myself, but here is an example of someone who has: google/seqio#109

It would be great to improve the SeqIO documentation if similar issues come up.

@thomasw21 thomasw21 self-assigned this Nov 5, 2021
@thomasw21
Closing as we'll be using c4 instead. (Thanks @adarob for preprocessing it for us). Will re-open if we want to switch back.
