[ELECTRA/TF2] Option To Allow scripts/run_pretraining.sh To Use "wiki_only" #1319

Open
psharpe99 opened this issue Jun 30, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@psharpe99

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe.
The README shows that the datasets can be created from wiki-only data:
/workspace/electra/data/create_datasets_from_start.sh wiki_only
but when you then continue to pretrain using the README instruction
bash scripts/run_pretraining.sh
it complains about the file/directory not existing.
Looking at the run_pretraining.sh script, it has
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
which are preset to the books_wiki directory, with a comment saying they need to be changed manually for other datasets (e.g. wiki-only).
Changing these manually to the 'wikicorpus_en' directory allowed the pretraining to succeed, but ideally the script shouldn't need editing.

Describe the solution you'd like
It should be a simple change to add a command-line option for "wiki-only" to the run_pretraining script.
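
One possible shape for such an option is sketched below. This is only an illustration, not the script's current behaviour: the `DATASET_OPTION` variable and the `select_corpus_dir` helper are hypothetical names, and an environment variable is used here only to avoid assuming anything about the script's existing positional arguments. The two directory names are the ones mentioned above (`books_wiki_en_corpus` and `wikicorpus_en`).

```shell
# Hypothetical helper: map a dataset option to the tfrecord sub-directory name.
# Defaults to wiki_books so that the existing behaviour is unchanged.
select_corpus_dir() {
  case "${1:-wiki_books}" in
    wiki_books) echo "books_wiki_en_corpus" ;;
    wiki_only)  echo "wikicorpus_en" ;;
    *) echo "Unknown dataset option: $1" >&2; return 1 ;;
  esac
}

# Hypothetical option name; run_pretraining.sh has no such setting today.
DATASET_OPTION="${DATASET_OPTION:-wiki_books}"
CORPUS_DIR="$(select_corpus_dir "$DATASET_OPTION")" || exit 1

# Same path patterns as in the script, with the corpus directory factored out.
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/${CORPUS_DIR}/train/pretrain_data*"
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/${CORPUS_DIR}/train/pretrain_data*"
```

With something like this in place, a wiki-only run could be started as `DATASET_OPTION=wiki_only bash scripts/run_pretraining.sh`, while the plain invocation from the README would keep using the books_wiki data.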

Describe alternatives you've considered
Alternatively, the README should document that this script file needs to be edited when running from wiki-only data.

Additional context
none

@psharpe99 psharpe99 added the enhancement New feature or request label Jun 30, 2023