Is your feature request related to a problem? Please describe.
The README shows that the datasets can be created from wiki-only:
/workspace/electra/data/create_datasets_from_start.sh wiki_books
but when you then continue to pretrain using the README instruction
bash scripts/run_pretraining.sh
it complains about the file/directory not existing.
Looking at the run_pretraining.sh script, it has
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
which are preset to the books_wiki directory, with a comment saying they need to be changed (manually) for other datasets (e.g. wiki-only).
Changing these manually to the 'wikicorpus_en' directory allowed the pretraining to succeed, but the script ideally shouldn't need editing.
Describe the solution you'd like
It should be a simple change to add a command-line option for "wiki-only" to the run_pretraining script.
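A minimal sketch of what such an option might look like, not a tested patch against the real script. It assumes the dataset name can be passed as the first positional argument (the actual run_pretraining.sh may already use its positional parameters for something else), and the function name `set_dataset_paths` and the value `wiki_only` are hypothetical names chosen for illustration. The directory names are the ones mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive DATASET_P1/DATASET_P2 from a dataset name
# instead of editing the paths by hand.
set_dataset_paths() {
  local corpus_dir
  case "$1" in
    wiki_books) corpus_dir="books_wiki_en_corpus" ;;  # current default in the script
    wiki_only)  corpus_dir="wikicorpus_en" ;;         # directory that worked for wiki-only data
    *)
      echo "Unknown dataset '$1' (expected wiki_books or wiki_only)" >&2
      return 1
      ;;
  esac
  # The trailing * is kept literal here; it is a pattern for the training code to expand.
  DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/${corpus_dir}/train/pretrain_data*"
  DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/${corpus_dir}/train/pretrain_data*"
}

# Assumption: the dataset name is the first argument; defaults to the existing behaviour.
set_dataset_paths "${1:-wiki_books}" || exit 1
```

With something like this in place, the wiki-only case would be run as `bash scripts/run_pretraining.sh wiki_only`, and running with no argument would keep the current books_wiki paths.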
Describe alternatives you've considered
Alternatively, the README should document that this script file needs to be edited when running from wiki data only.
Additional context
none
Related to ELECTRA/TF2