[ELECTRA/TF2] Option To Allow scripts/run_pretraining.sh To Use "wiki_only" #1319

Open
@psharpe99

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe.
The README shows that the datasets can be created from Wikipedia data only:
/workspace/electra/data/create_datasets_from_start.sh wiki_only
but when you then continue to pretrain using the README instruction
bash scripts/run_pretraining.sh
the script fails, complaining that the dataset file/directory does not exist.
Looking at the run_pretraining.sh script, it has
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/books_wiki_en_corpus/train/pretrain_data*" # change this for other datasets
Both are preset to the books_wiki_en_corpus directory, and the comment says they must be changed (manually) for other datasets, e.g. wiki-only.
Manually changing both paths to the 'wikicorpus_en' directory allowed the pretraining to succeed, but ideally the script shouldn't need editing.

Describe the solution you'd like
It should be a simple change to add a command-line option for "wiki-only" to the run_pretraining.sh script.
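
As a rough sketch of what such an option might look like (not the script's actual interface): the snippet below picks the corpus directory from a hypothetical `DATASET_CHOICE` environment variable and builds the two dataset paths from it. The variable name, the `dataset_dir_for` helper, and the assumption that the wiki-only TFRecords sit under `wikicorpus_en/train/` in the same layout as the books+wiki ones are mine; only the two directory names and the path prefixes come from the script and from my manual edit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of run_pretraining.sh): map a dataset choice
# to the corpus directory name used inside the tfrecord paths.
dataset_dir_for() {
  case "$1" in
    wiki_books) echo "books_wiki_en_corpus" ;;
    wiki_only)  echo "wikicorpus_en" ;;
    *)
      echo "unknown dataset '$1' (expected wiki_books or wiki_only)" >&2
      return 1
      ;;
  esac
}

# DATASET_CHOICE is an assumed variable name; defaulting to wiki_books keeps
# the current behaviour when nothing is specified.
DATASET_CHOICE="${DATASET_CHOICE:-wiki_books}"
CORPUS_DIR="$(dataset_dir_for "$DATASET_CHOICE")" || exit 1

# Same path prefixes as the existing script, with only the corpus directory swapped.
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/${CORPUS_DIR}/train/pretrain_data*"
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/${CORPUS_DIR}/train/pretrain_data*"

echo "phase 1 dataset: ${DATASET_P1}"
echo "phase 2 dataset: ${DATASET_P2}"
```

With something like this in place, a wiki-only run could be started as `DATASET_CHOICE=wiki_only bash scripts/run_pretraining.sh`. An environment variable is used here only because it avoids guessing at the script's existing positional arguments; a proper flag would work just as well.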

Describe alternatives you've considered
Alternatively, the README should document that this script needs to be edited when pretraining on wiki-only data.

Additional context
none


Labels: enhancement (New feature or request)
