Describe the bug
As a proof-of-concept of the ELECTRA/TF2 process, I have downloaded the ~90 GB Wikipedia data.
This contains roughly 20 million pages of data.
I have extracted just the first 10,000 pages of text from the data, as my XML data input:
root@biber:/home/psharpe/DeepLearningExamples-master/TensorFlow2/LanguageModeling/ELECTRA/data/download/wikicorpus_en# ls -l
-rwxr-xr-x 1 psharpe users 290 Jun 30 09:08 shorten.sh
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:45 wikicorpus_en.xml
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:40 wikicorpus_en.xml.10000
-rw-r--r-- 1 nobody nogroup 0 Jun 29 14:48 wikicorpus_en.xml.bz2
-rw-r--r-- 1 nobody nogroup 94992294413 Jun 28 16:29 wikicorpus_en.xml.FULL
I have created the datasets from this reduced file; the create_datasets script itself runs much more quickly, since it now has only ~7,500 real pages of data (plus ~2,500 "related to" pages).
When I previously ran the run_pretraining.sh script on the datasets from the full data, using a single available A100 GPU, it reported an ETA of over 300h, which is understandable.
When I run the script on the datasets from the reduced data, it still reports an ETA of 300h.
I was expecting that a reduced dataset of 10,000 text pages instead of 20,000,000 pages would give a substantial reduction in the pre-training ETA, in the same way that it substantially reduced the time for the creation of the datasets.
It is unclear whether:
- this ETA is accurate despite the reduced dataset,
- there is some hard-coding in its calculation, perhaps based on figures for the full dataset, or
- there is a bug in the calculation.
To Reproduce
I followed the README instructions for create_dataset.sh to pull the full wiki-only data and unzip it.
I then used an AWK script to keep only the text of the first n pages, reducing the dataset to the files shown in the ls listing above.
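A sketch of the kind of filter used (not the exact shorten.sh; it assumes the standard <page>...</page> structure of the MediaWiki dump and does not re-close the root <mediawiki> element):

# Sketch only -- copies everything up to and including the 10,000th closing </page> tag, then stops.
awk '
  { print }
  /<\/page>/ { if (++pages == 10000) exit }
' wikicorpus_en.xml.FULL > wikicorpus_en.xml.10000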
I removed the previously created directories from the data directory:
rm -rf extracted formatted_one_article_per_line/ sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*
I re-ran create_datasets.sh to completion.
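A quick way to sanity-check that the shards really were regenerated from the reduced dump, using the directory names from the rm -rf step above:

# The regenerated shard directories should be a small fraction of the size produced by the full dump.
du -sh sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*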
I then ran the pretraining as:
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)
The output shows an Elapsed and ETA message every 15 minutes, but the ETA is still around 300h. While it drops by several hours with each 15-minute report, it appears to converge on a value that is still very high, at ~250h.
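If the ETA were derived from a fixed total-step budget (e.g. set in pretrain_config.sh) rather than from the number of examples in the shards, this is the behaviour I would expect to see. A minimal sketch of that kind of calculation, with made-up numbers (the real step count and step time would come from the config and the log):

# Hypothetical ETA arithmetic assuming a fixed step budget, independent of dataset size.
total_steps=10000     # illustrative step budget, not the real config value
current_step=500      # steps completed so far (made up)
sec_per_step=108      # measured seconds per step (made up)
remaining=$(( total_steps - current_step ))
eta_hours=$(( remaining * sec_per_step / 3600 ))
echo "ETA: ~${eta_hours}h"   # ~285h here; stays high no matter how few TFRecords exist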
Expected behavior
I expected that the pre-training ETA would be vastly reduced by the vastly reduced dataset. If there were a simple linear relationship between data size and ETA, I was expecting the 300h (= 18,000 minutes) to be reduced to roughly
18000 * 10000 / 20000000 = 9 (minutes)
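That is, the same linear scaling worked through in shell:

# Expected ETA if pre-training time scaled linearly with page count.
echo $(( 18000 * 10000 / 20000000 ))   # 18,000 min * 10,000 / 20,000,000 pages = 9 minutes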
Related to ELECTRA/TF2
Environment
Container as created by:
Logged output shows:
GPUs: