Description
Related to ELECTRA/TF2
Describe the bug
As a proof of concept of the ELECTRA/TF2 process, I downloaded the ~90 GB wiki data.
It contains roughly 20 million pages.
I extracted just the first 10,000 pages of text from it as my XML data input:
root@biber:/home/psharpe/DeepLearningExamples-master/TensorFlow2/LanguageModeling/ELECTRA/data/download/wikicorpus_en# ls -l
-rwxr-xr-x 1 psharpe users 290 Jun 30 09:08 shorten.sh
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:45 wikicorpus_en.xml
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:40 wikicorpus_en.xml.10000
-rw-r--r-- 1 nobody nogroup 0 Jun 29 14:48 wikicorpus_en.xml.bz2
-rw-r--r-- 1 nobody nogroup 94992294413 Jun 28 16:29 wikicorpus_en.xml.FULL
I created the datasets from this reduced file; the create_datasets script itself runs much faster, since it now has only ~7,500 pages of real article text (plus ~2,500 "related to" pages) to process.
When I previously ran the run_pretraining.sh script on the datasets built from the full data, using the single available A100 GPU, it reported an ETA of over 300h, which is understandable.
When I run the same script on the datasets built from the reduced data, it still reports an ETA of around 300h:
Elapsed: 0h 2m 5s, ETA: 348h55m57s,
Elapsed: 0h 6m13s, ETA: 292h12m18s,
Elapsed: 0h 7m42s, ETA: 539h26m14s,
Elapsed: 0h 9m36s, ETA: 185h21m42s,
Elapsed: 0h24m25s, ETA: 369h41m 5s,
Elapsed: 0h39m12s, ETA: 310h33m23s,
Elapsed: 0h54m 5s, ETA: 289h56m 9s,
Elapsed: 1h 8m53s, ETA: 278h55m 6s,
Elapsed: 1h23m40s, ETA: 272h 6m14s,
Elapsed: 1h38m29s, ETA: 267h28m 6s,
I was expecting that a reduced dataset of 10,000 text pages instead of 20,000,000 would give a substantial reduction in the pre-training ETA, in the same way that it substantially reduced the time to create the datasets.
It is unclear whether:
- this ETA is accurate despite the reduced dataset,
- its calculation is hard-coded around figures for the full dataset (see the sketch after this list), or
- there is a bug in the calculation.
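On the hard-coding possibility: one common pattern, and purely my assumption about what run_pretraining might be doing (I have not traced its code), is an ETA derived from a fixed total step count rather than from dataset size; a smaller dataset would then simply be repeated and the ETA would barely move. A minimal shell-arithmetic sketch of that pattern, with made-up numbers:

# Hypothetical step-based ETA: total_steps would come from the config, not the data.
total_steps=766000   # assumed fixed pre-training step count (illustrative only)
steps_done=1000      # optimizer steps completed so far (illustrative only)
elapsed_s=1465       # wall-clock seconds so far (illustrative only)
# remaining time = elapsed * (total - done) / done  -- dataset size never appears
eta_s=$(( elapsed_s * (total_steps - steps_done) / steps_done ))
printf 'ETA: %dh%dm%ds\n' $((eta_s/3600)) $((eta_s%3600/60)) $((eta_s%60))

With those numbers this prints an ETA of ~311h, in the same ballpark as the log lines above, no matter how small the dataset is.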
To Reproduce
I followed the README instructions for create_dataset.sh to pull the full wiki-only data and decompress it.
I then scripted AWK to keep only the text of the first n pages, reducing the dataset as shown in the ls listing above (sketched below).
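shorten.sh is not reproduced here; a minimal AWK sketch of the kind of truncation described (illustrative, not my exact script):

# Copy everything up to and including the 10,000th closing </page> tag.
awk '{ print } /<\/page>/ { if (++pages == 10000) exit }' \
    wikicorpus_en.xml.FULL > wikicorpus_en.xml.10000
# Note: this drops the closing </mediawiki> tag at the end of the file.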
I removed the previously created directories from the data directory:
rm -rf extracted formatted_one_article_per_line/ sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*
I reran create_datasets.sh to completion.
I then ran the pretraining as:
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)
The output shows an Elapsed/ETA message every 15 minutes, but the ETA is still around 300h. While it drops by several hours with each 15-minute report, it seems to be converging on a consistent value that is still very high, at ~250h.
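As a diagnostic, the hyperparameters emitted by the config function can be dumped and searched for a fixed step count (the exact flag name is my assumption, hence grepping broadly for anything step-like):

# Print the arguments dgxa100_1gpu_amp would pass to run_pretraining.sh,
# one per line, and look for a step count (flag name assumed, hence the grep).
source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp | tr ' ' '\n' | grep -i step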
Expected behavior
I expected that the pre-training ETA would be vastly reduced by the vastly reduced dataset. If there were a simple linear relationship between data size and ETA, then the 300h (= 18,000 minutes) should have been reduced to:
18000 * 10000 / 20000000 = 9 minutes
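The same back-of-the-envelope check in shell arithmetic:

echo $(( 18000 * 10000 / 20000000 ))   # expected ETA in minutes under linear scaling: 9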
Environment
Container as created by
bash scripts/docker/build.sh
bash scripts/docker/launch.sh
Logged output shows:
[1,0]<stdout>:DLL 2023-06-30 08:07:12.172547 - PARAMETER NVIDIA_TENSORFLOW_VERSION : 20.07-tf2 TENSORFLOW_VERSION : 2.2.0 CUBLAS_VERSION : 11.1.0.229 NCCL_VERSION : 2.7.6 CUDA_DRIVER_VERSION : 450.51.05 CUDNN_VERSION : 8.0.1.13 CUDA_VERSION : 11.0.194 NVIDIA_PIPELINE_ID : None NVIDIA_BUILD_ID : 14714731 NVIDIA_TF32_OVERRIDE : None
[1,0]<stderr>:2023-06-30 08:07:12.601403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]<stderr>:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s
[1,0]<stderr>:2023-06-30 08:07:12.968661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37416 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0)
GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:31:00.0 Off | 0 |
| N/A 38C P0 37W / 250W | 38668MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:4B:00.0 Off | 0 |
| N/A 37C P0 26W / 250W | 4MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1206417 C python3 38666MiB |
+-----------------------------------------------------------------------------+