Describe the bug
As a proof-of-concept of the ELECTRA/TF2 process, I have downloaded the ~90 GB Wikipedia data.
This contains roughly 20 million pages of data.
I have extracted just the first 10,000 pages of text from the data, as my XML data input:
root@biber:/home/psharpe/DeepLearningExamples-master/TensorFlow2/LanguageModeling/ELECTRA/data/download/wikicorpus_en# ls -l
-rwxr-xr-x 1 psharpe users 290 Jun 30 09:08 shorten.sh
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:45 wikicorpus_en.xml
-rw-r--r-- 1 nobody nogroup 391764374 Jun 30 09:40 wikicorpus_en.xml.10000
-rw-r--r-- 1 nobody nogroup 0 Jun 29 14:48 wikicorpus_en.xml.bz2
-rw-r--r-- 1 nobody nogroup 94992294413 Jun 28 16:29 wikicorpus_en.xml.FULL
I have created the datasets from this reduced file; the create_datasets script itself runs much more quickly, since it now has only ~7,500 real pages of data (plus ~2,500 "related to" pages).
When I previously ran the run_pretraining.sh script on the datasets from the full data, using a single available A100 GPU, it reported an ETA of over 300h, which is understandable.
When I run the script on the datasets from the reduced data, it still reports an ETA of 300h.
I was expecting that a reduced dataset of 10,000 text pages instead of 20,000,000 pages would give a substantial reduction in the pre-training ETA, in the same way that it substantially reduced the time for the creation of the datasets.
It is unclear whether:
- this ETA is accurate despite the reduced dataset,
- there is some hard-coding in its calculation, perhaps based on figures for the full dataset, or
- there is a bug in the calculation.
To Reproduce
I followed the README instructions for create_dataset.sh to pull the full wiki-only data and unzip it.
I then used an AWK script to keep only the text of the first n pages, reducing the dataset to the files shown in the ls listing above.
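A sketch of the kind of filter used (not the exact shorten.sh; it assumes the standard <page>...</page> structure of the MediaWiki dump and does not re-close the root <mediawiki> element):

# Sketch only -- copies everything up to and including the 10,000th closing </page> tag, then stops.
awk '
  { print }
  /<\/page>/ { if (++pages == 10000) exit }
' wikicorpus_en.xml.FULL > wikicorpus_en.xml.10000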
I removed the previously created directories from the data directory:
rm -rf extracted formatted_one_article_per_line/ sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*
I re-ran create_datasets.sh to completion.
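A quick way to sanity-check that the shards really were regenerated from the reduced dump, using the directory names from the rm -rf step above:

# The regenerated shard directories should be a small fraction of the size produced by the full dump.
du -sh sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*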
I then ran the pretraining as:
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)
The output shows an Elapsed and ETA message every 15 minutes, but the ETA is still around 300h. While it drops by several hours with each 15-minute report, it appears to converge on a value that is still very high, at ~250h.
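If the ETA were derived from a fixed total-step budget (e.g. set in pretrain_config.sh) rather than from the number of examples in the shards, this is the behaviour I would expect to see. A minimal sketch of that kind of calculation, with made-up numbers (the real step count and step time would come from the config and the log):

# Hypothetical ETA arithmetic assuming a fixed step budget, independent of dataset size.
total_steps=10000     # illustrative step budget, not the real config value
current_step=500      # steps completed so far (made up)
sec_per_step=108      # measured seconds per step (made up)
remaining=$(( total_steps - current_step ))
eta_hours=$(( remaining * sec_per_step / 3600 ))
echo "ETA: ~${eta_hours}h"   # ~285h here; stays high no matter how few TFRecords exist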
Expected behavior
I expected that the pre-training ETA would be vastly reduced by the vastly reduced dataset. If there were a simple linear relationship between data size and ETA, I was expecting the 300h (= 18,000 minutes) to be reduced to roughly
18000 * 10000 / 20000000 = 9 (minutes)
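That is, the same linear scaling worked through in shell:

# Expected ETA if pre-training time scaled linearly with page count.
echo $(( 18000 * 10000 / 20000000 ))   # 18,000 min * 10,000 / 20,000,000 pages = 9 minutes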
Related to ELECTRA/TF2
Environment
Container as created by:
Logged output shows:
GPUs: