[ELECTRA/TF2] ETA On Reduced Dataset Is Still High #1321

Open
@psharpe99

Description

Related to ELECTRA/TF2

Describe the bug
As a proof of concept of the ELECTRA/TF2 process, I have downloaded the ~90 GB wiki data, which contains about 20 million pages.
I have extracted just the first 10,000 pages of text from the data, as my XML data input:

root@biber:/home/psharpe/DeepLearningExamples-master/TensorFlow2/LanguageModeling/ELECTRA/data/download/wikicorpus_en# ls -l
-rwxr-xr-x 1 psharpe users           290 Jun 30 09:08 shorten.sh
-rw-r--r-- 1 nobody  nogroup   391764374 Jun 30 09:45 wikicorpus_en.xml
-rw-r--r-- 1 nobody  nogroup   391764374 Jun 30 09:40 wikicorpus_en.xml.10000
-rw-r--r-- 1 nobody  nogroup           0 Jun 29 14:48 wikicorpus_en.xml.bz2
-rw-r--r-- 1 nobody  nogroup 94992294413 Jun 28 16:29 wikicorpus_en.xml.FULL
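The truncation step can be sketched as follows (a minimal sketch assuming the dump has `<page>`…`</page>` markers one per line; the actual shorten.sh may differ in details, and the demo input below is made up for illustration):

```shell
# Build a tiny demo dump, then keep only the first N <page>...</page> blocks.
printf '<page>\nA\n</page>\n<page>\nB\n</page>\n<page>\nC\n</page>\n' > demo.xml

N=2
awk -v n="$N" '
  /<page>/   { inpage = 1 }          # start of a page element
  inpage     { print }               # emit lines while inside a page
  /<\/page>/ { inpage = 0            # end of a page element
               if (++count >= n) exit }
' demo.xml > demo.trimmed.xml

grep -c '<page>' demo.trimmed.xml    # prints 2
```

On the real dump the same awk would read wikicorpus_en.xml.FULL and write wikicorpus_en.xml.10000 with N=10000.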

I have created the datasets from this reduced file; the create_datasets script itself runs much faster, having just ~7,500 real pages of data (and ~2,500 "related to" pages).

When I previously ran the run_pretraining.sh script on the datasets from the full data, using a single available A100 GPU, it reported an ETA of over 300h, which is understandable.
When I run the script on the datasets from the reduced data, it still reports an ETA of around 300h:

Elapsed:  0h 2m 5s, ETA: 348h55m57s, 
Elapsed:  0h 6m13s, ETA: 292h12m18s, 
Elapsed:  0h 7m42s, ETA: 539h26m14s, 
Elapsed:  0h 9m36s, ETA: 185h21m42s, 
Elapsed:  0h24m25s, ETA: 369h41m 5s, 
Elapsed:  0h39m12s, ETA: 310h33m23s, 
Elapsed:  0h54m 5s, ETA: 289h56m 9s, 
Elapsed:  1h 8m53s, ETA: 278h55m 6s, 
Elapsed:  1h23m40s, ETA: 272h 6m14s, 
Elapsed:  1h38m29s, ETA: 267h28m 6s,

I was expecting that having a reduced dataset of 10,000 text pages instead of 20,000,000 pages would give a substantial reduction in the ETA for the pre-training, in the same way that it substantially reduced the time for the creation of the datasets.

It is unclear whether

  • this ETA is accurate despite the reduced dataset,
  • its calculation is hard-coded around figures for the full dataset, or
  • there is a bug in the calculation.
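One plausible explanation for the second bullet: if run_pretraining runs a fixed number of optimization steps taken from pretrain_config.sh (an assumption; step counts there would be independent of dataset size, with the input pipeline simply repeating the shards), then an ETA of the form remaining_steps × seconds_per_step would not shrink with a smaller dataset. A sketch of that calculation (total_steps here is made up for illustration, not the real config value):

```shell
# Hypothetical ETA of the shape the log suggests:
#   eta = (total_steps - step) * (elapsed / step)
# Note the dataset size appears nowhere in this formula.
total_steps=766000   # hypothetical; would come from the training config
step=500             # current optimization step
elapsed_s=360        # wall-clock seconds so far

eta_s=$(awk -v t="$total_steps" -v s="$step" -v e="$elapsed_s" \
        'BEGIN { printf "%d", (t - s) * e / s }')
echo "ETA: ${eta_s}s"    # prints ETA: 551160s
```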

To Reproduce
I followed the README instructions for create_dataset.sh to pull the full wiki-only data and to unzip it.
I then scripted AWK to read only the text of the first n pages, reducing the dataset as shown in the ls listing above.
I removed the created directories from the data directory
rm -rf extracted formatted_one_article_per_line/ sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*
I reran the create_datasets.sh to completion.

I then run the pretraining as
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)

The output shows an Elapsed/ETA message every 15 minutes, and the ETA is still around 300h. While it has dropped by several hours with each 15-minute output, it does seem to narrow in on a consistent value that is still very high, at ~250h.

Expected behavior
I expected the pre-training ETA to drop in line with the vastly reduced dataset. Assuming a simple linear relationship between data size and ETA, the 300h (= 18,000 minutes) should be reduced to
18000 * 10000 / 20000000 = 9 (minutes)
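That linear-scaling arithmetic can be checked quickly:

```shell
# Expected ETA under an assumed linear data-size/time relationship:
# 300 h = 18,000 min, scaled by 10,000 / 20,000,000 pages.
awk 'BEGIN { print 18000 * 10000 / 20000000 }'   # prints 9
```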

Environment
Container as created by

bash scripts/docker/build.sh
bash scripts/docker/launch.sh

Logged output shows:

[1,0]<stdout>:DLL 2023-06-30 08:07:12.172547 - PARAMETER NVIDIA_TENSORFLOW_VERSION : 20.07-tf2  TENSORFLOW_VERSION : 2.2.0  CUBLAS_VERSION : 11.1.0.229  NCCL_VERSION : 2.7.6  CUDA_DRIVER_VERSION : 450.51.05  CUDNN_VERSION : 8.0.1.13  CUDA_VERSION : 11.0.194  NVIDIA_PIPELINE_ID : None  NVIDIA_BUILD_ID : 14714731  NVIDIA_TF32_OVERRIDE : None 

[1,0]<stderr>:2023-06-30 08:07:12.601403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
[1,0]<stderr>:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]<stderr>:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s

[1,0]<stderr>:2023-06-30 08:07:12.968661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37416 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0)

GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   38C    P0    37W / 250W |  38668MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1206417      C   python3                         38666MiB |
+-----------------------------------------------------------------------------+

Labels: bug