
[ELECTRA/TF2] ETA On Reduced Dataset Is Still High #1321

Open · psharpe99 opened this issue Jun 30, 2023 · 0 comments
Labels: bug (Something isn't working)
Related to ELECTRA/TF2

Describe the bug
As a proof of concept of the ELECTRA/TF2 workflow, I downloaded the ~90 GB wiki data dump, which contains about 20 million pages.
I extracted just the first 10,000 pages of text from the data as my XML data input:

root@biber:/home/psharpe/DeepLearningExamples-master/TensorFlow2/LanguageModeling/ELECTRA/data/download/wikicorpus_en# ls -l
-rwxr-xr-x 1 psharpe users           290 Jun 30 09:08 shorten.sh
-rw-r--r-- 1 nobody  nogroup   391764374 Jun 30 09:45 wikicorpus_en.xml
-rw-r--r-- 1 nobody  nogroup   391764374 Jun 30 09:40 wikicorpus_en.xml.10000
-rw-r--r-- 1 nobody  nogroup           0 Jun 29 14:48 wikicorpus_en.xml.bz2
-rw-r--r-- 1 nobody  nogroup 94992294413 Jun 28 16:29 wikicorpus_en.xml.FULL

I created the datasets from this reduced file. The create_datasets script itself runs much faster, since it now processes only ~7,500 real pages of data (plus ~2,500 "related to" pages).

When I previously ran the run_pretraining.sh script on the datasets built from the full data, using a single available A100 GPU, it reported an ETA of over 300 h, which is understandable.
When I run the script on the datasets built from the reduced data, it still reports an ETA of around 300 h:

Elapsed:  0h 2m 5s, ETA: 348h55m57s, 
Elapsed:  0h 6m13s, ETA: 292h12m18s, 
Elapsed:  0h 7m42s, ETA: 539h26m14s, 
Elapsed:  0h 9m36s, ETA: 185h21m42s, 
Elapsed:  0h24m25s, ETA: 369h41m 5s, 
Elapsed:  0h39m12s, ETA: 310h33m23s, 
Elapsed:  0h54m 5s, ETA: 289h56m 9s, 
Elapsed:  1h 8m53s, ETA: 278h55m 6s, 
Elapsed:  1h23m40s, ETA: 272h 6m14s, 
Elapsed:  1h38m29s, ETA: 267h28m 6s,

I was expecting that a reduced dataset of 10,000 text pages instead of 20,000,000 pages would give a substantial reduction in the ETA for the pre-training, in the same way that it substantially reduced the time for the creation of the datasets.

It is unclear whether:

  • this ETA is accurate despite the reduced dataset,
  • the calculation hard-codes figures based on the full dataset (see the sketch after this list), or
  • there is a bug in the calculation.
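For context, if the ETA were driven by a fixed total step count rather than by the amount of data, shrinking the dataset would not move it. A minimal sketch under that assumption (the function and all numbers below are hypothetical, not taken from run_pretraining.py):

# Sketch only: how a step-based ETA behaves, assuming the trainer
# extrapolates from steps completed versus a fixed total step count,
# independent of dataset size (not verified against run_pretraining.py).
def eta_hours(elapsed_s: float, steps_done: int, total_steps: int) -> float:
    """Remaining time in hours, extrapolated from throughput so far."""
    return elapsed_s / steps_done * (total_steps - steps_done) / 3600

# Hypothetical numbers: 1,000 steps finished in the first hour.
print(eta_hours(elapsed_s=3600, steps_done=1000, total_steps=300_000))
# -> 299.0 h, regardless of how many pages are in the dataset: with fewer
#    pages, the same number of steps simply revisits the data more often.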

To Reproduce
I followed the README instructions for create_datasets.sh to pull the full wiki-only data and unzip it.
I then wrote an AWK script to keep only the text of the first n pages, reducing the dataset as in the ls listing above (a sketch of the idea follows).
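A Python equivalent of that truncation (illustrative only; the actual script was AWK, and this assumes the dump wraps each article in standard MediaWiki <page>...</page> tags):

# Illustrative sketch of the truncation; the original used AWK.
# Assumes each article is wrapped in <page> ... </page> tags.
N_PAGES = 10000

pages = 0
with open("wikicorpus_en.xml.FULL", encoding="utf-8") as src, \
        open("wikicorpus_en.xml", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
        if "</page>" in line:
            pages += 1
            if pages == N_PAGES:
                break
    dst.write("</mediawiki>\n")  # close the root element (assumed tag name)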
I removed the created directories from the data directory
rm -rf extracted formatted_one_article_per_line/ sharded_training_shards_2048_test_shards_2048_fraction_0.1/ tfrecord_lower_case_1_seq_len_*
I then reran create_datasets.sh to completion.

I then run the pretraining as
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)

The output shows an Elapsed and ETA message every 15 minutes, but the ETA stays around 300 h. Although it drops by several hours with each 15-minute report, it appears to converge on a consistent value that is still very high, at ~250 h.

Expected behavior
I expected the pre-training ETA to be vastly reduced by the vastly reduced dataset. If there were a simple linear relationship between data size and ETA, I would expect the 300 h (= 18,000 minutes) to be reduced to
18000 * 10000 / 20000000 = 9 (minutes)
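The same back-of-the-envelope scaling, written out:

# Linear scaling of the reported ETA with dataset size (back of the envelope).
full_eta_min = 300 * 60      # ~300 h on the full dataset, in minutes
full_pages = 20_000_000
reduced_pages = 10_000

print(full_eta_min * reduced_pages / full_pages)  # -> 9.0 minutes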

Environment
Container as created by

bash scripts/docker/build.sh
bash scripts/docker/launch.sh

Logged output shows:

[1,0]<stdout>:DLL 2023-06-30 08:07:12.172547 - PARAMETER NVIDIA_TENSORFLOW_VERSION : 20.07-tf2  TENSORFLOW_VERSION : 2.2.0  CUBLAS_VERSION : 11.1.0.229  NCCL_VERSION : 2.7.6  CUDA_DRIVER_VERSION : 450.51.05  CUDNN_VERSION : 8.0.1.13  CUDA_VERSION : 11.0.194  NVIDIA_PIPELINE_ID : None  NVIDIA_BUILD_ID : 14714731  NVIDIA_TF32_OVERRIDE : None 

[1,0]<stderr>:2023-06-30 08:07:12.601403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
[1,0]<stderr>:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]<stderr>:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s

[1,0]<stderr>:2023-06-30 08:07:12.968661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37416 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0)

GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   38C    P0    37W / 250W |  38668MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1206417      C   python3                         38666MiB |
+-----------------------------------------------------------------------------+