[ELECTRA/TensorFlow2] The pretraining Issues A Lot Of Warning Messages #1327

Open
psharpe99 opened this issue Jul 10, 2023 · 0 comments
Labels: bug

Related to ELECTRA/TensorFlow2

Describe the bug
When running the pretraining as

root@biber:/workspace/electra# bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)

many warning messages are written to stderr. They may not cause actual failures, but they do not inspire confidence in the code.

[1,0]:WARNING:tensorflow:Layer activation is casting an input tensor from dtype float16 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because it's dtype defaults to floatx.
[1,0]:
[1,0]:If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.
[1,0]:
[1,0]:To change all layers to have dtype float16 by default, call tf.keras.backend.set_floatx('float16'). To change just this layer, pass dtype='float16' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
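
For context, the warning's own suggestions are the standard Keras mixed-precision knobs. Below is a minimal sketch of the global-policy route, assuming the experimental mixed-precision API that ships with the TF 2.2 container; it is generic Keras usage, not code taken from the ELECTRA scripts.

import tensorflow as tf

# Make new Keras layers compute in float16 while keeping float32 variables, so a
# layer created without an explicit dtype no longer autocasts float16 inputs back
# to float32 (the behaviour the warning above describes).
policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)
print(policy.compute_dtype, policy.variable_dtype)  # float16 float32

# Alternatively, a single layer can be pinned, e.g.
# tf.keras.layers.Activation("relu", dtype="float16")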

[1,0]:/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
[1,0]: "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
[1,0]:WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:If using Keras pass *_constraint arguments to layers.
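
The IndexedSlices warning typically comes from a sparse gradient (e.g. from tf.gather or an embedding lookup) being densified by the optimizer or by gradient accumulation. A small, hypothetical illustration of where such a gradient arises (not code from this repository):

import tensorflow as tf

# The gradient of a gather with respect to the gathered table is an IndexedSlices,
# i.e. a row-indexed sparse gradient; converting it to a dense tensor is the
# operation the UserWarning above refers to when the dense shape is unknown.
table = tf.Variable(tf.random.normal([1000, 64]))  # stand-in for an embedding table
ids = tf.constant([3, 7, 7, 42])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.gather(table, ids))

grad = tape.gradient(loss, table)
print(type(grad).__name__)          # IndexedSlices
dense = tf.convert_to_tensor(grad)  # densification (harmless here, shape is known)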

[1,0]:2023-07-10 14:09:06.734098: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:95] Unknown compute capability (8, 0) .Defaulting to telling LLVM that we're compiling for sm_75
(the above message is repeated many times)

[1,0]:2023-07-10 14:09:45.673506: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1631] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
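
The XLA:CPU notice is informational; if XLA:CPU were actually wanted, the message itself names the environment variable to set before TensorFlow initializes. A sketch (whether enabling it helps this workload is untested here):

import os

# Must be set before TensorFlow is imported, e.g. at the top of run_pretraining.py
# or in the shell that launches it; the flag name is taken from the warning above.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"

import tensorflow as tf  # imported only after the flag is in place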

[1,0]:2023-07-10 14:11:49.818554: W ./tensorflow/compiler/xla/service/hlo_pass_fix.h:49] Unexpectedly high number of iterations in HLO passes, exiting fixed point loop.

To Reproduce
Run the pretraining command within the container as per the README instructions.
I am using a node with a single A100 GPU and a single V100 GPU, so I use the single-A100 AMP configuration from the pretraining config file.

Expected behavior
There should be no warnings issued.

Environment
The container as created by the README instructions.

root@biber:/workspace/electra# bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)
Container nvidia build = 14714731
Launch command: horovodrun -np 1 python3 /workspace/electra/run_pretraining.py --model_name=base --pretrain_tfrecords=/workspace/electra/data/tfrecord_lower_case_1_seq_len_128_random_seed_12345/wikicorpus_en/train/pretrain_data* --model_size=base --train_batch_size=176 --max_seq_length=128 --disc_weight=50.0 --generator_hidden_size=0.3333333 --num_train_steps=100 --num_warmup_steps=100 --save_checkpoints_steps=50 --learning_rate=6e-3 --optimizer=lamb --skip_adaptive --opt_beta_1=0.878 --opt_beta_2=0.974 --lr_decay_power=0.5 --seed=42 --amp --xla --gradient_accumulation_steps=384 --log_dir /workspace/electra/results
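
For scale: with --train_batch_size=176 and --gradient_accumulation_steps=384, the effective batch per optimizer step works out to 176 × 384 = 67,584 sequences, assuming the script multiplies the two values in the usual gradient-accumulation fashion; that is the very large batch regime the LAMB optimizer in this config is intended for.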

A node with a single A100 GPU and a single V100 GPU, although only the A100 is requested to be used:

[1,0]:DLL 2023-07-10 14:08:43.026273 - PARAMETER NVIDIA_TENSORFLOW_VERSION : 20.07-tf2 TENSORFLOW_VERSION : 2.2.0 CUBLAS_VERSION : 11.1.0.229 NCCL_VERSION : 2.7.6 CUDA_DRIVER_VERSION : 450.51.05 CUDNN_VERSION : 8.0.1.13 CUDA_VERSION : 11.0.194 NVIDIA_PIPELINE_ID : None NVIDIA_BUILD_ID : 14714731 NVIDIA_TF32_OVERRIDE : None
[1,0]:2023-07-10 14:08:43.130546: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
[1,0]:2023-07-10 14:08:43.195747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s
[1,0]:2023-07-10 14:08:43.195939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
[1,0]:pciBusID: 0000:4b:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
[1,0]:coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.77GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]:2023-07-10 14:08:43.460646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s

[1,0]:2023-07-10 14:08:43.464081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
[1,0]:2023-07-10 14:08:43.464119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
[1,0]:2023-07-10 14:08:43.801229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2023-07-10 14:08:43.801289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
[1,0]:2023-07-10 14:08:43.801298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
[1,0]:2023-07-10 14:08:43.804890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37416 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0)
[1,0]:Compute dtype: float16
[1,0]:Variable dtype: float32

psharpe99 added the bug label on Jul 10, 2023