[ELECTRA/TensorFlow2] The pretraining Issues A Lot Of Warning Messages #1327

Open
psharpe99 opened this issue Jul 10, 2023 · 0 comments
Labels: bug

Related to ELECTRA/TensorFlow2

Describe the bug
When running the pretraining as

root@biber:/workspace/electra# bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)

many warning messages are written to stderr. They may not cause actual failures, but they do not inspire confidence in the code.

[1,0]:WARNING:tensorflow:Layer activation is casting an input tensor from dtype float16 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because it's dtype defaults to floatx.
[1,0]:
[1,0]:If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.
[1,0]:
[1,0]:To change all layers to have dtype float16 by default, call tf.keras.backend.set_floatx('float16'). To change just this layer, pass dtype='float16' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
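
For context, the warning's own suggestions are the standard Keras mixed-precision knobs. Below is a minimal sketch of the global-policy route, assuming the experimental mixed-precision API that ships with the TF 2.2 container; it is generic Keras usage, not code taken from the ELECTRA scripts.

import tensorflow as tf

# Make new Keras layers compute in float16 while keeping float32 variables, so a
# layer created without an explicit dtype no longer autocasts float16 inputs back
# to float32 (the behaviour the warning above describes).
policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)
print(policy.compute_dtype, policy.variable_dtype)  # float16 float32

# Alternatively, a single layer can be pinned, e.g.
# tf.keras.layers.Activation("relu", dtype="float16")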

[1,0]:/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
[1,0]: "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
[1,0]:WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:If using Keras pass *_constraint arguments to layers.
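
The IndexedSlices warning typically comes from a sparse gradient (e.g. from tf.gather or an embedding lookup) being densified by the optimizer or by gradient accumulation. A small, hypothetical illustration of where such a gradient arises (not code from this repository):

import tensorflow as tf

# The gradient of a gather with respect to the gathered table is an IndexedSlices,
# i.e. a row-indexed sparse gradient; converting it to a dense tensor is the
# operation the UserWarning above refers to when the dense shape is unknown.
table = tf.Variable(tf.random.normal([1000, 64]))  # stand-in for an embedding table
ids = tf.constant([3, 7, 7, 42])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.gather(table, ids))

grad = tape.gradient(loss, table)
print(type(grad).__name__)          # IndexedSlices
dense = tf.convert_to_tensor(grad)  # densification (harmless here, shape is known)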

[1,0]:2023-07-10 14:09:06.734098: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:95] Unknown compute capability (8, 0) .Defaulting to telling LLVM that we're compiling for sm_75
(the above message is repeated many times)

[1,0]:2023-07-10 14:09:45.673506: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1631] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
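
The XLA:CPU notice is informational; if XLA:CPU were actually wanted, the message itself names the environment variable to set before TensorFlow initializes. A sketch (whether enabling it helps this workload is untested here):

import os

# Must be set before TensorFlow is imported, e.g. at the top of run_pretraining.py
# or in the shell that launches it; the flag name is taken from the warning above.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"

import tensorflow as tf  # imported only after the flag is in place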

[1,0]:2023-07-10 14:11:49.818554: W ./tensorflow/compiler/xla/service/hlo_pass_fix.h:49] Unexpectedly high number of iterations in HLO passes, exiting fixed point loop.

To Reproduce
Run the pretraining command within the container as per the README instructions.
I am using a node with a single A100 GPU and a single V100 GPU, so I use the single-A100 AMP configuration from the pretraining config file.

Expected behavior
There should be no warnings issued.

Environment
The container as created by the README instructions.

root@biber:/workspace/electra# bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100_1gpu_amp)
Container nvidia build = 14714731
Launch command: horovodrun -np 1 python3 /workspace/electra/run_pretraining.py --model_name=base --pretrain_tfrecords=/workspace/electra/data/tfrecord_lower_case_1_seq_len_128_random_seed_12345/wikicorpus_en/train/pretrain_data* --model_size=base --train_batch_size=176 --max_seq_length=128 --disc_weight=50.0 --generator_hidden_size=0.3333333 --num_train_steps=100 --num_warmup_steps=100 --save_checkpoints_steps=50 --learning_rate=6e-3 --optimizer=lamb --skip_adaptive --opt_beta_1=0.878 --opt_beta_2=0.974 --lr_decay_power=0.5 --seed=42 --amp --xla --gradient_accumulation_steps=384 --log_dir /workspace/electra/results
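
For scale: with --train_batch_size=176 and --gradient_accumulation_steps=384, the effective batch per optimizer step works out to 176 × 384 = 67,584 sequences, assuming the script multiplies the two values in the usual gradient-accumulation fashion; that is the very large batch regime the LAMB optimizer in this config is intended for.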

A node with a single A100 GPU and a single V100 GPU, although only the A100 is requested to be used:

[1,0]:DLL 2023-07-10 14:08:43.026273 - PARAMETER NVIDIA_TENSORFLOW_VERSION : 20.07-tf2 TENSORFLOW_VERSION : 2.2.0 CUBLAS_VERSION : 11.1.0.229 NCCL_VERSION : 2.7.6 CUDA_DRIVER_VERSION : 450.51.05 CUDNN_VERSION : 8.0.1.13 CUDA_VERSION : 11.0.194 NVIDIA_PIPELINE_ID : None NVIDIA_BUILD_ID : 14714731 NVIDIA_TF32_OVERRIDE : None
[1,0]:2023-07-10 14:08:43.130546: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
[1,0]:2023-07-10 14:08:43.195747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s
[1,0]:2023-07-10 14:08:43.195939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
[1,0]:pciBusID: 0000:4b:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
[1,0]:coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.77GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]:2023-07-10 14:08:43.460646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]:pciBusID: 0000:31:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
[1,0]:coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.39GiB deviceMemoryBandwidth: 1.41TiB/s

[1,0]:2023-07-10 14:08:43.464081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
[1,0]:2023-07-10 14:08:43.464119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
[1,0]:2023-07-10 14:08:43.801229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2023-07-10 14:08:43.801289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
[1,0]:2023-07-10 14:08:43.801298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
[1,0]:2023-07-10 14:08:43.804890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37416 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0)
[1,0]:Compute dtype: float16
[1,0]:Variable dtype: float32

psharpe99 added the bug label on Jul 10, 2023