Error while training with CuDNN arg set as True #57

Open
TLfERLS opened this issue May 29, 2019 · 0 comments
TLfERLS commented May 29, 2019

I tried to start training the model with the default configuration file for Quora, which has use_cudnn=true. Running SentenceMatchTrainer.py then fails with an unexpected error:

(tensorflowGPU) D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src>python SentenceMatchTrainer.py --config_path "../configs/quora.sample.config"
Loading the configuration from ../configs/quora.sample.config

{'train_path': '../data/quora/train.tsv', 'dev_path': '../data/quora/dev.tsv', 
'word_vec_path': '../data/quora/wordvec.txt', 'model_dir': 'quora_model', 'suffix': 'quora', 'fix_word_vec': True, 'isLower': True, 'max_sent_length': 50, 'max_char_per_word': 10, 
'with_char': True, 'char_emb_dim': 20, 'char_lstm_dim': 40, 'batch_size': 60, 'max_epochs': 20, 'dropout_rate': 0.1, 'learning_rate': 0.0005, 'optimize_type': 'adam', 'lambda_l2': 0.0,
 'grad_clipper': 10.0, 'context_layer_num': 1, 'context_lstm_dim': 100,
 'aggregation_layer_num': 1, 'aggregation_lstm_dim': 100, 'with_full_match': True, 'with_maxpool_match': False, 'with_max_attentive_match': False, 'with_attentive_match': True, 
'with_cosine': True, 'with_mp_cosine': True, 'cosine_MP_dim': 5, 'att_dim': 50, 'att_type': 'symmetric', 'highway_layer_num': 1, 
'with_highway': True, 'with_match_highway': True, 
'with_aggregation_highway': True, 'use_cudnn': True, 'with_moving_average': False}

Collecting words, chars and labels ...
Number of words: 104891
Number of chars: 1198
word_vocab shape is (106686, 300)
Number of labels: 2
Build SentenceMatchDataStream ...
Number of instances in trainDataStream: 384348
Number of batches in trainDataStream: 6406
Number of instances in devDataStream: 10000
Number of batches in devDataStream: 167
2019-05-30 00:41:22.120164: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-30 00:41:23.282409: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-05-30 00:41:23.289931: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-30 00:41:25.325066: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 00:41:25.329970: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929]      0
2019-05-30 00:41:25.332505: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0:   N
2019-05-30 00:41:25.337204: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4740 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1305, in _run_fn
    self._extend_graph()
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SentenceMatchTrainer.py", line 257, in <module>
    main(FLAGS)
  File "SentenceMatchTrainer.py", line 191, in main
    sess.run(initializer)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]

Caused by op 'Model/global_norm/L2Loss_38', defined at:
  File "SentenceMatchTrainer.py", line 257, in <module>
    main(FLAGS)
  File "SentenceMatchTrainer.py", line 175, in main
    is_training=True, options=FLAGS, global_step=global_step)
  File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 10, in __init__
    self.create_model_graph(num_classes, word_vocab, char_vocab, is_training, global_step=global_step)
  File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 175, in create_model_graph
    grads, _ = tf.clip_by_global_norm(grads, self.options.grad_clipper)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 240, in clip_by_global_norm
    use_norm = global_norm(t_list, name)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 179, in global_norm
    half_squared_norms.append(gen_nn_ops.l2_loss(v))
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4679, in l2_loss
    "L2Loss", t=t, name=name)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
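
To make the "Colocation Debug Info" block above easier to read, here is a minimal sketch that pulls out the op types and the device types listed next to each of them. The helper name `parse_colocation_types` is hypothetical, and treating an empty list as "no device type was reported for this op" is only my reading of the log, not something confirmed by the TensorFlow documentation.

```python
import re

HEADER = "Colocation group had the following types and devices:"


def parse_colocation_types(message):
    """Return {op_type: [device types]} parsed from a colocation debug block."""
    devices_by_type = {}
    in_section = False
    for line in message.splitlines():
        stripped = line.strip()
        if stripped == HEADER:
            in_section = True
            continue
        if in_section:
            # The section ends at the first blank or non "Type: devices" line.
            match = re.match(r"^(\w+):\s*(.*)$", stripped)
            if match is None:
                break
            op_type, devices = match.groups()
            devices_by_type[op_type] = devices.split()
    return devices_by_type


# Excerpt copied from the error message in this report.
excerpt = """Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
"""

types = parse_colocation_types(excerpt)
# Op types with no device type listed next to them.
unsupported = [op for op, devs in types.items() if not devs]
print(types)
print(unsupported)
```

On this excerpt the only op type with an empty device list is L2Loss, which is the node created by tf.clip_by_global_norm in the traceback.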

When I set use_cudnn to false, training starts without any problems and still runs on the GPU. I understand from the code that use_cudnn=true enables CudnnLSTM, so the issue may be caused by the OS or the TensorFlow version. The details of my environment are:
OS: Windows 10
Python: 3.6.8
TensorFlow (GPU) version: 1.8
GPU: GTX 1060 6 GB

Can you tell where the problem lies? In the meantime, I'll try running the program with the default configs on an Ubuntu machine and see what happens. Thanks!
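
For anyone who wants to reproduce the use_cudnn=false run without editing the sample config by hand, this is a rough sketch of writing a modified copy. It assumes the config file is plain JSON, which I have not verified here. The function name `write_config_without_cudnn` and the output file name are hypothetical, and the demo uses a small stand-in file rather than the real quora.sample.config.

```python
import json
import os
import tempfile


def write_config_without_cudnn(src_path, dst_path):
    """Copy a JSON config to dst_path with use_cudnn forced to False."""
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    config["use_cudnn"] = False  # the only setting changed
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config


# Demo on a stand-in file with two of the keys printed in the log above.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "quora.sample.config")
    dst = os.path.join(tmp, "quora.nocudnn.config")  # hypothetical name
    with open(src, "w", encoding="utf-8") as f:
        json.dump({"use_cudnn": True, "batch_size": 60}, f)

    updated = write_config_without_cudnn(src, dst)
    with open(dst, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded)
```

The copy can then be passed to SentenceMatchTrainer.py through --config_path in place of the original file.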

TLfERLS changed the title from "Error while training while training with CuDNN arg set as True" to "Error while training with CuDNN arg set as True" on May 29, 2019