train.py CUDA_ERROR_NO_BINARY_FOR_GPU #26

Open
quizz0n opened this issue May 4, 2021 · 13 comments
Labels
docker (related to docker use), help wanted (Extra attention is needed)

Comments

@quizz0n

quizz0n commented May 4, 2021

Hi @remicres,

So when running the training with the docker image otbtf/gpu:2.4, after the TensorFlow libraries are opened successfully, I receive this error:

2021-05-04 14:53:04.337410: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
Traceback (most recent call last):
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
         [[{{node Abs_2}}]]
         [[Mean_24/_343]]
  (1) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
         [[{{node Abs_2}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sr4rs/code/train.py", line 307, in <module>
    tf.compat.v1.app.run(main)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/opt/otbtf/lib/python3/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/opt/otbtf/lib/python3/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "sr4rs/code/train.py", line 286, in main
    _do(train_op, merged_losses_summaries, "training")
  File "sr4rs/code/train.py", line 271, in _do
    _, _summary = sess.run([_train_op, _summary_op])
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
         [[node Abs_2 (defined at sr4rs/code/train.py:135) ]]
         [[Mean_24/_343]]
  (1) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
         [[node Abs_2 (defined at sr4rs/code/train.py:135) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'Abs_2':
  File "sr4rs/code/train.py", line 307, in <module>
    tf.compat.v1.app.run(main)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/opt/otbtf/lib/python3/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/opt/otbtf/lib/python3/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "sr4rs/code/train.py", line 135, in main
    gen_loss_l1 = tf.add_n([tf.reduce_mean(tf.abs(hr_images_fake[factor] -
  File "sr4rs/code/train.py", line 135, in <listcomp>
    gen_loss_l1 = tf.add_n([tf.reduce_mean(tf.abs(hr_images_fake[factor] -
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/ops/math_ops.py", line 401, in abs
    return gen_math_ops._abs(x, name=name)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/ops/gen_math_ops.py", line 55, in _abs
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/framework/ops.py", line 3528, in _create_op_internal
    ret = Operation(
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

2021-05-04 14:53:04.661530: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]

This is while running on RTX 3070 with CUDA 11.3.

Later edit: I believe this is due to different CUDA versions between the host and the docker image? 11.3 not being compatible with 11.0?
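The ptxas errors further down point the same way: the RTX 3070 is an Ampere GPU with compute capability 8.6 (sm_86), and a toolchain older than CUDA 11.1 cannot emit a binary for it. A minimal sketch of that version logic (the mapping is assumed from NVIDIA's published compute-capability / toolkit support tables, not exhaustive):

```python
# Sketch: can a CUDA toolkit version target a given GPU architecture?
# Values assumed from NVIDIA's compute-capability support tables.
SM_FIRST_SUPPORTED_IN = {
    "sm_75": (10, 0),  # Turing (e.g. RTX 20xx)
    "sm_80": (11, 0),  # Ampere datacenter (A100)
    "sm_86": (11, 1),  # Ampere GeForce (e.g. RTX 3070)
}

def toolkit_can_target(cuda_version, sm):
    """True if ptxas from the given CUDA toolkit (major, minor) knows `sm`."""
    return cuda_version >= SM_FIRST_SUPPORTED_IN[sm]

print(toolkit_can_target((11, 0), "sm_86"))  # False -> CUDA_ERROR_NO_BINARY_FOR_GPU
print(toolkit_can_target((11, 1), "sm_86"))  # True
```

So a host CUDA 11.3 is not the problem in itself; the image being built against CUDA 11.0 is.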

@remicres
Owner

remicres commented May 4, 2021

Hi @quizz0n ,
Can you tell me your OS and Docker version? I know that enabling GPU support with Docker differs depending on the version.
How did you start the docker image?
I haven't seen this problem before; it looks CUDA/Docker related...

@remicres added the "help wanted" label on May 4, 2021
@quizz0n
Author

quizz0n commented May 4, 2021

Yes, that's probably right.
I'm using Ubuntu (WSL 2) on Windows 10 OS Build 21370. Docker version 20.10.2, build 20.10.2-0ubuntu1~20.04.2.

Later edit: this is how I started the docker image:
docker run -ti -u root --entrypoint=/bin/bash --gpus all --env NVIDIA_DISABLE_REQUIRE=1 registry.gitlab.com/latelescop/docker/otbtf/gpu:2.4

@remicres
Owner

remicres commented May 6, 2021

This is probably related to WSL2 + GPU + CUDA.

I am currently trying to put together bullet-proof guidelines for setting up OTBTF on Windows with GPU, but I am not very familiar with Windows.

What you could try is to rebuild the docker image on your computer.

@quizz0n
Author

quizz0n commented May 6, 2021

I've tried to run this just now on a clean Ubuntu 20.04 install (real OS, not WSL2), but the error is the same. I'm not very familiar with rebuilding a docker image but I will look into it. Basically, create a new docker image based on this one but with a different CUDA version?

Later edit: the error message on the Ubuntu 20.04 install:

2021-05-07 01:46:31.577211: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-07 01:46:32.524205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-07 01:46:32.527454: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-07 01:46:32.527633: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!

Applying this fix: https://stackoverflow.com/questions/38303974/tensorflow-running-error-with-cublas I get:

2021-05-07 01:51:09.087035: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-07 01:51:10.332892: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

2021-05-07 01:51:10.332988: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Unimplemented: /usr/local/cuda-11.0/bin/ptxas ptxas too old. Falling back to the driver to compile.
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-05-07 01:51:10.397086: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'
2021-05-07 01:51:13.225935: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-07 01:51:13.298065: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'
Traceback (most recent call last):
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
	 [[{{node Abs_2}}]]
	 [[Mean_24/_343]]
  (1) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
	 [[{{node Abs_2}}]]
0 successful operations.
0 derived errors ignored.

I tried replacing ptxas as described in tensorflow/tensorflow#45590; I get:

2021-05-07 02:10:25.756964: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-07 02:10:27.049464: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-07 02:10:27.439194: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-07 02:10:27.515972: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cwise_op_gpu_base.cc:88 : Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
2021-05-07 02:10:28.161538: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cwise_op_gpu_base.cc:88 : Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
2021-05-07 02:10:28.919060: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cwise_op_gpu_base.cc:88 : Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
2021-05-07 02:10:29.299647: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 891.18M (934473728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.300056: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 802.06M (841026304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.300491: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 721.86M (756923648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.300853: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 649.67M (681231360 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.301297: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 584.71M (613108224 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.301713: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 526.23M (551797504 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 02:10:29.302171: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 473.61M (496617728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/opt/otbtf/lib/python3/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
	 [[{{node Abs_2}}]]
	 [[Mean_24/_343]]
  (1) Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
	 [[{{node Abs_2}}]]
0 successful operations.
0 derived errors ignored.

@remicres
Owner

remicres commented May 8, 2021

I've tried to run this just now on a clean Ubuntu 20.04 install (real OS, not WSL2), but the error is the same. I'm not very familiar with rebuilding a docker image but I will look into it. Basically to create a new docker image based on this one but with different CUDA?

You should be able to build the docker image with a single command (see this). Maybe you will have to try different build options.

@quizz0n
Author

quizz0n commented May 11, 2021

Managed to build a new docker image and successfully trained the network. However, when running sr.py I get the following error:

2021-05-11 13:41:27.083035: I tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 1886711 microseconds.
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Source info :
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Receptive field  : [160, 160]
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Placeholder name : lr_input
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Output spacing ratio: 0.25
2021-05-11 13:41:27 (INFO) TensorflowModelServe: The TensorFlow model is used in fully convolutional mode
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Output field of expression: [512, 512]
2021-05-11 13:41:27 (INFO) TensorflowModelServe: Tiling disabled
2021-05-11 13:41:27 (WARNING): Streaming configuration through extended filename is used. Any previous streaming configuration (ram value, streaming mode ...) will be ignored.
2021-05-11 13:41:27 (INFO): File Sentinel-2_B4328_0.5m.tif will be written in 110 blocks of 512x512 pixels
Writing Sentinel-2_B4328_0.5m.tif?&gdal:co:COMPRESS=DEFLATE&streaming:type=tiled&streaming:sizemode=height&streaming:sizevalue=512...: 0% [                                                  ]2021-05-11 13:41:27.770868: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-11 13:41:28.572215: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-11 13:41:28.573738: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-11 13:41:28.573882: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
Traceback (most recent call last):
  File "sr4rs/code/sr.py", line 76, in <module>
    infer.ExecuteAndWriteOutput()
  File "/opt/otbtf/lib/otb/python/otbApplication.py", line 2321, in ExecuteAndWriteOutput
    return _otbApplication.Application_ExecuteAndWriteOutput(self)
RuntimeError: Exception thrown in otbApplication Application_ExecuteAndWriteOutput: /src/otb/otb/Modules/Remote/otbtf/include/otbTensorflowMultisourceModelBase.hxx:96:
itk::ERROR: TensorflowMultisourceModelFilter(0x27eb450): Can't run the tensorflow session !
Tensorflow error message:
Not found: 2 root error(s) found.
  (0) Not found: No algorithm worked!
	 [[{{node gen/encoder/conv1_9x9/Conv2D}}]]
	 [[output_64/_1075]]
  (1) Not found: No algorithm worked!
	 [[{{node gen/encoder/conv1_9x9/Conv2D}}]]
0 successful operations.
0 derived errors ignored.
OTB Filter debug message:
Output image buffered region: ImageRegion (0x7ffcdd412fc0)
  Dimension: 2
  Index: [0, 0]
  Size: [512, 512]

Input #0:
Requested region: ImageRegion (0x7ffcdd412ff0)
  Dimension: 2
  Index: [0, 0]
  Size: [160, 160]

Tensor shape ("lr_input": {1, 160, 160, 3}
User placeholders:

@remicres
Owner

Looks like the error is from

2021-05-11 13:41:28.573738: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

Strange that you can train the network but not use it at inference time.
Did you try with a SavedModel you created, or with the pre-trained one?

@quizz0n
Author

quizz0n commented May 11, 2021

Tried with a SavedModel I created.
I've seen 2021-05-11 13:41:28.573738: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED and tried to fix it with (using the tf.compat.v1 API, since tf.ConfigProto/tf.Session no longer exist at the top level in TF 2.x):

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

but that generates another error and that's why I wasn't sure that's the issue.

[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:118] File already exists in database: tensorflow/core/profiler/profiler_service_monitor_result.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1379] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size): 
Traceback (most recent call last):
  File "sr4rs/code/sr.py", line 63, in <module>
    infer = otbApplication.Registry.CreateApplication("TensorflowModelServe")
  File "/opt/otbtf/lib/otb/python/otbApplication.py", line 3544, in CreateApplication
    application = _otbApplication.Registry_CreateApplicationWithoutLogger(name)
RuntimeError: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):

@remicres
Owner

The last error reminds me of this issue in OTBTF.
It happens when you try to import both otbApplication and tensorflow in the same Python code. It is a known limitation of OTBTF.

However, the failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED looks really CUDA-related

@quizz0n
Author

quizz0n commented May 11, 2021

The last error reminds me this issue in OTBTF.
It happens when you try to import both otbApplication and tensorflow in the same python code. It is a current known limitation in OTBTF.

Indeed, that looks like it's the issue, as I'm importing tensorflow to fix failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED.
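If the goal is only to get allow_growth behaviour, TensorFlow also reads it from an environment variable, which avoids importing tensorflow next to otbApplication entirely. A sketch (TF_FORCE_GPU_ALLOW_GROWTH is a real TensorFlow setting; the commented otbApplication usage is assumed from the traceback above):

```python
import os

# Enable TensorFlow's incremental GPU memory allocation (what
# gpu_options.allow_growth does) via the environment, so tensorflow itself
# never has to be imported alongside otbApplication -- sidestepping the
# protobuf "File already exists in database" clash.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import otbApplication
# infer = otbApplication.Registry.CreateApplication("TensorflowModelServe")
```

The variable must be set before the TensorFlow runtime is loaded (so before otbApplication creates the TF session).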

@quizz0n
Author

quizz0n commented May 18, 2021

I think we can close this issue. The initial error Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device occurs when TF is not built for the specific GPU's compute capability. Rebuilding TF / a new docker image solves the problem.

@remicres
Owner

Thanks. Do you know which parameter(s) you changed?

@quizz0n
Author

quizz0n commented May 18, 2021

For the docker build, change the CUDA version in the BASE_IMG arg; for the TF build, add/change TF_CUDA_COMPUTE_CAPABILITIES in the build-env-tf.sh environment variables to match the specific GPU.
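Concretely, the rebuild might look like the sketch below. The BASE_IMG value and the exact capability list are illustrative assumptions (check the OTBTF build docs for the supported arguments); the key point is that a CUDA >= 11.1 base is needed and 8.6 must appear in the list for an RTX 3070:

```shell
# In build-env-tf.sh, make sure the Ampere GeForce capability is listed:
export TF_CUDA_COMPUTE_CAPABILITIES="6.1,7.0,7.5,8.6"

# Then rebuild the image from a CUDA >= 11.1 base (example tag, verify it exists):
docker build \
  --build-arg BASE_IMG=nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 \
  -t otbtf:gpu-sm86 .
```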

@remicres added the "docker" label on Dec 1, 2021