Intel Extension for Tensorflow reports missing CUDA drivers and fails to use ARC GPU #61
Can you please try the driver version suggested here: https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/install_for_xpu.md#install-gpu-drivers
Hi @srinarayan-srikanthan, this error is produced even when I use the driver suggested on that page, following the installation instructions at https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md. Is there another procedure I should use to install the recommended driver version?
Hi @djsv23, yes, you are looking at the right page, but the version you have installed is stable_775_20_20231219 instead of stable_736_25_2023103. The instructions on that page specify the versions, but the output of your env_check script is not showing the right versions. Can you please check on that?
Would you mind adding a link to the instructions page where this version list is found? I'm having a hard time working out which instructions are current, since https://dgpu-docs.intel.com/driver/client/overview.html does not list them and is the only driver installation page I can find without a deprecation notice. I do see this pinned version list in the Ubuntu 22.04 for WSL instructions in this repository, but the same instructions are not given for Ubuntu 22.04 on bare metal, which is how my environment is configured. That said, I have installed the mentioned versions and still see output from env_check.sh suggesting that required drivers are missing:

```
======================== Check Python ========================
python3.10 is installed.
==================== Check Python Passed =====================
========================== Check OS ==========================
OS ubuntu:22.04 is Supported.
====================== Check OS Passed =======================
====================== Check Tensorflow ======================
Tensorflow2.14 is installed.
================== Check Tensorflow Passed ===================
=================== Check Intel GPU Driver ===================
Intel(R) graphics runtime intel-level-zero-gpu-1.3.26918.50-736 is installed, but is not recommended .
=============== Check Intel GPU Driver Finshed ================
===================== Check Intel oneAPI =====================
Intel(R) oneAPI DPC++/C++ Compiler is installed.
================= Check Intel oneAPI Passed ==================
========================== Check Devices Availability ==========================
2024-02-07 12:30:08.568077: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable
====================== Check Devices Availability Passed =======================
```
So from your env_check.sh output I see "itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.", so it does load the backend. What is the error you are facing when you try to run the following command? With regard to the other question, you can find instructions here: https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md#native-linux-running-directly-on-hardware-1
The error I am getting is that, despite the GPU backend being loaded, it says the GPU will not be used and it cannot find a CUDA-capable device:

```
2024-02-08 09:57:04.630925: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-08 09:57:14.392825: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
```
Okay, can you please post the complete output of the import command?

```
python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"
```
Okay, I see the GPU backend being loaded and then failing. Can you paste the output of `conda list | grep tensorflow`?
(itex) user@host:~$ conda list | grep tensorflow |
Are these all the packages? Did you install tensorflow before installing intel-extension-for-tensorflow?
This is the entire output - at one point I think I had removed and reinstalled intel-extension-for-tensorflow in this conda environment. Should I try setting up a new one from scratch? |
Yes please try creating the environment from scratch following the instructions starting from here : https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md#2-install-tensorflow-via-pypi-wheel-in-linux |
OK. I've torn everything down, removed miniconda, and spun up a new Python 3.10 virtual environment, and I get this output now. It no longer fails to import the module, but it is still not using the GPU.

```
pip list | grep tensorflow
python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"
```
I see the GPU backend being loaded. Can you try `itex.get_backend()` after importing it, or list physical devices and check? It should now be using the GPU.
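A quick way to do the device check suggested above from Python (a minimal sketch, assuming intel-extension-for-tensorflow is installed so the plugin registers the `XPU` device type):

```python
import tensorflow as tf

# With the ITEX plugin installed, the Arc GPU registers under the "XPU" device type.
xpu_devices = tf.config.list_physical_devices('XPU')
print(xpu_devices)

# If the list is empty, fall back to printing everything TensorFlow can see.
if not xpu_devices:
    print(tf.config.list_physical_devices())
```

On a working setup the first print should show something like `PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')`.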
I checked the physical device and it is present (I am using the GPU for desktop video output, so no surprise there):

xpu-smi discovery

So then I tried generating an image with keras_cv as per the Intel tutorial at https://medium.com/intel-analytics-software/running-tensorflow-stable-diffusion-on-intel-arc-gpus-e6ff0d2b7549, and we get this output. I also get this in the logs on the Jupyter server:

2024-02-09 11:23:11.419551: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable

The end result is that it seems not to find the device and tries to proceed anyway. It reports 0 VRAM, finds that insufficient, and fails to generate an image.
It is able to detect the device and load it, because I see XPU being enabled: "2024-02-09 11:24:42.064980: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled."
@srinarayan-srikanthan I have done this; running the computer headless and accessing it remotely, intel_gpu_top shows the GPU on standby at 0% usage and a 0 MHz clock speed. I'm further convinced the GPU is not used because no image is generated: `images` is empty, so when we run `plt.imshow(images[0])` there is no output.
What is the output of xpu-smi ? |
It is the same as before. Is there another subcommand that would be helpful to see? xpu-smi discovery |
Here is a suggestion: can you try reducing the size of the image and running it? The tutorial you shared was for the Arc A770, which comes with 16 GB, whereas the A750 is equipped with 8 GB.
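As a rough back-of-envelope for why 8 GB can be tight for Stable Diffusion (the ~1B parameter count below is my assumption for the combined keras_cv pipeline, not a measured figure; the point is only the order of magnitude):

```python
# Illustrative sizes; the parameter count is an assumption, not a measurement.
params = 1_000_000_000   # ~1B parameters across text encoder + UNet + decoder
bytes_per_param = 4      # fp32

weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gb:.1f} GiB")  # ~3.7 GiB

# A 512x512 generation also needs several GiB of activations and workspace
# on top of the weights, which is why it can fit on a 16 GB A770 but
# overflow an 8 GB A750.
```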
@djsv23, we ran into another similar issue recently: intel/intel-extension-for-transformers#1276. Not sure if it helps. Just for your reference, could you please try the below in a terminal environment first? (After it works, then try jupyter-notebook.)
(itex214) yhu5@arc770-tce:
and let us know the screen output. thanks |
There are a number of issues I'm seeing with this set of instructions:
@yinghu5 Here is the output from the terminal before attempting the Jupyter notebook:
@srinarayan-srikanthan Running the instructions in the sample file gives again the same error of no CUDA-capable device detected and fails to generate any images: |
The error you are referring to, "2024-02-09 11:23:13.180820: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected", is normal behavior. The TensorFlow package has CUDA support by default, so it tries to launch CUDA. The reason the image is not being created could be a memory issue. Can you try running another workload and see if memory is the issue? Thank you for the suggestion; we will update the README with instructions for the patch file.
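If the benign CUDA probe messages are distracting, TensorFlow's standard `TF_CPP_MIN_LOG_LEVEL` environment variable can quiet them (a sketch; the variable must be set before `import tensorflow` runs anywhere in the process):

```python
import os

# "2" hides INFO and WARNING lines (e.g. "Could not find cuda drivers");
# "3" additionally hides ERROR lines such as the failed cuInit probe.
# Must be set before tensorflow is imported anywhere in the process.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
```

This only filters log output; it does not change which backend is used.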
@djsv23 thank you a lot for checking. Then the libstdc++ library in your environment is correct. Please change it back; the problem is not related to it. As Sri mentioned, how about another workload, like hello-world.py: just download the .py from https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted and run it. About the SD example, could you please also show the conda list and pip list, and check the Keras version? If it is > 3.0, please change to an older version like 2.14 and try again. Thank you!
I tried with a smaller image size, 256x256, and still get the same issue. When I run the text-to-image command it gives an error saying 0 MB of memory is available, and it can't identify the PCI bus where the Arc GPU is located. After the error is raised, the Python kernel dies and automatically restarts. It seems to me that this presents as insufficient memory because the program cannot access the card, its VRAM, or both.
> @djsv23 thank you a lot for checking. Then the libstdc++ library in your environment is correct. Please change back, the problem is not related to it.
I ran the HelloWorld script, which blew up with an error. The repository suggested running diagnostics.py from the oneAPI toolkit, so I installed that and ran it to get this report:
Seeing the one read access error, I ran again as root:
@srinarayan-srikanthan @yinghu5 This may be of interest to both of you: I did notice the mention of i915_gpu_info. When I go back through the driver installation instructions, there is an option to install the out-of-tree driver modules, including intel-i915-dkms among others. I was able to install the others, but the i915 DKMS package appears to be incompatible with the 6.5 kernel I am running. The rest of the documentation suggests that 6.x kernels do not require the out-of-tree drivers; is it possible that something in the out-of-tree package required by the Intel TensorFlow extension has not been upstreamed?
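A low-effort way to see whether an i915 module is actually loaded is to inspect sysfs (a sketch; this only confirms the module is present, not whether it is the in-tree or the DKMS build):

```python
import os

# The kernel exposes a directory per loaded module under /sys/module.
i915_loaded = os.path.isdir("/sys/module/i915")
print("i915 loaded:", i915_loaded)

# The module's srcversion file, when present, can help tell builds apart.
srcversion_path = "/sys/module/i915/srcversion"
if i915_loaded and os.path.exists(srcversion_path):
    with open(srcversion_path) as f:
        print("srcversion:", f.read().strip())
```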
@djsv23 thank you a lot! Then we are back to the driver again :) Before you go ahead and reinstall the driver, system, etc., let's check whether your A750 works or not. On the other hand, system BIOS configuration can have a significant impact on your GPU: https://www.intel.com/content/www/us/en/support/articles/000091128/graphics.html

- Above 4G Decoding -> Enabled

I'm not sure if this is the default on your system, but here is the reference.
Thanks. I am sure the GPU is working, as I have been gaming with it since day 1. Resizable BAR and Above 4G Decoding are enabled. I also had to disable the iGPU on my 7700X CPU in the BIOS to stop Proton for Steam from using it instead of the Arc card, so all display output is definitely using the Intel GPU.
Also, I have been doing video encoding with the card and see that FFmpeg is able to access the video engine and VRAM.
@djsv23 nice to know! How was the output on this machine? And `lspci -s 03:00.0 -vvv`?
and we also have
|
@djsv23 thank you for the test. Then your driver is almost the same as mine (but my system is an A770 with 16 GB memory); just i915 is not there, yet the second command shows your i915 works OK. From your SD run result, it seems the iteration happened but failed later. Have you gotten a chance to try another machine, or other inference code, like the one at https://github.com/intel/intel-extension-for-tensorflow/blob/main/examples/quick_example.md (the definitions below are restored from that linked example):

```python
import numpy as np
import tensorflow as tf

# Conv + ReLU activation + Bias
N = 1
num_channel = 3
input_width, input_height = (5, 5)
filter_width, filter_height = (2, 2)

x = np.random.rand(N, input_width, input_height, num_channel).astype(np.float32)
weight = np.random.rand(filter_width, filter_height, num_channel, num_channel).astype(np.float32)
bias = np.random.rand(num_channel).astype(np.float32)

conv = tf.nn.conv2d(x, weight, strides=[1, 1, 1, 1], padding='SAME')
activation = tf.nn.relu(conv)
result = tf.nn.bias_add(activation, bias)

print(result)
```

Remember to `conda activate` your environment first. Thanks
@yinghu5 Wild! I ran quick_example.py and it gave the expected output, though it still says there is 0 MB VRAM and the PCI bus ID is undefined:
I scaled up the parameters in the example to 9 channels and 30x the input and filter dimensions to create a heavier load, and ran it in a loop so I could observe GPU utilization; it seems this implementation is working.
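The heavier-load check described above can be sketched like this (the exact scaled-up shapes and loop count are my choices, picked only to keep the GPU busy long enough to watch utilization in intel_gpu_top or xpu-smi):

```python
import numpy as np
import tensorflow as tf

# Conv + ReLU + Bias from quick_example.md, scaled up (9 channels,
# 30x input and filter dimensions) and looped to create sustained load.
N, num_channel = 1, 9
input_width, input_height = 150, 150
filter_width, filter_height = 60, 60

x = np.random.rand(N, input_width, input_height, num_channel).astype(np.float32)
weight = np.random.rand(filter_width, filter_height, num_channel, num_channel).astype(np.float32)
bias = np.random.rand(num_channel).astype(np.float32)

for _ in range(5):  # raise the count for a longer-running load
    conv = tf.nn.conv2d(x, weight, strides=[1, 1, 1, 1], padding='SAME')
    activation = tf.nn.relu(conv)
    result = tf.nn.bias_add(activation, bias)

print(result.shape)  # (1, 150, 150, 9)
```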
It seems there is still some issue with my i915 driver that is not allowing all GPU information to be accessible, and some programs handle that more gracefully than others.
@djsv23 Good that you were able to get it working. The 0 MB VRAM you are seeing is because when TensorFlow detects a device but cannot identify its memory, it defaults to 0; it is not an issue. And going by your observation from running a heavier load, the failure to run Stable Diffusion is just the memory bottleneck of the 8 GB card.
Thanks all for the help. It seems that 8 GB of VRAM is quite limiting for AI image generation and might require some offloading strategies. I was able to get a 128x128 image to generate, which unfortunately isn't enough to produce a meaningful image from the prompt, and which, from xpu-smi, appears to have required nearly 7 GB to process.
On Device: Intel Arc A750, operating system Ubuntu 22.04. Similar to #59, I've followed the installation procedure, and I've followed the instructions to ensure oneMKL is activated. Running env_check.sh still gives the error about not finding CUDA drivers:
```
Check Environment for Intel(R) Extension for TensorFlow*...
======================== Check Python ========================
python3.9 is installed.
==================== Check Python Passed =====================
========================== Check OS ==========================
OS ubuntu:22.04 is Supported.
====================== Check OS Passed =======================
====================== Check Tensorflow ======================
Tensorflow2.14 is installed.
================== Check Tensorflow Passed ===================
=================== Check Intel GPU Driver ===================
Intel(R) graphics runtime intel-level-zero-gpu-1.3.27191.42-775 is installed, but is not recommended .
Intel(R) graphics runtime intel-opencl-icd-23.35.27191.42-775 is installed, but is not recommended .
Intel(R) graphics runtime level-zero-1.14.0-744 is installed, but is not recommended .
Intel(R) graphics runtime libigc1-1.0.15136.24-775 is installed, but is not recommended .
Intel(R) graphics runtime libigdfcl1-1.0.15136.24-775 is installed, but is not recommended .
Intel(R) graphics runtime libigdgmm12-22.3.12-742 is installed, but is not recommended .
=============== Check Intel GPU Driver Finshed ================
===================== Check Intel oneAPI =====================
Intel(R) oneAPI DPC++/C++ Compiler is installed.
Intel(R) oneAPI Math Kernel Library is installed.
================= Check Intel oneAPI Passed ==================
========================== Check Devices Availability ==========================
2024-02-01 10:55:06.186663: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-02-01 10:55:06.188042: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-01 10:55:06.206690: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-01 10:55:06.206709: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-01 10:55:06.206733: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-01 10:55:06.211135: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-01 10:55:06.211256: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-01 10:55:06.688400: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-01 10:55:06.965341: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
2024-02-01 10:55:08.016040: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2024-02-01 10:55:08.090198: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-02-01 10:55:08.090492: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-02-01 10:55:08.563296: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
====================== Check Devices Availability Passed =======================
```