[Bug]: Stop breaking backwards compatibility or at least warn #1386

Open
danielzgtg opened this issue Dec 22, 2023 · 5 comments
@danielzgtg

Describe the bug

rocBLAS 5.6 fails with a confusing error message when mixed with ROCm 6.0 libraries or TensileLibrary.

To Reproduce

Precise version of rocBLAS installed or rocBLAS commit hash if building from source.
Steps to reproduce the behavior:

  1. Install ROCm 6.0
  2. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
  3. Install https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm
  4. Run https://www.llamaindex.ai/ or https://github.com/AUTOMATIC1111/stable-diffusion-webui

Expected behavior

I should not have to spend an hour debugging this, and only find the problem using gdb. rocBLAS 5.6 should either succeed or give a clear error message when loading the TensileLibrary from rocBLAS 6.0 or when loaded while mixed in with ROCm shared libraries.

Log-files

$ ./main.py
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/configuration_stablelm_epoch.py HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/configuration_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/modeling_stablelm_epoch.py HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/modeling_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/generation_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/generation_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:llama_index.readers.file.base:> [SimpleDirectoryReader] Total files added: 1
> [SimpleDirectoryReader] Total files added: 1
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: What I Worked On

February 2021

Before college...
> Adding chunk: What I Worked On

February 2021

Before college...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I couldn't have put this into words when I was ...
> Adding chunk: I couldn't have put this into words when I was ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So I looked around to see what I could salvage ...
> Adding chunk: So I looked around to see what I could salvage ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I didn't want to drop out of grad school, but h...
> Adding chunk: I didn't want to drop out of grad school, but h...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: We actually had one of those little stoves, fed...
> Adding chunk: We actually had one of those little stoves, fed...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: But Interleaf still had a few years to live yet...
> Adding chunk: But Interleaf still had a few years to live yet...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Painting students were supposed to express them...
> Adding chunk: Painting students were supposed to express them...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Meanwhile I'd been hearing more and more about ...
> Adding chunk: Meanwhile I'd been hearing more and more about ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: In return for that and doing the initial legal ...
> Adding chunk: In return for that and doing the initial legal ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Which meant being easy to use and inexpensive. ...
> Adding chunk: Which meant being easy to use and inexpensive. ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Nor had I changed my grad student lifestyle sig...
> Adding chunk: Nor had I changed my grad student lifestyle sig...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now when I walked past charming little restaura...
> Adding chunk: Now when I walked past charming little restaura...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: A lot of Lisp hackers dream of building a new L...
> Adding chunk: A lot of Lisp hackers dream of building a new L...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Over the next several years I wrote lots of ess...
> Adding chunk: Over the next several years I wrote lots of ess...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So we just made what seemed like the obvious ch...
> Adding chunk: So we just made what seemed like the obvious ch...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I don't think it was entirely luck that the fir...
> Adding chunk: I don't think it was entirely luck that the fir...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: YC was different from other kinds of work I've ...
> Adding chunk: YC was different from other kinds of work I've ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: For the rest of 2013 I left running YC more and...
> Adding chunk: For the rest of 2013 I left running YC more and...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now they are, though. Now you could continue us...
> Adding chunk: Now they are, though. Now you could continue us...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Notes

[1] My experience skipped a step in the ...
> Adding chunk: Notes

[1] My experience skipped a step in the ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Startups had once been much more expensive to s...
> Adding chunk: Startups had once been much more expensive to s...

rocBLAS error: Could not load /opt/rocm-6.0.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat

rocBLAS error: Could not initialize Tensile library
Aborted (core dumped)

Environment

Hardware description
CPU AMD Ryzen 9 5900X 12-Core Processor
GPU AMD Radeon RX 6650 XT
Software version
rocm-core 6.0.0.60000-91~22.04
rocblas 4.0.0.60000-91~22.04

environment.txt

Workaround

Recompile PyTorch manually. This ensures that it loads shared libraries from /opt instead of from the venv.
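To confirm which copy of a ROCm library a process actually loaded, one option on Linux is to read /proc/self/maps after importing the framework. The following is a minimal sketch under that assumption; the helper name `mapped_libraries` is hypothetical, and the `libroc` prefix is an assumption based on the file names mentioned in this thread.

```python
def mapped_libraries(maps_text, prefix="libroc"):
    """Return sorted, unique paths of mapped files whose basename starts with prefix.

    maps_text is the content of /proc/<pid>/maps (Linux only).
    """
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, and an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.rsplit("/", 1)[-1].startswith(prefix):
                paths.add(path)
    return sorted(paths)
```

Calling `mapped_libraries(open("/proc/self/maps").read())` after `import torch` should then show whether the mapped files live under /opt or under the venv.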

@mahmoodw
Contributor

Hello @danielzgtg,

Thank you for flagging the need for clearer error messages with ROCm and library version mismatches. Your feedback is vital in refining our library's usability.

Our team will investigate and refine the error notifications to offer guidance for resolving library version disparities. Additionally, we'll clarify any backward compatibility restrictions to assist users in navigating version conflicts more effectively.

We'll keep you updated on our progress as we work to enhance the error messages. Your patience and any additional insights during this process are immensely valuable.

Wasiq

@rkamd
Contributor

rkamd commented Jan 2, 2024

@danielzgtg,
Thanks for reporting the issue. Do you see the Tensile library files in that path?
The output of this command would help us debug further: find /opt/ -name "TensileLibrary_*.dat"

@ghost

ghost commented Jan 7, 2024

That explains it. I spent the last week troubleshooting why ROCm suddenly stopped working, and it turns out to be a backwards compatibility issue. Quite frustrating.

@rkamd
Contributor

rkamd commented Jan 15, 2024

@danielzgtg and @Trat8547,
We were able to run the sample rocBLAS program across the ROCm 5.6 and ROCm 6.0 releases, and internally we have not received any backward compatibility reports from the Frameworks team either.

Having said that, API breakage is generally expected when the major version changes (we follow semantic versioning). Reviewing the release notes, we see breaking changes in HIP, and the appropriate notification is published here.

Those changes could have contributed to the issue reported here.
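Under semantic versioning, only releases that share a major version are expected to be compatible, so the 5.6 to 6.0 jump falls on the incompatible side. A minimal sketch of that rule follows; `major` and `same_major` are hypothetical helpers for illustration and are not part of rocBLAS or ROCm.

```python
def major(version):
    """Return the leading integer component of a dotted version string."""
    return int(version.split(".", 1)[0])


def same_major(a, b):
    """True when two dotted version strings share a major version."""
    return major(a) == major(b)
```

For example, `same_major("5.6", "6.0")` is false, while `same_major("6.0.0", "6.0.2")` is true.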

@amcamd amcamd transferred this issue from ROCm/rocBLAS Jan 16, 2024
@amcamd amcamd transferred this issue from ROCm/ROCm Jan 16, 2024
@danielzgtg
Author

Here: TensorLibrary.txt. I think the TensileLibrary_*.dat files are fine, and the problem is with the (lack of) version detection in the code that reads them.

Your linked https://rocm.docs.amd.com/en/latest/about/release-notes.html#hip appears to only list API breaking changes. What my issue is about is ABI breaking changes.

The problem is that the PyTorch ROCm build bundles .so files that overlap with the system versions in /opt/. Perhaps deleting the libroc* files from venv/lib/python3.11/site-packages/torch/lib/ would force the correct versions (i.e. the system ones) to be used. My issues on the other AMD repo suggested that you fix this unnecessary bundling of shared libraries with PyTorch, but perhaps rocBLAS itself should detect the problem. I think glibc handles this properly and refuses to let the application run if the wrong version is used.
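A rough way to see which bundled libraries shadow system ones is to compare file names in the two directories. This is a sketch only: `library_stem` and `bundled_overlaps` are hypothetical helpers, the `libroc` prefix comes from the file names mentioned above, and matching by name alone is an assumption that misses bundled files that were renamed.

```python
import os
import re


def library_stem(filename):
    """Strip the .so suffix and any trailing version numbers.

    'librocblas.so.4.0' and 'librocblas.so' both become 'librocblas'.
    """
    return re.sub(r"\.so(\.\d+)*$", "", filename)


def bundled_overlaps(bundled_dir, system_dir, prefix="libroc"):
    """Return sorted stems of prefix-matching shared libraries found in both directories."""

    def stems(directory):
        if not os.path.isdir(directory):
            return set()
        return {
            library_stem(name)
            for name in os.listdir(directory)
            if name.startswith(prefix) and ".so" in name
        }

    return sorted(stems(bundled_dir) & stems(system_dir))
```

With the paths from this thread, the call would look like `bundled_overlaps("venv/lib/python3.11/site-packages/torch/lib", "/opt/rocm-6.0.0/lib")`.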

This is why rebuilding PyTorch worked around the problem. But I would rather not wait for the long PyTorch compile every time, and I also don't want the prepackaged PyTorch builds to contain the libroc*.so files, which not only inflate the download size to gigabytes or so but also cause version conflicts.
