-
Can you try running the following command and pasting the full output?
-
The issue still persists, unfortunately. The really annoying part is that upon inspecting the modules I find the import code, which should be working after pip install autoawq-kernels
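One way to check whether the kernels actually resolve in the interpreter you are running (a sketch; `awq_ext` is assumed here to be the extension module name the autoawq-kernels wheel installs, so adjust it if your version differs):

```python
import importlib.util


def kernel_modules_available(names=("awq_ext",)):
    """Return the subset of extension module names Python can find.

    If "awq_ext" (assumed name) is missing, the installed wheel likely
    does not match your torch/CUDA build, or it was installed into a
    different environment than the one running this script.
    """
    return [n for n in names if importlib.util.find_spec(n) is not None]


if __name__ == "__main__":
    print(kernel_modules_available())
```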
-
How long did you let it hang? Did you check whether it loaded onto your GPU? Do you have WSL2 installed?
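To see whether anything loaded onto the GPU while it hangs, you can watch `nvidia-smi`. A small sketch that parses the output of the standard used-memory query (`nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`):

```python
def parse_gpu_memory_used(smi_output: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output into a list of per-GPU used-memory values in MiB.

    If every value stays near zero while the script "runs", the model
    never made it onto the GPU.
    """
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]
```

Usage: pipe the command's output into this function (e.g. via `subprocess.check_output`) and poll it every few seconds.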
-
I do not have WSL2, that must be it! I let it hang for over an hour or so before writing here. Roughly how long is it supposed to run?
-
For this model, it should be 7-20 minutes, depending on how performant the setup is.
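As a rough back-of-the-envelope check: AWQ quantizes one decoder layer at a time, so total runtime scales roughly linearly with layer count (Llama 3 8B has 32 decoder layers; the per-layer time below is purely illustrative):

```python
def estimate_quant_minutes(num_layers: int, seconds_per_layer: float) -> float:
    """Rough ETA for layer-by-layer quantization.

    seconds_per_layer varies widely with GPU, calibration set size,
    and group size; time one layer from the progress bar and
    extrapolate from there.
    """
    return num_layers * seconds_per_layer / 60.0
```

For example, 30 seconds per layer over 32 layers gives about 16 minutes, squarely inside the 7-20 minute range above.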
-
Hey Casper, sorry for the noob questions ahead. I have managed to install WSL2 Ubuntu,
and to run python setup.py install, with the results as above. Before that, I installed the CUDA-compatible torch build alongside CUDA 11.5, as shown below.
But when I try to run the quantization code, it doesn't find the awq module:
output
I have been really stuck on this since yesterday and I feel like I am just inches away. Are you maybe aware of what the issue is here?
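A frequent cause of "module not found" right after an apparently successful install is that pip installed into a different interpreter than the one running the script, which is easy to do when switching between Windows and WSL2. A small diagnostic sketch:

```python
import importlib.util
import sys


def diagnose_module(name: str) -> str:
    """Report where the current interpreter would load `name` from,
    or that it cannot find it at all.

    Comparing sys.executable against the python used for `pip install`
    quickly reveals a mismatched-environment problem.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name!r} not found by {sys.executable}"
    return f"{name!r} -> {spec.origin}"


if __name__ == "__main__":
    print(diagnose_module("awq"))
```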
-
Hi @Endote, it looks like you installed the wrong kernels. You need to install https://github.com/casper-hansen/AutoAWQ_kernels
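For reference, one way to make sure the install targets the interpreter you actually run, rather than whichever `pip` is first on PATH (the git URL is the repo linked above; building it requires a matching CUDA toolkit):

```python
import subprocess
import sys


def pip_install_cmd(spec: str) -> list[str]:
    """Build a pip command bound to the current interpreter, so the
    package lands in the same environment the quantization script uses."""
    return [sys.executable, "-m", "pip", "install", spec]


# e.g.:
# subprocess.check_call(
#     pip_install_cmd("git+https://github.com/casper-hansen/AutoAWQ_kernels")
# )
```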
-
I have changed CUDA to 12.1 and installed autoawq-kernels from the repo.
Am I supposed to run the setup install script? If so, some of the files in the script do not have their paths adjusted, and I am now stuck on one file that is missing entirely. Can you please help me understand the next steps?
-
I am running into an issue while quantizing Llama 3 8B Instruct, with both the safetensors and GGUF files.
This is what I am getting:
It loads indefinitely, with no estimate, nothing.
The code I have been using:
For further reference, I am using a Windows 10 machine with an NVIDIA RTX 4080 and 64 GB of RAM.
Please help me find out what is wrong; otherwise I will have to look for other quantization methods :/
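For what it's worth, this is the shape of the quantization script from AutoAWQ's examples, sketched here with lazy imports so it can be read without the package installed; the model and output paths are placeholders:

```python
def build_quant_config(w_bit: int = 4, q_group_size: int = 128) -> dict:
    """The quant_config dict used in AutoAWQ's examples; the "GEMM"
    version is the one that relies on the autoawq-kernels package."""
    return {
        "zero_point": True,
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": "GEMM",
    }


def quantize(model_path: str, quant_path: str) -> None:
    # Imported lazily so this sketch can be inspected without awq installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=build_quant_config())
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)


if __name__ == "__main__":
    # Placeholder paths; AWQ quantizes the original (e.g. safetensors)
    # checkpoint, not a GGUF file.
    quantize("meta-llama/Meta-Llama-3-8B-Instruct", "llama-3-8b-instruct-awq")
```

Note that AWQ works from the original Hugging Face checkpoint; GGUF is llama.cpp's format and is not an input AutoAWQ consumes.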