How to quantize NVILA with awq? #174

Open · Kibry-spin opened this issue Dec 30, 2024 · 0 comments

```
(vila) kirdo@kirdo-System-Product-Name:~/LLM/llm-awq$ python -m awq.entry --model_path /home/kirdo/LLM/NVILA-8B-Video/ --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/kirdo/LLM/NVILA-8B-Video/
[2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████| 4/4 [00:03<00:00, 1.05it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
* Split into 59 blocks
Traceback (most recent call last):
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in <module>
    main()
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
    model, enc = build_model_and_enc(args.model_path)
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
    awq_results = run_awq(
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
    model.llm(samples.to(next(model.parameters()).device))
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
    outputs = self.model(
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```

I just followed https://github.com/mit-han-lab/llm-awq to quantize NVILA-8B-Video, but when running the AWQ search I hit the error shown above.
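
From the traceback, the calibration samples are moved to `cuda:0` while the Qwen2 model-level rotary embedding (and its `inv_freq` buffer) stays on CPU, so the `bmm` inside `rotary_emb.forward` mixes devices. Below is a minimal, unverified workaround sketch: the attribute paths (`model.llm`, `llm.model.rotary_emb`) and the helper name `align_rotary_device` are assumptions about this checkpoint's layout, not confirmed llm-awq API.

```python
# Unverified workaround sketch: move the model-level rotary embedding onto
# the same device as the calibration samples before the forward pass in
# awq/quantize/pre_quant.py (run_awq, near line 136 in the traceback above).
# Attribute paths below are assumptions about NVILA's model layout.

def align_rotary_device(model):
    llm = getattr(model, "llm", model)        # NVILA wraps its language model in .llm
    target = next(llm.parameters()).device    # device run_awq sends the samples to
    rotary = getattr(llm.model, "rotary_emb", None)  # model-level Qwen2 rotary embedding
    if rotary is not None:
        rotary.to(target)                     # .to() also moves the inv_freq buffer
```

If that is the cause, calling `align_rotary_device(model)` right before `model.llm(samples.to(...))` should put both `bmm` operands on one device. Alternatives that may also work: moving the whole language model with `model.llm.cuda()` before `run_awq`, or pinning transformers to the version the VILA repo specifies, since older Qwen2 code kept the rotary embedding inside each attention layer rather than at the model level.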
