-
Hey, I am running similar experiments and have the following observations:
-
Hi, I’m fine-tuning an LLM on my data using SFTTrainer, bitsandbytes quantization, and PEFT, with configs along the lines of those listed below. When I convert the model to GGUF for CPU inference, its performance drops significantly. Any idea what the problem could be?
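The actual configs were not preserved in this thread, so here is only a minimal sketch of a typical QLoRA-style setup of this kind (legacy TRL API); every model name, path, and hyperparameter below is an assumption, not the poster's actual settings:

```python
# Hypothetical reconstruction -- all names and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Typical 4-bit NF4 quantization settings for QLoRA-style training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Illustrative LoRA config; rank/alpha/targets vary per setup
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=load_dataset("json", data_files="train.jsonl", split="train"),
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
trainer.save_model("adapter_out")  # writes only the LoRA adapter weights
```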
I do the conversion to GGUF in the following way. First, I merge the trained adapter with the base model (see the sketch below). Then the merged model is converted to GGUF using llama.cpp’s convert.py script (full command below) with q8_0 quantization; I tested other quantization types without success. I also tried the conversion with Unsloth, likewise without a positive result.
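The merge step itself isn’t shown in the post; a minimal sketch using the standard PEFT `merge_and_unload` API might look like this (model name and paths are placeholders, not taken from the original post):

```python
# Sketch of the merge step; all names/paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model in half precision first. Merging a LoRA adapter
# directly into a bitsandbytes 4-bit model folds the deltas into
# already-quantized weights, which is a common source of quality loss.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "adapter_out").merge_and_unload()
merged.save_pretrained("merged_model")

# Save the tokenizer alongside so the converter finds the vocab in one place
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("merged_model")
```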
```
python convert.py <MERGED_MODEL_PATH> \
    --outfile <OUTPUT_MODEL_NAME.gguf> \
    --outtype q8_0 \
    --vocab_dir <ADAPTER_MODEL_PATH>
```
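One way to narrow down where the quality drop happens (a suggestion, not part of the original post): evaluate the merged FP16 model with plain transformers before converting. If the output is already degraded at this stage, the merge or training is at fault; if it only degrades in the GGUF, suspect the conversion or quantization step.

```python
# Suggested sanity check (not from the original post): generate with the
# merged FP16 model in plain transformers before converting to GGUF.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("merged_model")
model = AutoModelForCausalLM.from_pretrained(
    "merged_model", torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # substitute one of your evaluation prompts
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```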