
Feature Request: Direct FP8 conversion from convert_hf_to_gguf.py #14762

@bmtwl

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hello, I've long wanted a way to go straight from FP8 models to a Q8 GGUF, and took a swing at it by modifying prepare_tensors to track the FP8 tensors, pair each one with its scale tensor, and multiply the two into a float32 tensor: https://github.com/bmtwl/llama.cpp/blob/convert-hf-from-fp8/convert_hf_to_gguf.py
The conversion starts correctly, but fails at write time due to a size/shape mismatch, which is where I'm losing the plot:
RuntimeError: The size of tensor a (18432) must match the size of tensor b (144) at non-singleton dimension 1
I'm hoping the remaining issue is something small. Anyone want to take a look and figure out where I'm going wrong?
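For what it's worth, the mismatch is a clean factor of 128 (18432 = 144 × 128), which points at block-wise scales: the checkpoint stores one scale value per 128×128 block, so the scale tensor has to be expanded to the weight's full shape before the multiply. Here is a minimal dequantization sketch under that assumption (the block size of 128 and the e4m3 storage format are guesses about the checkpoint layout, not confirmed):

```python
import torch

def dequant_fp8_blockwise(weight: torch.Tensor, scale: torch.Tensor,
                          block_size: int = 128) -> torch.Tensor:
    # weight: (M, N), stored as float8_e4m3fn (assumed)
    # scale:  (ceil(M / block_size), ceil(N / block_size)), one value per block
    # Expand each per-block scale over its whole block; multiplying the raw
    # (M // 128, N // 128) scale into an (M, N) weight is exactly what raises
    # "size of tensor a (18432) must match the size of tensor b (144)".
    scale_full = scale.repeat_interleave(block_size, dim=0) \
                      .repeat_interleave(block_size, dim=1)
    # Trim in case M or N is not an exact multiple of block_size.
    scale_full = scale_full[: weight.shape[0], : weight.shape[1]]
    return weight.to(torch.float32) * scale_full.to(torch.float32)
```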

Motivation

Right now, FP8 safetensors need to be converted to BF16 before being quantized into anything else, a step that is theoretically unnecessary.

Possible Implementation

Deal with the scale tensors during conversion, and force the FP8 tensors into another format (e.g. float32) before quantizing.
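One way that could slot into the conversion loop (a hedged sketch, not the author's branch: the `.weight_scale_inv` suffix, the tensor ordering, and the pairing logic are all assumptions about the checkpoint layout):

```python
import torch

# Relies on dequant_fp8_blockwise() from the sketch above.
pending_scales: dict[str, torch.Tensor] = {}

def handle_tensor(name: str, data: torch.Tensor) -> torch.Tensor | None:
    """Pair each FP8 weight with its scale tensor; emit float32 or nothing."""
    if name.endswith(".weight_scale_inv"):
        # Stash the scale under the name of the weight it belongs to
        # (hypothetical naming convention).
        pending_scales[name.removesuffix("_scale_inv")] = data
        return None  # consumed; nothing to write for the scale itself
    if data.dtype == torch.float8_e4m3fn:
        # Assumes the scale tensor was seen before its matching weight.
        scale = pending_scales.pop(name)
        return dequant_fp8_blockwise(data, scale)
    return data  # non-FP8 tensors pass through unchanged
```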
