We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with smaller memory. We also believe that fitting it into a 24GB memory budget benefits a broad range of users who would like to run the model locally on commodity GPUs like the 3090 or 4090.
With the latest update #113, NF4 quantization now runs on SPHINX without errors (i.e., resolving #97). Memory usage is a bit under 23GB, so the model should now fit on a single 24GB GPU (3090, 4090, or A5000), even with ECC turned on.
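For readers curious where the savings come from: NF4 stores each weight as a 4-bit index into a fixed 16-value codebook, plus one absmax scale per block. The sketch below is a toy NumPy illustration of that scheme, not SPHINX's actual code path (which goes through bitsandbytes); the function names `quantize_nf4`/`dequantize_nf4` are made up for this example, and the codebook values are the ones published in the QLoRA paper.

```python
import numpy as np

# 16-level NF4 codebook from the QLoRA paper (quantiles of a standard
# normal, rescaled so the endpoints are -1 and 1).
NF4_CODEBOOK = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)

def quantize_nf4(w, block_size=64):
    """Quantize a flat float array to 4-bit NF4 indices + per-block scales."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)  # one absmax per block
    scales = np.where(scales == 0, 1.0, scales)    # guard all-zero blocks
    normed = w / scales                            # now within [-1, 1]
    # nearest codebook entry for each element
    idx = np.abs(normed[..., None] - NF4_CODEBOOK).argmin(axis=-1)
    return idx.astype(np.uint8), scales

def dequantize_nf4(idx, scales):
    """Recover approximate weights: look up code values, rescale per block."""
    return (NF4_CODEBOOK[idx] * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
idx, scales = quantize_nf4(w)
w_hat = dequantize_nf4(idx, scales)
# Storage: 4 bits per weight + one fp32 scale per 64-weight block
# = 4 + 32/64 = 4.5 bits/weight, vs. 16 bits for fp16 -- roughly the
# 3-4x reduction that lets SPHINX fit on a single 24GB card.
```

In practice two 4-bit indices are packed into one byte and the scales themselves can be quantized (double quantization), which bitsandbytes handles internally.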

We are still running a complete benchmark of the quantized model and will post updates under this issue. Meanwhile, questions are welcome :)