
[Neural_chat] Chat completion is very slow with neuralchat_server #1616

Open
noobHappylife opened this issue Jun 17, 2024 · 0 comments
I noticed that chat completion through neuralchat_server is very slow compared to loading the model with AutoModelForCausalLM and calling generate directly (after applying the chat_template).

- In both cases I'm using the same model and the same quantization config:

```python
# with intel_extension_for_transformers
RtnConfig(compute_dtype="fp32", weight_dtype="int4")
```

```yaml
# yaml config with neuralchat_server
optimization:
    use_neural_speed: true
    optimization_type: "weight_only"
    compute_dtype: "fp32"
    weight_dtype: "int4"
```
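For reference, the faster direct-load baseline described above looks roughly like this. This is a minimal sketch: the model id is a placeholder, and it assumes `intel_extension_for_transformers` is installed with its drop-in `AutoModelForCausalLM`.

```python
# Sketch of the direct-load baseline (not the neuralchat_server path).
# The model id below is a placeholder; substitute the actual model used.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    RtnConfig,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

# Same quantization config as in the comparison above.
quant_config = RtnConfig(compute_dtype="fp32", weight_dtype="int4")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config
)

# Apply the chat template, then generate.
messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```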

Is this slowdown with neuralchat_server expected? Or is there another way to start an OpenAI-compatible API server?
