
[Neural_chat] Chat completion is very slow with neuralchat_server #1616

Open
noobHappylife opened this issue Jun 17, 2024 · 0 comments
I noticed that chat completion through neuralchat_server is very slow compared to loading the model with AutoModelForCausalLM and calling generate directly (after applying the chat_template).

- In both cases I'm using the same model and the same quantization config:

```python
# with intel_extension_for_transformers
RtnConfig(compute_dtype="fp32", weight_dtype="int4")
```

```yaml
# yaml config with neuralchat_server
optimization:
    use_neural_speed: true
    optimization_type: "weight_only"
    compute_dtype: "fp32"
    weight_dtype: "int4"
```
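For reference, the faster direct-load baseline described above looks roughly like this. This is a minimal sketch: the model id is a placeholder, and it assumes `intel_extension_for_transformers` is installed with its drop-in `AutoModelForCausalLM`.

```python
# Sketch of the direct-load baseline (not the neuralchat_server path).
# The model id below is a placeholder; substitute the actual model used.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    RtnConfig,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

# Same quantization config as in the comparison above.
quant_config = RtnConfig(compute_dtype="fp32", weight_dtype="int4")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config
)

# Apply the chat template, then generate.
messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```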

Is this slowdown with neuralchat_server expected? Or is there another way to start an OpenAI-compatible API server?
