-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Gunicorn and use Uvicorn only for gateway #530
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change, can you include a summary of the test plan? I'm worried about any performance impact
"--workers", | ||
f"{num_workers}", | ||
"1", # Let the Kubernetes deployment handle the number of pods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this reduce the amount of traffic we can receive per-pod? Why not keep it at 4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to remove load balancing within pod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we increasing the number of pods by 4 to compensate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's the initial plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh i think llm engine is overprovisioned
In addition to:
Let's also monitor post-rollout 👀 |
Pull Request Summary
in kubernetes environment we don't really need multiple workers in the same pod, rather it's simpler to just have kubernetes autoscale the number of pods. based on some internal benchmarks gunicorn has some known load balancing issues, also removing this layer results in less error and better latency
Test Plan and Usage Guide
will run simple load testing for get requests with and without gunicorn