Add rate limits for LLMs and Embedding Models #520
Conversation
I tried this with default settings turned on. Was using our tier 1 project, just three medium-size PDFs in the index.
…ehints to return properties, add basal tokens to completion
…through, update default rate limits
…rate_limit classes
# RATE_CONFIG keys are tuples, corresponding to a namespace and primary key.
# Anything defined with the MATCH_ALL variable will match all non-matched requests for that namespace.
# For the "get" namespace, all primary key URLs are parsed down to the domain level.
# For example, if you're making a GET request to "https://google.com", "google.com" will get
# its own limit, and it will use the ("get", MATCH_ALL) configuration for its limits.
# machine_id is a unique identifier for the machine making the request; it's used to limit the
# rate of requests per machine. If the primary_key is in the NO_MACHINE_ID_EXTENSIONS list, then
# the dynamic IP of the machine will be used to limit the rate of requests; otherwise the
# user-input machine_id will be used.
Wdyt of moving this directly above RATE_CONFIG? It's nice to keep docs next to their usage
Co-authored-by: James Braza <[email protected]>
…de to not use json, and add warning for users
…n-determinism in crossref
LiteLLM's rate limits weren't suitable for PaperQA because we wanted rate limits that could span models. This PR adds them, with both an in-memory rate limiter and a Redis-based one for rate limiting across processes.
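For intuition, here's a hedged sketch of the two backends using the `limits` library; whether this PR actually builds on `limits` internally is an assumption, but the in-memory-versus-Redis split works the same way:

```python
from limits import RateLimitItemPerSecond
from limits.storage import MemoryStorage, RedisStorage
from limits.strategies import MovingWindowRateLimiter

# In-memory storage confines limits to one process; Redis shares them
# across processes (the Redis URI here is a placeholder).
storage = MemoryStorage()  # or: RedisStorage("redis://localhost:6379")
limiter = MovingWindowRateLimiter(storage)
twenty_per_second = RateLimitItemPerSecond(20)

# hit() consumes from the window and reports whether the request may proceed.
allowed = limiter.hit(twenty_per_second, "machine-1")
```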
The implementation adds a new decorator, `rate_limited`, to the `LiteLLMModel` class across all 4 inference methods. This decorator checks rate limits before (with prompt tokens) and after (with completion tokens) inference. If token counts aren't known (like when using the `*_iter` methods), it estimates them as character count divided by the `CHARACTERS_PER_TOKEN` constant (4). It's technically possible, with low rate limits that don't correspond to a max_token cutoff, for the completion tokens to exceed your maximum allowable tokens in your window of time (say your limit is 20 tokens per second and 100 tokens come back). In this case the rate limiter will wait it out so that your amortized rate falls back to 20 tokens per second.
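As a concrete sketch of that fallback estimate (the function name is hypothetical; only the constant's value of 4 comes from the description above):

```python
CHARACTERS_PER_TOKEN = 4  # estimate used when exact token counts are unavailable

def estimate_token_count(text: str) -> float:
    # Rough token estimate for streaming (*_iter) responses, where the
    # provider's token usage isn't known up front.
    return len(text) / CHARACTERS_PER_TOKEN
```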
The configuration is similar for the `LiteLLMModel` and the `LiteLLMEmbeddingModel`, where you give the `config` attribute a key for rate limits, like this:
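A hedged sketch of what that might look like (the model name, the import path, and the "30000 per 1 minute" rate-string format, which follows the `limits` library's notation, are assumptions):

```python
from paperqa.llms import LiteLLMModel  # import path is an assumption

llm = LiteLLMModel(
    name="gpt-4o-2024-08-06",
    config={
        # Rate limits keyed by model name; string format assumed.
        "rate_limit": {"gpt-4o-2024-08-06": "30000 per 1 minute"},
    },
)
```

or, going through the higher-level `Settings` object instead (also a sketch, assuming `Settings` forwards `llm_config` to the underlying model):

```python
from paperqa import Settings

settings = Settings(
    llm_config={"rate_limit": {"gpt-4o-2024-08-06": "30000 per 1 minute"}},
    summary_llm_config={"rate_limit": {"gpt-4o-2024-08-06": "30000 per 1 minute"}},
)
```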
and for the embedding model:
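A matching sketch for the embedding side (the import path, model name, and the exact shape of the embedding `rate_limit` value, flat string versus per-model mapping, are assumptions):

```python
from paperqa.llms import LiteLLMEmbeddingModel  # import path is an assumption

embedding = LiteLLMEmbeddingModel(
    name="text-embedding-3-small",
    config={"rate_limit": "30000 per 1 minute"},  # flat rate string (assumed shape)
)
```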