Add rate limits for LLMs and Embedding Models #520

Merged: 27 commits merged from the rate-limits branch into main on Oct 4, 2024

Conversation

@mskarlin (Collaborator) commented Oct 3, 2024

LiteLLM's rate limits weren't suitable for PaperQA because we wanted rate limits that could span models. This PR adds them, with both an in-memory rate limiter and a Redis-based one for rate limiting across processes.
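
As a rough sketch of those two backends using the `limits` package (which the RateLimitItemPerSecond objects in the examples below suggest this PR builds on); the Redis URL and the namespace/key strings here are illustrative, not the PR's actual wiring:

```python
from limits import RateLimitItemPerSecond
from limits.storage import storage_from_string
from limits.strategies import MovingWindowRateLimiter

# In-memory storage: limits only apply within the current process.
in_process = MovingWindowRateLimiter(storage_from_string("memory://"))

# Redis storage: the same window is shared by every process pointing at this Redis.
cross_process = MovingWindowRateLimiter(storage_from_string("redis://localhost:6379"))

limit = RateLimitItemPerSecond(20, 1)  # 20 tokens per 1-second window
# `cost` weights the hit, e.g. by token count instead of request count.
if in_process.hit(limit, "client", "gpt-4o-mini", cost=5):
    print("proceed with the request")
else:
    print("rate limited; wait and retry")
```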

The implementation adds a new decorator, rate_limited, to all four inference methods of the LiteLLMModel class. The decorator checks rate limits both before inference (with prompt tokens) and after (with completion tokens). If token counts aren't known (as with the *_iter methods), it estimates them as character count divided by the CHARACTERS_PER_TOKEN constant (4). It's technically possible, with low rate limits that don't correspond to a max_token cutoff, for the completion tokens to exceed your maximum allowable tokens in the window (say your limit is 20 tokens per second and 100 tokens come back). In that case the rate limiter waits it out, so your amortized rate falls back to 20 tokens per second.
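
As a minimal sketch of that amortized behavior (hypothetical names and logic, not the PR's actual rate_limiter.py):

```python
import asyncio
import time

CHARACTERS_PER_TOKEN = 4  # fallback estimate when exact token counts are unknown

def estimate_tokens(text: str) -> int:
    """Estimate tokens from character count, as done for the *_iter methods."""
    return max(1, len(text) // CHARACTERS_PER_TOKEN)

class AmortizedLimiter:
    """Debits tokens against a refilling budget; an oversized completion
    simply forces a longer wait, amortizing back to the configured rate."""

    def __init__(self, tokens_per_second: float) -> None:
        self.rate = tokens_per_second
        self.budget = tokens_per_second  # allow one window's burst up front
        self.last = time.monotonic()

    async def debit(self, tokens: int) -> None:
        now = time.monotonic()
        # Refill for elapsed time, capped at one window's worth of tokens.
        self.budget = min(self.rate, self.budget + (now - self.last) * self.rate)
        self.last = now
        self.budget -= tokens  # may go negative, e.g. 100 tokens vs a 20/s limit
        if self.budget < 0:
            # Sleep off the deficit so the long-run rate returns to `rate`.
            await asyncio.sleep(-self.budget / self.rate)
```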

The configuration is similar for LiteLLMModel and LiteLLMEmbeddingModel: you give the config attribute a rate_limit key, like this:

```python
from limits import RateLimitItemPerSecond  # assuming the `limits` package
from paperqa.llms import LiteLLMModel

llm = LiteLLMModel(
    name="gpt-4o-mini",
    config={"rate_limit": {"gpt-4o-mini": RateLimitItemPerSecond(20, 3)}},
)
```

or

```python
llm = LiteLLMModel(
    name="gpt-4o-mini",
    config={
        "model_list": [
            {
                "model_name": "gpt-4o-mini",
                "litellm_params": {
                    "model": "gpt-4o-mini",
                    "temperature": 0,
                },
            }
        ],
        "rate_limit": {"gpt-4o-mini": RateLimitItemPerSecond(20, 1)},
    },
)
```
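
The model_list block follows LiteLLM's Router configuration format (a model_name plus litellm_params per deployment), while the rate_limit key is the layer this PR adds on top of it.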

and for the embedding model:

```python
embedding = LiteLLMEmbeddingModel(
    name="text-embedding-3-small",
    config={"rate_limit": RateLimitItemPerSecond(20, 5)},
)
```
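
Note that for the embedding model, the rate_limit value is the RateLimitItemPerSecond itself rather than a dict keyed by model name. Assuming these objects come from the `limits` package, the two arguments read as (amount, multiples), so RateLimitItemPerSecond(20, 5) means 20 tokens per 5-second window.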

@whitead (Collaborator) commented Oct 3, 2024

I tried this on pqa ask with three papers, and I get a rate limit error.

I turned on default settings (confusingly, pqa ask defaults to high_quality).

Was using our tier 1 project, just three medium-size PDFs in the index.

@mskarlin (Collaborator, Author) commented Oct 3, 2024

> I tried this on pqa ask with three papers, and I get a rate limit error.
>
> I turned on default settings (confusingly, pqa ask defaults to high_quality).
>
> Was using our tier 1 project, just three medium-size PDFs in the index.

You can ask via `pqa --settings 'tier1_limits' ask 'can pigs fly?'` and you should be good now.

Comment on lines +28 to +36:
```python
# RATE_CONFIG keys are tuples corresponding to a namespace and a primary key.
# Anything defined with the MATCH_ALL variable will match all non-matched requests
# for that namespace. For the "get" namespace, primary-key URLs are parsed down to
# the domain level: for example, a GET request to "https://google.com" gets its own
# counter under "google.com", using the ("get", MATCH_ALL) limit definition.
# machine_id is a unique identifier for the machine making the request; it's used to
# limit the rate of requests per machine. If the primary_key is in the
# NO_MACHINE_ID_EXTENSIONS list, the machine's dynamic IP is used to limit the rate
# of requests; otherwise the user-input machine_id is used.
```
A collaborator commented:

Wdyt of moving this directly above RATE_CONFIG? It's nice to keep docs next to their usage.
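
For readers following along, here is a hypothetical illustration of the structure that comment describes; the keys and limit values are invented, not the PR's actual config:

```python
from limits import RateLimitItemPerSecond

MATCH_ALL = None  # hypothetical sentinel for "any unmatched primary key in this namespace"

RATE_CONFIG = {  # invented values, purely to show the (namespace, primary_key) tuple keys
    ("client", "gpt-4o-mini"): RateLimitItemPerSecond(20, 1),           # per-model limit
    ("get", "api.semanticscholar.org"): RateLimitItemPerSecond(10, 1),  # per-domain GET limit
    ("get", MATCH_ALL): RateLimitItemPerSecond(5, 1),                   # fallback for other domains
}
```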

@jamesbraza (Collaborator) left a comment
@mskarlin merged commit ef1a027 into main on Oct 4, 2024 (5 checks passed)
@mskarlin deleted the rate-limits branch on October 4, 2024 at 22:13
Labels: enhancement (New feature or request), lgtm (approved by a maintainer), size:XXL (changes 1000+ lines, ignoring generated files)