
Automatic Rate Limit Detection For Anthropic doesn't distinguish between input and output tokens #239

Open
RyanMarten opened this issue Dec 10, 2024 · 2 comments

@RyanMarten
Contributor

Here you can see that the litellm backend automatically sets the token rate limit to 480,000, which is the number stored in x-ratelimit-limit-tokens in the response headers.

However, the correct number for output tokens is actually 80,000, which is stored in llm_provider-anthropic-ratelimit-output-tokens-limit.
Also note that llm_provider-anthropic-ratelimit-output-tokens-remaining is provided as well.

(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:33,099 - bespokelabs.curator.request_processor.litellm_online_request_processor - INFO - Getting rate limits for model: claude-3-5-haiku-20241022
(_Completions pid=408316, ip=10.120.7.8) INFO:bespokelabs.curator.request_processor.litellm_online_request_processor:Test call headers: {'x-ratelimit-limit-requests': '4000', 'x-ratelimit-remaining-requests': '3999', 'x-ratelimit-limit-tokens': '480000', 'x-ratelimit-remaining-tokens': '480000', 'llm_provider-date': 'Tue, 10 Dec 2024 17:58:34 GMT', 'llm_provider-content-type': 'application/json', 'llm_provider-transfer-encoding': 'chunked', 'llm_provider-connection': 'keep-alive', 'llm_provider-anthropic-ratelimit-requests-limit': '4000', 'llm_provider-anthropic-ratelimit-requests-remaining': '3999', 'llm_provider-anthropic-ratelimit-requests-reset': '2024-12-10T17:58:33Z', 'llm_provider-anthropic-ratelimit-input-tokens-limit': '400000', 'llm_provider-anthropic-ratelimit-input-tokens-remaining': '400000', 'llm_provider-anthropic-ratelimit-input-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-anthropic-ratelimit-output-tokens-limit': '80000', 'llm_provider-anthropic-ratelimit-output-tokens-remaining': '80000', 'llm_provider-anthropic-ratelimit-output-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-anthropic-ratelimit-tokens-limit': '480000', 'llm_provider-anthropic-ratelimit-tokens-remaining': '480000', 'llm_provider-anthropic-ratelimit-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-request-id': 'req_01TiJL1HyucBHBTnDbkrA4Rm', 'llm_provider-via': '1.1 google', 'llm_provider-cf-cache-status': 'DYNAMIC', 'llm_provider-x-robots-tag': 'none', 'llm_provider-server': 'cloudflare', 'llm_provider-cf-ray': '8eff1fa96b6c60b1-ORD', 'llm_provider-content-encoding': 'gzip', 'llm_provider-x-ratelimit-limit-requests': '4000', 'llm_provider-x-ratelimit-remaining-requests': '3999', 'llm_provider-x-ratelimit-limit-tokens': '480000', 'llm_provider-x-ratelimit-remaining-tokens': '480000'}
(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:34,323 - bespokelabs.curator.request_processor.base_online_request_processor - INFO - Automatically set max_tokens_per_minute to 480000
(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:34,323 - bespokelabs.curator.request_processor.base_online_request_processor - INFO - Automatically set max_requests_per_minute to 4000

Since we use 480,000 instead of 80,000, we quickly run into rate limit errors:

(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:59:08,577 - bespokelabs.curator.request_processor.base_online_request_processor - WARNING - Request 20 failed with Exception litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate limit of 80,000 output tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}, attempts left 5
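
One way to avoid picking up the combined number would be to check for the Anthropic-specific headers first and only fall back to x-ratelimit-limit-tokens when they are absent. A minimal sketch of that idea (the function name and return shape here are illustrative, not curator's actual API):

```python
# Hypothetical sketch (not the actual curator implementation): prefer the
# provider-specific Anthropic headers over the combined x-ratelimit-limit-tokens
# when deriving per-minute token budgets from a test call's response headers.

def get_anthropic_token_limits(headers: dict) -> dict:
    """Return separate input/output token-per-minute limits when available."""
    combined = int(headers.get("x-ratelimit-limit-tokens", 0))
    input_limit = headers.get("llm_provider-anthropic-ratelimit-input-tokens-limit")
    output_limit = headers.get("llm_provider-anthropic-ratelimit-output-tokens-limit")

    if input_limit is not None and output_limit is not None:
        # Anthropic enforces input and output budgets separately, so keep both.
        return {
            "max_input_tokens_per_minute": int(input_limit),
            "max_output_tokens_per_minute": int(output_limit),
        }
    # Fall back to the combined limit for providers that only report one number.
    return {"max_tokens_per_minute": combined}
```

With the headers from the log above, this would yield 400,000 input and 80,000 output tokens per minute instead of a single 480,000 budget.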
@RyanMarten
Contributor Author

This is because x-ratelimit-limit-tokens is the combined input + output limit (400,000 input + 80,000 output = 480,000), whereas Anthropic enforces separate rate limits for input tokens and output tokens individually.

https://docs.anthropic.com/en/api/rate-limits#updated-rate-limits
[Screenshot: Anthropic's updated rate limits table]

RyanMarten changed the title from "Automatic Rate Limit Detection For Anthropic Bug" to "Automatic Rate Limit Detection For Anthropic doesn't distinguish between input and output tokens" on Dec 10, 2024
@RyanMarten
Contributor Author

Based on the above comment, a large part of the issue is that Anthropic uses the requested max_tokens to determine your output token consumption against the rate limit.

https://docs.anthropic.com/en/docs/about-claude/models#model-comparison-table

The maximum output is 8,192 tokens for both Claude 3.5 Haiku and Claude 3.5 Sonnet.

The solution could be setting max_tokens closer to the actual expected size of the completions, but we don't want to cut any completions off.
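
If max_tokens does stay at the model maximum, the rate limiter would at least need to budget it against the output limit separately from the prompt tokens. A rough sketch of what that dual-bucket accounting could look like (the class and numbers are illustrative, not curator's actual implementation):

```python
# Hypothetical throttling sketch: charge each request's estimated prompt tokens
# against the input budget and its max_tokens against the output budget, since
# Anthropic counts the requested max_tokens toward the output-tokens-per-minute limit.

from dataclasses import dataclass

@dataclass
class TokenBuckets:
    input_capacity: float   # tokens remaining this minute for input
    output_capacity: float  # tokens remaining this minute for output

    def can_dispatch(self, prompt_tokens: int, max_tokens: int) -> bool:
        """A request may go out only if both budgets can absorb it."""
        return (self.input_capacity >= prompt_tokens
                and self.output_capacity >= max_tokens)

    def consume(self, prompt_tokens: int, max_tokens: int) -> None:
        self.input_capacity -= prompt_tokens
        self.output_capacity -= max_tokens


# Example with the limits from the headers above: 400k input / 80k output per minute.
buckets = TokenBuckets(input_capacity=400_000, output_capacity=80_000)
# With max_tokens=8192, only 9 requests fit in the output budget per minute
# (80_000 // 8_192 == 9), even though the input budget allows far more.
print(buckets.can_dispatch(prompt_tokens=1_000, max_tokens=8_192))  # True
```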
