
Automatic Rate Limit Detection For Anthropic doesn't distinguish between input and output tokens #239

Open
RyanMarten opened this issue Dec 10, 2024 · 2 comments

@RyanMarten
Contributor

Here you can see that the litellm backend automatically sets the token rate limit to 480,000, which is the number stored in x-ratelimit-limit-tokens in the response headers.

However, the correct number for output tokens is actually 80,000, which is stored in llm_provider-anthropic-ratelimit-output-tokens-limit.
Also note that llm_provider-anthropic-ratelimit-output-tokens-remaining is provided as well.

(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:33,099 - bespokelabs.curator.request_processor.litellm_online_request_processor - INFO - Getting rate limits for model: claude-3-5-haiku-20241022
(_Completions pid=408316, ip=10.120.7.8) INFO:bespokelabs.curator.request_processor.litellm_online_request_processor:Test call headers: {'x-ratelimit-limit-requests': '4000', 'x-ratelimit-remaining-requests': '3999', 'x-ratelimit-limit-tokens': '480000', 'x-ratelimit-remaining-tokens': '480000', 'llm_provider-date': 'Tue, 10 Dec 2024 17:58:34 GMT', 'llm_provider-content-type': 'application/json', 'llm_provider-transfer-encoding': 'chunked', 'llm_provider-connection': 'keep-alive', 'llm_provider-anthropic-ratelimit-requests-limit': '4000', 'llm_provider-anthropic-ratelimit-requests-remaining': '3999', 'llm_provider-anthropic-ratelimit-requests-reset': '2024-12-10T17:58:33Z', 'llm_provider-anthropic-ratelimit-input-tokens-limit': '400000', 'llm_provider-anthropic-ratelimit-input-tokens-remaining': '400000', 'llm_provider-anthropic-ratelimit-input-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-anthropic-ratelimit-output-tokens-limit': '80000', 'llm_provider-anthropic-ratelimit-output-tokens-remaining': '80000', 'llm_provider-anthropic-ratelimit-output-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-anthropic-ratelimit-tokens-limit': '480000', 'llm_provider-anthropic-ratelimit-tokens-remaining': '480000', 'llm_provider-anthropic-ratelimit-tokens-reset': '2024-12-10T17:58:34Z', 'llm_provider-request-id': 'req_01TiJL1HyucBHBTnDbkrA4Rm', 'llm_provider-via': '1.1 google', 'llm_provider-cf-cache-status': 'DYNAMIC', 'llm_provider-x-robots-tag': 'none', 'llm_provider-server': 'cloudflare', 'llm_provider-cf-ray': '8eff1fa96b6c60b1-ORD', 'llm_provider-content-encoding': 'gzip', 'llm_provider-x-ratelimit-limit-requests': '4000', 'llm_provider-x-ratelimit-remaining-requests': '3999', 'llm_provider-x-ratelimit-limit-tokens': '480000', 'llm_provider-x-ratelimit-remaining-tokens': '480000'}
(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:34,323 - bespokelabs.curator.request_processor.base_online_request_processor - INFO - Automatically set max_tokens_per_minute to 480000
(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:58:34,323 - bespokelabs.curator.request_processor.base_online_request_processor - INFO - Automatically set max_requests_per_minute to 4000

Since we use 480,000 instead of 80,000, we quickly run into rate limit errors:

(_Completions pid=408316, ip=10.120.7.8) 2024-12-10 09:59:08,577 - bespokelabs.curator.request_processor.base_online_request_processor - WARNING - Request 20 failed with Exception litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate limit of 80,000 output tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}}, attempts left 5
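
One way to avoid picking up the combined number would be to check for the Anthropic-specific headers first and only fall back to x-ratelimit-limit-tokens when they are absent. A minimal sketch of that idea (the function name and return shape here are illustrative, not curator's actual API):

```python
# Hypothetical sketch (not the actual curator implementation): prefer the
# provider-specific Anthropic headers over the combined x-ratelimit-limit-tokens
# when deriving per-minute token budgets from a test call's response headers.

def get_anthropic_token_limits(headers: dict) -> dict:
    """Return separate input/output token-per-minute limits when available."""
    combined = int(headers.get("x-ratelimit-limit-tokens", 0))
    input_limit = headers.get("llm_provider-anthropic-ratelimit-input-tokens-limit")
    output_limit = headers.get("llm_provider-anthropic-ratelimit-output-tokens-limit")

    if input_limit is not None and output_limit is not None:
        # Anthropic enforces input and output budgets separately, so keep both.
        return {
            "max_input_tokens_per_minute": int(input_limit),
            "max_output_tokens_per_minute": int(output_limit),
        }
    # Fall back to the combined limit for providers that only report one number.
    return {"max_tokens_per_minute": combined}
```

With the headers from the log above, this would yield 400,000 input and 80,000 output tokens per minute instead of a single 480,000 budget.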
@RyanMarten
Contributor Author

This is because x-ratelimit-limit-tokens is the combined input + output limit (400,000 input + 80,000 output = 480,000), whereas Anthropic enforces separate rate limits for input tokens and output tokens individually.

https://docs.anthropic.com/en/api/rate-limits#updated-rate-limits
[Screenshot: Anthropic's updated rate limits table]

RyanMarten changed the title from "Automatic Rate Limit Detection For Anthropic Bug" to "Automatic Rate Limit Detection For Anthropic doesn't distinguish between input and output tokens" on Dec 10, 2024
@RyanMarten
Contributor Author

Based on the above comment, a large part of the issue is that Anthropic uses the requested max_tokens to determine your output token consumption against the rate limit.

https://docs.anthropic.com/en/docs/about-claude/models#model-comparison-table

The maximum output is 8,192 tokens for both Claude 3.5 Haiku and Claude 3.5 Sonnet.

The solution could be setting max_tokens closer to the actual expected size of the completions, but we don't want to cut any completions off.
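
If max_tokens does stay at the model maximum, the rate limiter would at least need to budget it against the output limit separately from the prompt tokens. A rough sketch of what that dual-bucket accounting could look like (the class and numbers are illustrative, not curator's actual implementation):

```python
# Hypothetical throttling sketch: charge each request's estimated prompt tokens
# against the input budget and its max_tokens against the output budget, since
# Anthropic counts the requested max_tokens toward the output-tokens-per-minute limit.

from dataclasses import dataclass

@dataclass
class TokenBuckets:
    input_capacity: float   # tokens remaining this minute for input
    output_capacity: float  # tokens remaining this minute for output

    def can_dispatch(self, prompt_tokens: int, max_tokens: int) -> bool:
        """A request may go out only if both budgets can absorb it."""
        return (self.input_capacity >= prompt_tokens
                and self.output_capacity >= max_tokens)

    def consume(self, prompt_tokens: int, max_tokens: int) -> None:
        self.input_capacity -= prompt_tokens
        self.output_capacity -= max_tokens


# Example with the limits from the headers above: 400k input / 80k output per minute.
buckets = TokenBuckets(input_capacity=400_000, output_capacity=80_000)
# With max_tokens=8192, only 9 requests fit in the output budget per minute
# (80_000 // 8_192 == 9), even though the input budget allows far more.
print(buckets.can_dispatch(prompt_tokens=1_000, max_tokens=8_192))  # True
```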
