
Add limits+coredis to mypy #537

Merged 2 commits on Oct 6, 2024
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -83,10 +83,12 @@ repos:
args: [--pretty, --ignore-missing-imports]
additional_dependencies:
- aiohttp
- coredis
- fhaviary[llm]>=0.6 # Match pyproject.toml
- ldp>=0.9 # Match pyproject.toml
- html2text
- httpx
- limits
- numpy
- openai>=1 # Match pyproject.toml
- pandas-stubs
33 changes: 21 additions & 12 deletions paperqa/rate_limiter.py
@@ -25,6 +25,13 @@

GLOBAL_RATE_LIMITER_TIMEOUT = float(os.environ.get("RATE_LIMITER_TIMOUT", "60"))

MATCH_ALL = None
MatchAllInputs = Literal[None]
MATCH_MACHINE_ID = "<machine_id>"

FALLBACK_RATE_LIMIT = RateLimitItemPerSecond(3, 1)
TOKEN_FALLBACK_RATE_LIMIT = RateLimitItemPerMinute(30_000, 1)

# RATE_CONFIG keys are tuples, corresponding to a namespace and primary key.
# Anything defined with MATCH_ALL variable, will match all non-matched requests for that namespace.
# For the "get" namespace, all primary key urls will be parsed down to the domain level.
@@ -35,13 +42,6 @@
# the dynamic IP of the machine will be used to limit the rate of requests, otherwise the
# user input machine_id will be used.

MATCH_ALL = None
MatchAllInputs = Literal[None]
MATCH_MACHINE_ID = "<machine_id>"

FALLBACK_RATE_LIMIT = RateLimitItemPerSecond(3, 1)
TOKEN_FALLBACK_RATE_LIMIT = RateLimitItemPerMinute(30_000, 1)

RATE_CONFIG: dict[tuple[str, str | MatchAllInputs], RateLimitItem] = {
("get", CROSSREF_BASE_URL): RateLimitItemPerSecond(30, 1),
("get", SEMANTIC_SCHOLAR_BASE_URL): RateLimitItemPerSecond(15, 1),
@@ -132,8 +132,8 @@ def rate_limiter(self) -> MovingWindowRateLimiter:
return self._rate_limiter

async def parse_namespace_and_primary_key(
self, namespace_and_key: tuple[str, str | MatchAllInputs], machine_id: int = 0
) -> tuple[str, str | MatchAllInputs]:
self, namespace_and_key: tuple[str, str], machine_id: int = 0
) -> tuple[str, str]:
"""Turn namespace_and_key tuple into a namespace and primary-key.

If using a namespace starting with "get", then the primary key will be url parsed.
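The docstring notes that "get" primary keys are URL-parsed down to the domain. That stripping can be done with the stdlib; a sketch (not necessarily how paperqa implements it):

```python
from urllib.parse import urlparse

def primary_key_to_domain(url: str) -> str:
    # Strip a full URL down to its host, so e.g. two different endpoints
    # on the same API share one rate-limit bucket.
    # urlparse("api.crossref.org").netloc is "" when there is no scheme,
    # hence the fallback to the input itself for bare domains.
    return urlparse(url).netloc or url
```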
@@ -240,10 +240,15 @@ async def get_rate_limit_keys(
if not (host and port):
raise ValueError(f'Invalid REDIS_URL: {os.environ.get("REDIS_URL")}.')

if not isinstance(self.storage, RedisStorage):
raise NotImplementedError(
"get_rate_limit_keys only works with RedisStorage."
)

client = Redis(host=host, port=int(port))

try:
cursor = b"0"
cursor: int | bytes = b"0"
matching_keys: list[bytes] = []
while cursor:
cursor, keys = await client.scan(
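The loop above follows Redis's standard SCAN contract: the cursor starts at 0, each call returns a new cursor plus a page of keys, and iteration is complete when the cursor comes back as 0. The new `cursor: int | bytes` annotation exists because the initial sentinel is bytes while the client may return ints. A stdlib-only sketch of that pagination pattern with an injected scan function (the fake scan below is illustrative only, not the coredis API):

```python
import asyncio
from typing import Awaitable, Callable

async def scan_all_keys(
    scan: Callable[[int], Awaitable[tuple[int, list[bytes]]]],
) -> list[bytes]:
    """Collect all keys via cursor pagination, stopping when the cursor returns to 0."""
    keys: list[bytes] = []
    cursor = 0
    while True:
        cursor, page = await scan(cursor)
        keys.extend(page)
        if cursor == 0:  # Redis signals a completed iteration with cursor 0
            break
    return keys

# Fake two-page scan, for illustration only.
_PAGES = {0: (7, [b"limiter/get", b"limiter/client"]), 7: (0, [b"limiter/tool"])}

async def fake_scan(cursor: int) -> tuple[int, list[bytes]]:
    return _PAGES[cursor]

keys = asyncio.run(scan_all_keys(fake_scan))
```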
@@ -261,6 +266,10 @@ def get_in_memory_limit_keys(
self,
) -> list[tuple[RateLimitItem, tuple[str, str | MatchAllInputs]]]:
"""Returns a list of current RateLimitItems with tuples of namespace and primary key."""
if not isinstance(self.storage, MemoryStorage):
raise NotImplementedError(
"get_in_memory_limit_keys only works with MemoryStorage."
)
return [self.parse_key(key) for key in self.storage.events]

async def get_limit_keys(
@@ -291,7 +300,7 @@ async def rate_limit_status(self):

async def try_acquire(
self,
namespace_and_key: tuple[str, str | MatchAllInputs],
namespace_and_key: tuple[str, str],
rate_limit: RateLimitItem | str | None = None,
machine_id: int = 0,
acquire_timeout: float = GLOBAL_RATE_LIMITER_TIMEOUT,
@@ -301,7 +310,7 @@ async def try_acquire(
"""Returns when the limit is satisfied for the namespace_and_key.

Args:
namespace_and_key (:obj:`tuple[str, str | MatchAllInputs]`): is
namespace_and_key (:obj:`tuple[str, str]`): is
Collaborator: What is this :obj: syntax here? Normally it's good to not restate types in the docstring if they're in the signature already.

Collaborator (Author): It's in the style guide for Google docstrings, I think for Sphinx compatibility.

Collaborator: Fwiw, either Sphinx autodoc or napoleon (I can't remember which) will pull in type hints if they are there.

https://google.github.io/styleguide/pyguide.html#383-functions-and-methods — note that it does not restate type hints in Args or Returns.

https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html — in this one, note there are no type hints at all.

Anyway, moral of the story: I have seen it where people later update or expand type hints, but then don't update the repeated version in the docstring Args or Returns.

composed of a tuple with namespace (e.g. "get") and a primary-key
(e.g. "arxiv.org"). namespaces can be nested with multiple '|',
primary-keys in the "get" namespace will be stripped to the domain.
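try_acquire returns once the moving-window limit admits the request (or acquire_timeout elapses). paperqa delegates the window bookkeeping to the limits library's MovingWindowRateLimiter; purely for intuition, here is a minimal synchronous sketch of a moving-window check with an injectable clock (hypothetical helper, not the library's implementation):

```python
from collections import deque
from typing import Callable

class MovingWindow:
    """Allow at most `amount` hits per `window_seconds`, sliding continuously."""

    def __init__(self, amount: int, window_seconds: float,
                 clock: Callable[[], float]):
        self.amount = amount
        self.window = window_seconds
        self.clock = clock  # injectable for deterministic testing
        self.hits: deque[float] = deque()

    def try_hit(self) -> bool:
        now = self.clock()
        # Drop hits that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) < self.amount:
            self.hits.append(now)
            return True
        return False
```

With a 3-per-second window, three immediate hits succeed, a fourth is rejected, and a hit a little over one second later succeeds again once the earliest timestamps slide out of the window.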