
[stubgen] Overloaded signatures are dropped if TokenError is encountered while parsing docstrings. #18388

Open
gareth-cross opened this issue Dec 30, 2024 · 0 comments
Labels: bug (mypy got something wrong), topic-stubgen

gareth-cross commented Dec 30, 2024

Bug Report

While processing pybind modules, stubgen inspects the docstring in order to determine the possible function signatures (there may be many if the function is overloaded). During inspection, tokenize.tokenize is invoked and TokenError is suppressed:

mypy/mypy/stubdoc.py, lines 349 to 358 in 55d4c17:

state = DocStringParser(name)
# Return all found signatures, even if there is a parse error after some are found.
with contextlib.suppress(tokenize.TokenError):
    try:
        tokens = tokenize.tokenize(io.BytesIO(docstr.encode("utf-8")).readline)
        for token in tokens:
            state.add_token(token)
    except IndentationError:
        return None
sigs = state.get_signatures()

However, some tokenization errors prevent detection of later function signatures. For example, an unterminated string literal is perfectly valid prose in the context of a Python docstring, but it causes tokenization to stop.
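
A minimal stdlib-only illustration (the docstring text here is a made-up sample, not mypy code): on CPython 3.12 the `tokenize.tokenize` generator raises `TokenError` at the unterminated quote and stops, so nothing after it is ever seen; older CPython versions may instead emit an `ERRORTOKEN` and keep going.

```python
import io
import tokenize

# Illustrative docstring fragment: the stray double quote never closes,
# which is fine for prose but stops the tokenizer on CPython 3.12.
docstr = (
    "1. thing(x: int) -> None\n"
    "\n"
    '"An unterminated quote appears in the prose here\n'
    "\n"
    "2. thing(x: int, y: int) -> str\n"
)

names = []
try:
    for tok in tokenize.tokenize(io.BytesIO(docstr.encode("utf-8")).readline):
        if tok.type == tokenize.NAME:
            names.append(tok.string)
except tokenize.TokenError as err:
    # Observed on CPython 3.12: "unterminated string literal ..."
    print("tokenization stopped:", err)

print(names)  # on 3.12, only names from before the bad quote (no 'y')
```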

To Reproduce

The following docstring should trigger this behavior:

def thing():
  """
  thing(*args, **kwargs)
  Overloaded function.

  1. thing(x: int) -> None

  This is a valid docstring. "We do not need to terminate this string literal on this line.

  2. thing(x: int, y: int) -> str

  This signature will never get parsed due to TokenError.
  """

Tokenizing the example above terminates with an "unterminated string literal" error before overload 2 is reached, resulting in a missing signature. The resulting signatures produced by infer_sig_from_docstring are:

[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
 FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]

Alternatively, an RST math block in the docstring will also trigger this behavior:

def thing():
  """
  thing(*args, **kwargs)
  Overloaded function.

  1. thing(x: int) -> None

  .. math::
    \mathbf{x} = 3 \cdot \mathbf{y}

  2. thing(x: int, y: int) -> str

  This signature will never get parsed due to TokenError.
  """

The second signature is never parsed, this time due to an "unexpected character after line continuation character" error.
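
The backslash case can be demonstrated the same way (again with an illustrative docstring, not mypy code): the tokenizer reads the backslash in `\mathbf` as a line-continuation character, and on CPython 3.12 this raises `TokenError` before the second overload is reached.

```python
import io
import tokenize

# Illustrative docstring containing an RST math block; "\mathbf" starts
# with a backslash that is not at end-of-line.
docstr = (
    "1. thing(x: int) -> None\n"
    "\n"
    ".. math::\n"
    "  \\mathbf{x} = 3 * \\mathbf{y}\n"
    "\n"
    "2. thing(x: int, y: int) -> str\n"
)

names = []
try:
    for tok in tokenize.tokenize(io.BytesIO(docstr.encode("utf-8")).readline):
        if tok.type == tokenize.NAME:
            names.append(tok.string)
except tokenize.TokenError as err:
    # Observed on CPython 3.12:
    # "unexpected character after line continuation character"
    print("tokenization stopped:", err)

print(names)
```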

Expected Behavior

Ideally, all signatures would be detected. That said, some failures are understandable, since essentially arbitrary content can appear in a docstring.

Actual Behavior

The first signature is extracted, but subsequent signatures are not detected. My guess is that this happens because of the tokenization error produced by the documentation text of the first overload.

Your Environment

  • Mypy version used: 1.14
  • Mypy command-line flags: --package --output
  • Mypy configuration options from mypy.ini (and other config files): None
  • Python version used: 3.12

Possible Fix

The following might be a viable fix. I tried changing the logic to resume tokenization after errors (provided there is data remaining):

# Keep tokenizing after an error. If `TokenError` is encountered, tokenize() will
# stop. We check the remaining bytes in bytes_io and resume tokenizing on the next
# loop iteration.
encoded_docstr = docstr.encode("utf-8")
bytes_io = io.BytesIO(encoded_docstr)
while bytes_io.tell() < len(encoded_docstr):
    # Return all found signatures, even if there is a parse error after some are found.
    with contextlib.suppress(tokenize.TokenError):
        try:
            tokens = tokenize.tokenize(bytes_io.readline)
            for token in tokens:
                state.add_token(token)
        except IndentationError:
            return None

On both of my examples above, this produces the correct number of signatures. infer_sig_from_docstring returns:

[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
 FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False), ArgSig(name='y', type='int', default=False)], ret_type='str', type_args=''),
 FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]
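
The resume-after-error loop can also be exercised as a self-contained sketch (the sample docstring and variable names are illustrative, not mypy's actual code); after a `TokenError`, the `BytesIO` position sits past the offending line, so the next pass picks up from there.

```python
import io
import tokenize

docstr = (
    "1. thing(x: int) -> None\n"
    "\n"
    '"An unterminated quote stops the tokenizer here\n'
    "\n"
    "2. thing(x: int, y: int) -> str\n"
)

names = []  # NAME tokens collected across all passes
encoded = docstr.encode("utf-8")
bytes_io = io.BytesIO(encoded)
# Keep tokenizing until the stream is exhausted; the readline callback
# advances the stream position, so each retry resumes further along.
while bytes_io.tell() < len(encoded):
    try:
        for tok in tokenize.tokenize(bytes_io.readline):
            if tok.type == tokenize.NAME:
                names.append(tok.string)
    except tokenize.TokenError:
        pass  # resume on the next loop iteration

print(names)  # tokens from both overloads are recovered, including 'y'
```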

If you are amenable to this solution, I can open a PR.
