Skip to content

Proposed rules: encourage use of re.compile on constants in module scope, forbid it in function scope #536

@sirosen

Description

@sirosen

I'd be interested in writing checks for this, if it would be acceptable.
Please just tell me / close this issue if these rules would not be welcome.

I often see re usage which would be made better by (1) using re.compile (rather than the free functions which take a string) or (2) moving re.compile from function scope out to module scope.1

Of course, for these changes to be correct, the regex string needs to be constant, but that's the common case.
The regex flags also need to be considered.

Example 1, encourage moving away from the free functions when a pattern can be pre-compiled:

# okay
def foo(x):
    return re.search(r"\w+", x)

# better
_WORD_PAT = re.compile(r"\w+")
def foo(x):
    return _WORD_PAT.search(x)

Example 2, there's no real reason to do it this way:

# bad, why write this? compile out in module scope
def foo(x):
    pat = re.compile(r"\w+")
    return pat.search(x)

(fix is the same as Example 1)

Example 3, can we handle flags? What if they're ORed together?

def foo(x):
    pat = re.compile(r"[abc\n]+", flags=re.I | re.M)
    return pat.search(x)

I therefore suggest two rules, one on by default and one in the opinionated category:

  • Opinionated: You should use re.compile() instead of {re.search, re.match, re.fullmatch, re.split, re.findall, re.finditer, re.sub, re.subn} on a string
  • Non-opinionated: You should not call re.compile() on a constant in function scope -- do it in class or module scope

I'd suggest starting with turning these rules on only where flags is omitted from the call, so the 1-argument form of re.compile and the 2-argument forms of the various functions. flags handling can be a later enhancement.


Implementation should be simple. Detecting these names would be similar to B017 -- just look for a node named "re" and don't worry about the name being rebound in funny ways.

Footnotes

  1. Yes, it can be okay if the re cache covers your needs. But I've seen this in library code, where you cannot make any assumptions about how many regexes are being compiled and put in the cache.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions