`mypy` is a project that does static type checking on python code, according to type hints in the code (see PEP 484). `bazel-mypy-integration` is a project that enables `mypy` type-checking for python targets in bazel. `bazel-mypy-integration` understands the python dependency graph represented in bazel: essentially, it creates a `mypy ...` invocation for a given set of `.py` files, as well as for the dependencies of those files.

However, the `mypy` invocation does not handle caching of dependencies at all. For example, given:
```python
py_library(
    name = "lib1",
    srcs = [
        "lib1_a.py",
        "lib1_b.py",
        "lib1_c.py",
    ],
)

py_library(
    name = "lib2",
    srcs = [
        "lib2.py",
    ],
    deps = [
        ":lib1",
    ],
)

mypy_test(  # This is a concrete mypy test for lib1
    name = "lib1_mypy_test",
    deps = [":lib1"],
)

mypy_test(
    name = "lib2_mypy_test",
    deps = [":lib2"],
)
```
Running e.g. `bazel test :lib1_mypy_test :lib2_mypy_test` will result in two `mypy` invocations which will both parse (and potentially test, depending on your `mypy.ini`) all the source files involved, transitively: e.g. `mypy ... -- lib1_a.py lib1_b.py lib1_c.py` and `mypy ... -- lib1_a.py lib1_b.py lib1_c.py lib2.py`. No state is shared between these invocations, meaning that there is duplicated effort in parsing the files they have in common.
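The duplicated work can be illustrated with a small sketch (not part of `bazel-mypy-integration`; the target table and helper below are hypothetical) that computes the transitive source set each `mypy` invocation would receive:

```python
# Hypothetical dependency table mirroring the BUILD example above.
targets = {
    "lib1": {"srcs": ["lib1_a.py", "lib1_b.py", "lib1_c.py"], "deps": []},
    "lib2": {"srcs": ["lib2.py"], "deps": ["lib1"]},
}

def transitive_srcs(target):
    """All .py files a mypy_test for `target` would ask mypy to parse."""
    out = list(targets[target]["srcs"])
    for dep in targets[target]["deps"]:
        out.extend(transitive_srcs(dep))
    return out

print(transitive_srcs("lib1"))  # ['lib1_a.py', 'lib1_b.py', 'lib1_c.py']
print(transitive_srcs("lib2"))  # ['lib2.py', 'lib1_a.py', 'lib1_b.py', 'lib1_c.py']
# lib1's three files appear in both invocations, and no state is shared
# between them, so they are parsed twice.
```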
Worse still, `mypy` ships with type stubs for all of the python standard library (via `typeshed`). If `lib*.py` imports anything from the python standard library, `mypy` further parses the stub files, again without caching anything.
Currently, `bazel-mypy-integration` passes complete paths to source files, relative to the workspace root. This introduces a problem with how bazel `py_library` targets can operate. Here is an example that illustrates the problem:
```python
py_library(
    name = "lib1",
    srcs = [
        "srcs/lib1.py",
        "srcs/internal/lib1_internal.py",
    ],
    imports = [
        "srcs",
    ],
)
```
The above example adds the workspace-relative path `srcs` to the `PYTHONPATH` of all dependents of `lib1`. Dependents can depend on this target and do `import lib1`, and everything works. Furthermore, `lib1.py` can do `from internal import lib1_internal`, and that works fine as well. Now let's examine the `mypy` command line produced by `bazel-mypy-integration` for this library:
```shell
MYPYPATH=$PWD:srcs/ mypy ...args... -- srcs/lib1.py srcs/internal/lib1_internal.py
```
This leads to an exception in `mypy` (versions >= 0.780) which mentions "Source file found twice" (see issue #20). The issue stems from how `mypy` treats the source file arguments: it expects that if a source file is listed as `srcs/internal/lib1_internal.py`, then it must be a module with the corresponding path, `srcs.internal.lib1_internal`. Since in the above example `srcs/lib1.py` does `from internal import lib1_internal`, and since `mypy` checks the uniqueness of the modules it imports, it fails when it encounters the identical modules `srcs.internal.lib1_internal` (from the command line) and `internal.lib1_internal` (from an `import` statement).
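The collision can be sketched roughly like this (the path-to-module mapping below is a simplification of what `mypy` actually does, for illustration only):

```python
def path_to_module(path):
    # Simplified sketch: how a module name can be derived from a
    # command-line source path when no package root strips a prefix.
    return path[: -len(".py")].replace("/", ".")

cli_module = path_to_module("srcs/internal/lib1_internal.py")
imported_module = "internal.lib1_internal"  # from `from internal import lib1_internal`

print(cli_module)  # srcs.internal.lib1_internal
# Both names resolve to the same file on disk, which is why mypy >= 0.780
# raises the "Source file found twice" error.
```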
This means that `bazel-mypy-integration`'s `mypy` version is fixed at 0.750 to avoid this problem. This is not a bug in `mypy` (see the discussion in https://github.com/python/mypy/issues/8944).
`mypy` natively supports caching of type data, through the use of `--cache-dir` and a hidden option called `--cache-map`. `bazel-mypy-integration` takes advantage of fixed cache locations for each module, using argument triples of the form `--cache-map path/to/lib1_a.py <path to mypy's lib1_a.meta.json> <path to mypy's lib1_a.data.json>`. These paths represent both where `mypy` should generate the metadata for a parsed source file, and where to find existing generated metadata (for dependencies, for example). In our examples above, this would translate into `--cache-map` triples for each of the 3 source files in `lib1`, and 4 source files for `lib2` (its own, and its dependencies' source files).
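As a sketch, the triples could be assembled like this (the helper and the cache directory layout are assumptions for illustration, not necessarily what `bazel-mypy-integration` generates):

```python
import os

def cache_map_args(srcs, cache_dir):
    """Sketch: build --cache-map triples for a list of source files.

    Each source file gets a fixed .meta.json/.data.json location derived
    from its path, so dependents can find the same cache entries.
    """
    args = []
    for src in srcs:
        stem = os.path.join(cache_dir, os.path.splitext(src)[0])
        args += ["--cache-map", src, stem + ".meta.json", stem + ".data.json"]
    return args

print(cache_map_args(["lib1_a.py"], "mypy_cache"))
# ['--cache-map', 'lib1_a.py', 'mypy_cache/lib1_a.meta.json', 'mypy_cache/lib1_a.data.json']
```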
By capturing the generated `.meta.json`/`.data.json` pairs as part of the rule invocation, we can propagate `mypy`'s generated metadata to dependents. This means that `mypy` does not have to re-generate the metadata by re-parsing the dependencies' source files. This results in markedly faster performance, even in shallow dependency trees. It's worth noting that the `mypy` docs mention this improved performance as well, when caching is enabled.
Similar to the above, typeshed stub parsing can be sped up by propagating the same cache triples and `mypy` `.meta.json`/`.data.json` pairs for all of the stdlib (which can be a large number of files).
In order to accomplish this, we need to treat the mypy stubs a little differently. `typeshed` is an implicit, internal dependency of `mypy`, meaning that the `typeshed` pip package is not represented in bazel at all; it's only available by way of the `mypy` pip package. To tease out the valid `typeshed` stubs from `requirement("mypy")`, we need to trawl through the files in that package and select the `typeshed` ones based on their paths.
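A minimal sketch of that path-based filtering (the file list and the `typeshed` path prefix are illustrative assumptions about the layout of the `mypy` pip package):

```python
def typeshed_stubs(mypy_package_files):
    """Sketch: pick the typeshed stub files out of the mypy package's
    file list by path. The "/typeshed/" prefix is an assumption."""
    return [f for f in mypy_package_files
            if "/typeshed/" in f and f.endswith(".pyi")]

files = [
    "mypy/checker.py",
    "mypy/typeshed/stdlib/os/__init__.pyi",
    "mypy/typeshed/stdlib/sys.pyi",
]
print(typeshed_stubs(files))
# ['mypy/typeshed/stdlib/os/__init__.pyi', 'mypy/typeshed/stdlib/sys.pyi']
```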
For this purpose, a new rule is introduced called `mypy_stdlib_cache_library`, which is similar to `mypy_aspect` and `mypy_test`, but has an implementation that deals with `typeshed` stubs only. There needs to be a singleton instance of this target, defined as part of `bazel-mypy-integration`. This singleton is a dependency of all mypy targets.
To solve the problem described in 2., and to make it possible in general for `bazel-mypy-integration` to work with `py_*` targets that specify an `imports` attribute, we can change the above invocation to operate on modules rather than directly on source files:

```shell
MYPYPATH=$PWD:srcs/ mypy ...args... -- -m lib1 -m internal.lib1_internal
```
Now, `mypy` finds the `import` lines to match exactly the `-m` arguments it has received on the command line, which means there are no duplicate source file errors. In addition, the metadata generated by `mypy` will refer to the correct module paths (i.e., the generated metadata will know that it refers to `internal.lib1_internal` and not `srcs.internal.lib1_internal`). This is correct for dependents as well.
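A sketch of how a workspace-relative source path plus the target's `imports` roots could be turned into a `-m` argument (hypothetical helper, not `bazel-mypy-integration`'s actual implementation):

```python
def module_arg(src, imports_roots):
    """Sketch: derive mypy's -m argument from a source path, stripping the
    first matching `imports` root (mirroring how that root extends PYTHONPATH)."""
    for root in imports_roots:
        prefix = root.rstrip("/") + "/"
        if src.startswith(prefix):
            src = src[len(prefix):]
            break
    mod = src[: -len(".py")] if src.endswith(".py") else src
    if mod.endswith("/__init__"):
        mod = mod[: -len("/__init__")]  # package module: drop __init__
    return ["-m", mod.replace("/", ".")]

print(module_arg("srcs/internal/lib1_internal.py", ["srcs"]))
# ['-m', 'internal.lib1_internal']
print(module_arg("srcs/lib1.py", ["srcs"]))
# ['-m', 'lib1']
```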
In bazel, there is nothing stopping the same source file(s) from being part of multiple `py_library` targets. Consider for example:

```python
py_library(
    name = "a",
    srcs = ["a.py"],
)

py_library(
    name = "a_prime",
    srcs = ["a.py"],
)

py_binary(
    name = "bin",
    srcs = ["bin.py"],
    deps = [
        ":a",
        ":a_prime",
    ],
)
```
There is nothing preventing this situation, and it's perfectly valid in terms of bazel dependencies and python rules. The runfiles will contain `a.py`, and all is fine.

However, `mypy` expects `--cache-map` arguments for each python source file, and it does not accept duplicate `--cache-map` arguments pointing at the same `a.py` (it exits with an error in this case; all `--cache-map` source files must be unique). Since each python source file must have a unique `--cache-map` argument, we have a dilemma: both `a` and `a_prime` specify the same source. If we were to operate as usual, both sets of `--cache-map` triples would be in the transitive set of `--cache-map` arguments, which fails.
The solution is to just pick one cache map argument (the first one encountered). This is not a perfect solution, though, since it is possible for the same python source file, at the exact same location, to produce a different set of `mypy` metadata (for example, if the python target's `imports` path was different). But this seems pathological enough not to worry about. Just picking the one cache map argument is fine because:

- if `bin.py` imports `a`, and there is no difference in cached metadata, then the cached metadata will be used correctly.
- if `bin.py` imports `a`, and there is a difference in cached metadata (for example, a difference in module name), then `mypy` will just regenerate the metadata for `a.py`.
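The pick-the-first strategy can be sketched as follows (hypothetical data structures; the real rule implementation works with bazel `File` objects, not plain tuples):

```python
def dedupe_cache_maps(triples):
    """Sketch: keep only the first --cache-map triple seen per source file,
    since mypy rejects duplicate --cache-map source entries."""
    seen = set()
    out = []
    for src, meta, data in triples:
        if src not in seen:
            seen.add(src)
            out.append((src, meta, data))
    return out

triples = [
    ("a.py", "a/a.meta.json", "a/a.data.json"),              # from target :a
    ("a.py", "a_prime/a.meta.json", "a_prime/a.data.json"),  # from :a_prime
    ("bin.py", "bin/bin.meta.json", "bin/bin.data.json"),
]
print(dedupe_cache_maps(triples))
# :a's triple wins; :a_prime's duplicate for a.py is dropped.
```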
The above scenario breaks down when there are multiple `imports` specified in a single `py_library`:
```python
py_library(
    name = "lib1",
    srcs = [
        "srcs/lib1.py",
        "srcs/internal/lib1_internal.py",
    ],
    imports = [
        "srcs",
        "srcs/internal",
    ],
)
```
If the above looks odd, it kind of is, because it specifies that all modules under `srcs` and under `srcs/internal` can be `import`ed directly. This is not a convention in python, and it seems like kind of an edge case.
In any case, handling it with `bazel-mypy-integration` does not seem clear, for the following reason: remember that the `--cache-map` arguments to `mypy` take a source file path, a `.meta.json` path, and a `.data.json` path. The module path is encoded in the generated metadata. If the same source file can lead to multiple module paths (`internal.lib1_internal` and `lib1_internal`), that's more information than can be represented in the `--cache-map` argument, because it requires a single source file (this seems like an oversight in the design of `mypy`). In other words, the `--cache-map` arguments imply a specific file-and-module-path combination.
In any case, in the above pathological case of multiple `imports`, the best we can do is just pick the first import path and refer to the source file through a single module path (e.g. `-m internal.lib1_internal`).