-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse Starlark files as raw bytes for Bzlmod #24217
base: master
Are you sure you want to change the base?
Conversation
This needs #24010 to get its test to pass. |
@bazel-io fork 8.0.0 |
5b5f9a5
to
31f0271
Compare
Ready for review now. @tjgq FYI, this is the other Unicode-related PR I am planning to get into 8.0.0. If you see anything else that looks incompatible, please let me know. |
Turns out that there is an interesting failure, switching back to draft. |
165f03b
to
a68c20e
Compare
@Wyverald CI should be green now, I just had to fix some issues in the Python test setup code. |
hmm, actually -- does this require the "re-land" PR first? if so, we should hold back on the import. |
I think it does for the test to pass. |
As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain e.g. file system paths.
a68c20e
to
e2a9596
Compare
@Wyverald The reland has happened. |
Do I understand correctly that, while this PR fixes the use case of non-ASCII characters in file paths, it "breaks" the case where non-ASCII characters are passed in e.g. string attrs in tags? |
It actually fixes both: The |
I see!
I just double-checked, and it seems that we try to write files in UTF-8 in mctx/rctx: code It's been broken like this forever, so presumably nobody has been relying on it working... |
Also, presumably this means |
That's being fixed in #24182 :-)
It doesn't, since the Bazel client ultimately outputs raw bytes as well. That's being tested in #24243.
In case it is not too late, I pushed a new test to demonstrate this end-to-end for tags. |
Ahh, thanks for the explanation. It sounds like we're actually just treating Starlark
Ack -- I'll make sure that's imported too |
Yes, that's correct. It's actually not the worst in a world where you increasingly deal with binary data and UTF-8 strings only. |
@bazel-io fork 8.0.0 |
As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain file system paths.
Also includes changes to the Python test setup:
ScratchFile
now always writes files as UTF-8RunProgram
encodes and decodes stdin/stderr/stdout as UTF-8download
no longer leaks a file