Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse Starlark files as raw bytes for Bzlmod #24217

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Nov 5, 2024

As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain file system paths.

Also includes changes to the Python test setup:

  • ScratchFile now always writes files as UTF-8
  • RunProgram encodes and decodes stdin/stderr/stdout as UTF-8
  • download no longer leaks a file

@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Nov 5, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 5, 2024

This needs #24010 to get its test to pass.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 5, 2024

@bazel-io fork 8.0.0

@fmeum fmeum marked this pull request as draft November 7, 2024 16:26
@fmeum fmeum marked this pull request as ready for review November 7, 2024 18:22
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

Ready for review now.

@tjgq FYI, this is the other Unicode-related PR I am planning to get into 8.0.0. If you see anything else that looks incompatible, please let me know.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

Turns out that there is an interesting failure, switching back to draft.

@fmeum fmeum marked this pull request as draft November 7, 2024 19:01
@fmeum fmeum force-pushed the 23859-unicode-bzlmod branch 8 times, most recently from 165f03b to a68c20e Compare November 7, 2024 22:46
@fmeum fmeum marked this pull request as ready for review November 7, 2024 22:47
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

@Wyverald CI should be green now, I just had to fix some issues in the Python test setup code.

@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Nov 13, 2024
@Wyverald
Copy link
Member

hmm, actually -- does this require the "re-land" PR first? if so, we should hold back on the import.

@Wyverald Wyverald added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally labels Nov 13, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 13, 2024

hmm, actually -- does this require the "re-land" PR first? if so, we should hold back on the import.

I think it does for the test to pass.

As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain e.g. file system paths.
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 14, 2024

@Wyverald The reland has happened.

@fmeum fmeum requested a review from Wyverald November 14, 2024 18:10
@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Nov 15, 2024
@Wyverald
Copy link
Member

Do I understand correctly that, while this PR fixes the use case of non-ASCII characters in file paths, it "breaks" the case where non-ASCII characters are passed in e.g. string attrs in tags?

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

Do I understand correctly that, while this PR fixes the use case of non-ASCII characters in file paths, it "breaks" the case where non-ASCII characters are passed in e.g. string attrs in tags?

It actually fixes both: The .bzl files containing the module extension source are read as raw bytes, module and repo context functions read and write raw bytes from files, ... With this change, strings attrs in tags will also line up with e.g. string literals in extension code.

@Wyverald
Copy link
Member

I see!

module and repo context functions read and write raw bytes from files

I just double-checked, and it seems that we try to write files in UTF-8 in mctx/rctx: code

It's been broken like this forever, so presumably nobody has been relying on it working...

@Wyverald
Copy link
Member

Also, presumably this means print(some_tag.some_attr) could yield gibberish, but that's nothing new. (To be clear, the import is already under way -- this is just me trying to understand the problem better)

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

I just double-checked, and it seems that we try to write files in UTF-8 in mctx/rctx: code
It's been broken like this forever, so presumably nobody has been relying on it working...

That's being fixed in #24182 :-)

Also, presumably this means print(some_tag.some_attr) could yield gibberish, but that's nothing new.

It doesn't, since the Bazel client ultimately outputs raw bytes as well. That's being tested in #24243.

To be clear, the import is already under way -- this is just me trying to understand the problem better

In case it is not too late, I pushed a new test to demonstrate this end-to-end for tags.

@Wyverald
Copy link
Member

Wyverald commented Nov 15, 2024

Ahh, thanks for the explanation. It sounds like we're actually just treating Starlark strs as byte arrays everywhere now, right?

In case it is not too late, I pushed a new test to demonstrate this end-to-end for tags.

Ack -- I'll make sure that's imported too

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

Ahh, thanks for the explanation. It sounds like we're actually just treating Starlark strs as byte arrays everywhere now, right?

Yes, that's correct. It's actually not the worst in a world where you increasingly deal with binary data and UTF-8 strings only.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

@bazel-io fork 8.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants