-
Notifications
You must be signed in to change notification settings - Fork 2.8k
text_dir_codepoint lint update #16452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
r? @weihanglo rustbot has assigned @weihanglo. Use |
|
Feel free to clean up the commits for how they should be reviewed and merged, rather than reflecting the implementation. On top of that, our contrib guide suggests adding tests in a commit before the feature, showing the current behavior. The feature commit would then update the tests to show the new behavior. When reviewing the feature commit, this helps remove all of the boilerplate from the view and see exactly what changed in behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good starting point would be to check the history of rustc as to why they made them separate lints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it’s easier to keep lints together cause we parse cargo in raw format so, but i will check and come back later after new year :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rustc separated the lints because in Rust code, BiDi in string literals might be legitimate (RTL language support) while in comments it's always suspicious. In TOML there's no such distinction - BiDi anywhere in a manifest is suspicious (no legitimate RTL use case for dependency names/versions). Single lint makes sense for Cargo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds reasonable to me. Thanks for the investigation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package.description, package.metadata, and workspace.metadata do up some possibilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For those cases users can set text_direction_codepoint = "warn" or "allow" in [lints.cargo]. Deny-by-default is still appropriate since legitimate RTL in manifests is rare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are likely valid reasons for package.description to have text direction code points, e.g. if they are written in a non-english language that uses a different direction. This becomes even bigger of a deal for *.metadata.
My main concern is actually around performance. I was going to suggest we start this as one lint and if there is enough of a motivating case, we split this into two lints with the current lint becoming a lint group. However, in thinking that through, I realized we can easily implement this lint without much of a performance impact.
The way I see this working is we have one function for checking both lints. This function would do a string search for the different code points in question in the Cargo.toml, collecting their spans. If they are not present, we bail out. If they are present, we use toml_parser to create Events. We then iterate through all of the Events to see if they include the span in question and based on EventKind, we choose which lint to fire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i already started to work on your suggestion, forgot to answer. thanks, i like your idea, will push new method soon
I can restructure commits if needed, but the tests wouldn't show meaningful info because before there was no lint to trigger. sorry btw, for next PR's i would take that info |
"before there was no lint to trigger": that is exactly what we want to see with this. |
k, when i get to my PC in an hour will do it correctly 💯 |
Done, restructured commits |
|
Thanks! |
np, for future PR's - will keep it in mind. thanks for marking this |
|
if anything is need - i’m here 24/7 :3 just wanna merge it and take the next task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might need to fix CI job failures (rustfmt).
src/cargo/core/workspace.rs
Outdated
| // Only check for BiDi codepoints in virtual workspace manifests. | ||
| // For package roots, the lint is run in emit_pkg_lints. | ||
| if matches!(self.root_maybe(), MaybePackage::Virtual(_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this logic into the lint function itself?
In the future we might want to make the lint infra a LintContext, and register lints all at once, instead of the current manual calling functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected it, thanks, i haven't thought about it (my bad)
| .with_stderr_data(str![[r#" | ||
| [CHECKING] foo v0.0.1 ([ROOT]/foo) | ||
| [FINISHED] `dev` profile [unoptimized + debuginfo] target(s) in [ELAPSED]s | ||
| [ERROR] unicode codepoint changing visible direction of text present in manifest: `/u{202E}` (RIGHT-TO-LEFT OVERRIDE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [ERROR] unicode codepoint changing visible direction of text present in manifest: `/u{202E}` (RIGHT-TO-LEFT OVERRIDE) | |
| [ERROR] unicode codepoint changing visible direction of text present in manifest |
Can we instead putting "/u{202E} (RIGHT-TO-LEFT OVERRIDE)" as a label of the annotated span, like what rustc lint displays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, done
| [ERROR] unicode codepoint changing visible direction of text present in manifest: `/u{202E}` (RIGHT-TO-LEFT OVERRIDE) | ||
| --> Cargo.toml:6:18 | ||
| | | ||
| 6 | description = "A �test� package" | ||
| | ^ this invisible unicode codepoint changes text flow direction | ||
| | | ||
| = [NOTE] `cargo::text_direction_codepoint` is set to `deny` by default | ||
| = [HELP] if their presence wasn't intentional, you can remove them | ||
| [ERROR] unicode codepoint changing visible direction of text present in manifest: `/u{202D}` (LEFT-TO-RIGHT OVERRIDE) | ||
| --> Cargo.toml:6:23 | ||
| | | ||
| 6 | description = "A �test� package" | ||
| | ^ this invisible unicode codepoint changes text flow direction | ||
| | | ||
| = [HELP] if their presence wasn't intentional, you can remove them | ||
| [ERROR] encountered 2 errors while running lints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not a blocker)
Do we want to have one error per finding, or one per line, or whathever rustc does if making sense to cargo? I found rustc's approach looks nicer btw
error: unicode codepoint changing visible direction of text present in comment
--> $DIR/unicode-control-codepoints.rs:40:1
|
LL | //"/*� } �if isAdmin� � begin admins only */"
| ^^^^^-^^^-^^^^^^^^^^-^-^^^^^^^^^^^^^^^^^^^^^^
| | | | | |
| | | | | '\u{2066}'
| | | | '\u{2069}'
| | | '\u{2066}'
| | '\u{202e}'
| this comment contains invisible unicode text flow control codepoints
|
= note: these kind of unicode codepoints change the way text flows on applications that support them, but can cause confusion because they change the order of characters on the screen
= help: if their presence wasn't intentional, you can remove them
| 6 | description = "A �test package" | ||
| | ^ this invisible unicode codepoint changes text flow direction | ||
| | | ||
| = [NOTE] `cargo::text_direction_codepoint` is set to `deny` by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rustc has this note, which I found useful explaining why the lint.
= note: these kind of unicode codepoints change the way text flows on applications that support them, but can cause confusion because they change the order of characters on the screen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added too
|
@weihanglo thanks for your suggestions! it was very helpful. small diffs make final work more beautiful and user-friendly. |
|
Nice. Feel free to force-push. The commit history is expected to be what it should be merged and reviewed, not how it was developed. |
|
done, i think , thanks for all the help |
|
@epage hi, please take a look if you have some time, waiting for your opinion :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the question, looks good to me.
| let mut line_map = Vec::new(); | ||
| line_map.push(0); | ||
| for (idx, ch) in contents.char_indices() { | ||
| if ch == '\n' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this match how annotate-snippets detect new lines?
cc @Muscraft
34744de to
209c2cc
Compare
This comment has been minimized.
This comment has been minimized.
Tests for detecting Unicode BiDi control codepoints in Cargo.toml. Currently these pass as there is no lint implementation yet.
|
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
| group = group.element( | ||
| Level::HELP.message("if their presence wasn't intentional, you can remove them"), | ||
| ); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust suggests escaping the codepoints in literals, and we should do the same.
Note: Basic strings are the only strings that support Unicode literal escaping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to provide suggestions for literal strings and bare keys, you can decode the string and then re-encode it using the equivalent basic string type using
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let mut findings_by_line: std::collections::BTreeMap<usize, Vec<_>> = | ||
| std::collections::BTreeMap::new(); | ||
| for finding in disallowed_findings { | ||
| let line_num = line_map | ||
| .iter() | ||
| .rposition(|&line_start| line_start <= finding.byte_idx) | ||
| .map(|i| i + 1) | ||
| .unwrap_or(1); | ||
| findings_by_line | ||
| .entry(line_num) | ||
| .or_insert_with(Vec::new) | ||
| .push(finding); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust groups codepoints by comment/literal, and we should match that. In our case, we would match by toml_parser::Event.
a026552 to
0c21d4f
Compare
|
i think now it's closed all of the suggestions. looks pretty nice to me, thanks |
What does this PR try to resolve?
This PR implements the text_direction_codepoint lint for
Cargo.tomlfiles:Fixes #16373 and #16374
lint detects Unicode BiDi (bidirectional) control codepoints that can be used in "Trojan Source" attacks. These invisible characters can alter the visual display of text without changing its underlying representation, potentially making malicious manifest content appear benign. All like descritped in rustc's text_direction_codepoint_in_comment/literal lint but for Cargo manifests.
Detected codepoints:
U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202C POP DIRECTIONAL FORMATTING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+2069 POP DIRECTIONAL ISOLATE
Default level:
denyHow to test and review this PR?
cargo test -p cargo --test testsuite -- lints::text_direction_codepointManual testing:
Create a manifest with BiDi codepoint (U+202E)
The lint can be configured via
[lints.cargo]: