doc clarification: confusing match behavior for non-existent ASCII character classes #1234

Open
dawnofmidnight opened this issue Oct 27, 2024 · 1 comment

dawnofmidnight commented Oct 27, 2024

Crate version: 1.11.0
Example code: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c4b4cfe18c2e6413444e53315de33b27 (used for snippets below and extra checks)

The behavior of the crate when trying to use the ASCII character class syntax [[:foo:]] with invalid character classes is somewhat confusing. A friend was trying to use [[:XID_Start:]] to check whether _ (underscore/low line) was included in the XID_Start character class (it's not), and was confused when it returned true.

let expr = regex::Regex::new(r"[[:XID_Start:]]").unwrap();
dbg!(expr.is_match("_")); // true

The correct syntax, \p{XID_Start}, does work correctly:

let correct = regex::Regex::new(r"\p{XID_Start}").unwrap();
dbg!(correct.is_match("a")); // true
dbg!(correct.is_match("1")); // false
dbg!(correct.is_match("_")); // false

It seems that when the name is not a valid ASCII character class (see the "ASCII character classes" section of the regex docs), the pattern falls back to matching any character that appears between the brackets:

dbg!(expr.is_match(":")); // true
dbg!(expr.is_match("X")); // true
dbg!(expr.is_match("x")); // false
dbg!(expr.is_match("a")); // true
dbg!(expr.is_match("b")); // false
dbg!(expr.is_match("[")); // false
dbg!(expr.is_match("]")); // false

I'm not entirely sure what regex is actually interpreting this sequence as, but, assuming this is intentional behavior, I think that it might be something that is worth documenting in the aforementioned section on ASCII character classes in the docs, as the behavior is not immediately intuitive.
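One reading that is consistent with every result above: because XID_Start is not one of the ASCII class names, the inner `[:XID_Start:]` is treated as an ordinary nested bracketed class whose members are the literal characters `:`, `X`, `I`, `D`, `_`, `S`, `t`, `a`, `r`. The sketch below models that reading using only the standard library. It does not call into the regex crate, and `ASCII_CLASS_NAMES` and `literal_fallback` are hypothetical names introduced for illustration. It also ignores negated forms such as `[[:^alpha:]]`.

```rust
use std::collections::BTreeSet;

/// The ASCII class names listed in the regex docs.
const ASCII_CLASS_NAMES: [&str; 14] = [
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "word", "xdigit",
];

/// Hypothetical model, not the regex crate's implementation.
/// For a pattern of the form `[[:NAME:]]`, returns `None` when NAME is a
/// recognized ASCII class, and otherwise the set of literal characters
/// found between the inner brackets (assumed fallback behavior).
fn literal_fallback(pattern: &str) -> Option<BTreeSet<char>> {
    let inner = pattern.strip_prefix("[[")?.strip_suffix("]]")?;
    let name = inner.strip_prefix(':')?.strip_suffix(':')?;
    if ASCII_CLASS_NAMES.contains(&name) {
        return None;
    }
    // Every character between the inner brackets becomes a set member,
    // including both colons.
    Some(inner.chars().collect())
}

fn main() {
    let set = literal_fallback("[[:XID_Start:]]")
        .expect("XID_Start is not an ASCII class name");
    // Same probes as the dbg! calls in this issue.
    for c in ['_', ':', 'X', 'x', 'a', 'b', '[', ']'] {
        println!("{c:?} -> {}", set.contains(&c));
    }
}
```

Under this model, `_`, `:`, `X` and `a` are members while `x`, `b`, `[` and `]` are not, which lines up with the `is_match` results reported above.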

@BurntSushi
Member

Yes, the behavior is unfortunate but intentional, for compatibility with how other regex engines work. In retrospect, I would rather have been a bit more strict here and produced errors for unrecognized classes.

I agree that adding a note to the docs about this would be a good idea.

BurntSushi added the doc label on Oct 27, 2024