-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing syntax behavior for single quote to allow binary and hex literal separator #3310
base: master
Are you sure you want to change the base?
Changing syntax behavior for single quote to allow binary and hex literal separator #3310
Conversation
Updated skip regex. It was skipping in some cases where it shouldn't, now it won't. char testGoodChar = 'a'
char testBadChar = 'aba'
char testGoodInvis = '\t'
char testBadInvis = '\tn'
int testBinary = 0b0000'0000;
int testHex = 0x0AB0'aA20;
int testNormal = 2;
//Good char if statement
if(testChar[i] == '.')
//Good invisible char if statement
if(testChar[i] == '\t')
//Bad invisible char if statement
if(testChar[i] == '\tn')
//Bad char if statement
if(testChar[i] == 'sfs')
//Good hex If statement
if(testHex[i] == 0x1234'1234)
//Good binary If statement
if(testBin[i] == 0b0000'1111) |
6cbfe1c
to
9ed60c8
Compare
runtime/syntax/c.yaml
Outdated
- constant.number: "(\\b([1-9][0-9]*|0[0-7]*|0[Xx][0-9A-Fa-f]+|0[Bb][01]+)([Uu][Ll]?[Ll]?|[Ll][Ll]?[Uu]?)?\\b)" | ||
# Decimal Floating Constants | ||
- constant.number: "(\\b(([0-9]*[.][0-9]+|[0-9]+[.][0-9]*)([Ee][+-]?[0-9]+)?|[0-9]+[Ee][+-]?[0-9]+)[FfLl]?\\b)" | ||
- constant.number: "(\\b([1-9][0-9']*|0[0-7']*|0[Xx][0-9a-fA-F']+|0[Bb][01']+)([Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?)?\\b)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the switch of the order in the list:
There are still places in the file where it's written the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change from A-Fa-f
to a-fA-F
is just to make it "consistent" with cpp.yaml
. And I would do it the other way around, i.e. fix cpp.yaml
, since A-Fa-f
seems to be more conventional. On the other hand, it's not so important.
What is more important is the change from [Uu][Ll]?[Ll]?|[Ll][Ll]?[Uu]?
to [Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?
which is, again, just for consistency with cpp.yaml
. And I believe the cpp.yaml
version is actually incorrect: it doesn't highlight numbers with U
suffix only, without L
, e.g. 12345U
. So it should be, again, the other way around: cpp.yaml
should be fixed.
Also, both these changes are not related to single quote literals, and thus should be done in a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted the accidental changes for the order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmaluka
Yeah I was just copying from cpp.yaml cuz I thought these lines are the same (they looked similar) so I just copied and pasted.
I can fix whichever is wrong in this or another PR if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, as I said, I think [Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?
in cpp.yaml
is wrong and should be changed to the c.yaml
version. No need for a separate PR for that, it can be done in this PR, but better in a separate commit (within this PR), to keep the commit history comprehensible.
(I also mean that you'd better clean up the commit history that is already present in this PR: instead of 3 commits each of which fixes mistakes of previous commits which didn't exist before this PR and thus are not interesting to anyone, the commit history should rather consist of clear logical changes, i.e. one commit fixes single-quote literals, another commit fixes the issue with U
suffix in cpp.yaml
, and so on.)
...And about single quote literals: I see this just copies support for them from cpp.yaml
, which in itself makes sense (assuming that they have the same syntax in both C++14 and C23). But I think it is incorrect in several ways, so it's a good moment to fix it (in both cpp.yaml
and here). At least:
- A number cannot end with
'
, it must have at least one digit at the end. - In hex, oct and bin numbers,
'
cannot be right after0x
,0
, or0b
, there must be at least one digit before it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will apply the ones in c.yaml
to cpp.yaml
and also see if I can get 1 and 2 working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I am not sure what to do with floating points since the part before or after the decimal are optional (i.e. 1.f
or .1f
and hex/bin equivalent). Enforcing the rules are going to make the already long regex be longer I think. I am inclined to just leave it as it is.
As for non floating points, I can enforce 1 of them but not both in one regex. If I do both, the closest I can get is this 0[Bb][01][01']*[01]
which is wrong since 0b0
won't work.
I will have to span each representation into multiple lilnes if I have to enforce these 2 rules. Unless you have another way of doing it. I am not the best at regex tbh 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have to span each representation into multiple lilnes if I have to enforce these 2 rules. Unless you have another way of doing it. I am not the best at regex tbh 😅
As you can see, you can specify multiple regexes for constant.number
instead of one very long regex. There are already 3 of them, you can split them further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have broken them down for each scenario. It should work now.
e6ed23d
to
09d19cc
Compare
runtime/syntax/cpp.yaml
Outdated
@@ -47,9 +47,9 @@ rules: | |||
- constant.string: | |||
start: "'" | |||
end: "'" | |||
skip: "\\\\." | |||
skip: "(\\\\.)|(.*[^']$)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost doesn't make sense to me, and only works if the number is at the end of a line. Why $
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will reorder the commit. That's outdated change
runtime/syntax/cpp.yaml
Outdated
@@ -47,9 +47,9 @@ rules: | |||
- constant.string: | |||
start: "'" | |||
end: "'" | |||
skip: "(\\\\.)|(.*[^']$)" | |||
skip: "(\\\\.)|([^']{4}'[^']{4})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...And this doesn't make sense to me at all, and doesn't work at all. Even a very basic case like 123'45
is not highlighted as a number.
Instead of these hacks, it seems that the skip
regex should contain the actual regex for a number containing single quotes (to ensure that it, well, actually skips all such numbers). I've tried this:
skip: "(\\\\.)|(\\b[1-9][0-9]*'[0-9']*[0-9]\\b)"
and it seems to work fine, although only for decimal integers without U
and L
suffixes. So also need to add the rest here: hexadecimals, floating point, U
and L
suffixes and so on. Yes, the resulting regex is gonna look horrifying, but it seems there is no other way to make it actually work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the test case I am using atm.
char testGoodChar = 'a'
char testBadChar = 'aba'
char testGoodInvis = '\t'
char testBadInvis = '\tn'
int testBinary = 0b0000'0000;
int testHex = 0x0AB0'aA20;
int testNormal = 2;
//Good char if statement
if(testChar[i] == '.')
//Good invisible char if statement
if(testChar[i] == '\t')
//Bad invisible char if statement
if(testChar[i] == '\tn' && 'fs')
const std::unordered_set<char> CharTokens = {'(', ')', ';', '{', '}', '/', '#', ',', '=', '<', '>'};
int test = 123'45;
//Bad char if statement
if(testChar[i] == '.sf')
//Good hex If statement
if(testHex[i] == 0x1234'1234)
//Good binary If statement
if(testBin[i] == 0b0000'1111)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what exactly do you mean by "leaking", but I already see that my change doesn't work correctly in such a basic case:
123'45 'j'
'j'
is highlighted as error, while it is obviously a valid character literal.
But, I've tried wit @JoeKar 's PR #3127 and my change works correctly with it.
So it is yet another misbehavior of micro's highlighting of syntax regions, which should be fixed by #3127 once we merge it (after I find some time to properly review it, which I'm constantly struggling to).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the screenshot I replied with, line -21 (up 21) and the line my cursor is on (the one with 26) is highlighting characters to the next line until it meets the next '
. Yellow text is the character highlighting.
That screenshot is using your change but this behavior exists even for current master.
This is the main thing I want to address in this PR mainly since working on a codebase with '
as a separator for binary or hex is very inconvenient because it will highlight anything until it hits the next '
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is how your example looks with the newest version of your PR:
And this is how it looks with #3127 + the following change:
--- a/runtime/syntax/cpp.yaml
+++ b/runtime/syntax/cpp.yaml
@@ -30,7 +30,7 @@ rules:
# Parenthetical Color
- symbol.brackets: "[(){}]|\\[|\\]"
# Integer Literals
- - constant.number: "(\\b([1-9][0-9']*|0[0-7']*|0[Xx][0-9a-fA-F']+|0[Bb][01]+)([Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?)?\\b)"
+ - constant.number: "(\\b([1-9][0-9']*|0[0-7']*|0[Xx][0-9a-fA-F']+|0[Bb][01']+)([Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?)?\\b)"
# Decimal Floating-point Literals
- constant.number: "(\\b(([0-9']*[.][0-9']+|[0-9']+[.][0-9']*)([Ee][+-]?[0-9']+)?|[0-9']+[Ee][+-]?[0-9']+)[FfLl]?\\b)"
# Hexadecimal Floating-point Literals
@@ -47,7 +47,7 @@ rules:
- constant.string:
start: "'"
end: "'"
- skip: "\\\\."
+ skip: "(\\\\.)|(\\b[1-9][0-9]*'[0-9']*[0-9]|0[0-7]+'[0-7']*[0-7]|0[Xx][0-9A-Fa-f]+'[0-9A-Fa-f']*[0-9A-Fa-f]|0[Bb][01]+'[01']*[01]\\b)"
rules:
- error: "..+"
- constant.specialChar: "\\\\([\"'abfnrtv\\\\]|[0-3]?[0-7]{1,2}|x[0-9A-Fa-f]{1,2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that indeed works 👍
I have modified the changes to that for skip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] after I find some time to properly review it [...]
And I fear that moment, when you finish the review. 🥲
902c692
to
d5ee3f1
Compare
rules: | ||
- error: "..+" | ||
- error: "[[:graph:]]{2,}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, this is buggy: 'ab'
is no longer highlighted as an error. The point of the original "..+"
regex is that any character literal containing more than one character is an error (unless it is an escaped character like '\n'
, which is covered by the skip
regex).
Second, what is this for anyway? The skip
regex should already ensure that numbers with single-quotes are skipped, so we should not need to mess with this error
regex.
Actually I get what this is for: it is a nasty workaround for the bug in micro causing the skip
regex being insufficient in some cases (e.g. 123'45 'a'
). As I said previously, this bug seems to be fixed in #3127. I've checked that with #3127 + your PR + this error
regex changed back to "..+"
it works correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmaluka
Could you confirm this again? - error: "[[:graph:]]{2,}'"
is highlighting 'ab'
for me.
The whole point of this change is because "..+"
is not working properly with
const std::unordered_set<char> CharTokens = {'(', ')', ';', '{', '}', '/', '#', ',', '=', '<', '>'};
I have not tried the change in #3127 but if that indeed fixes, I don't mind going back to "..+"
But that means this PR would need to be merged after it though otherwise the line I mentioned above will be highlighted as error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've double-checked: with your PR as is, 'ab'
is correctly highlighted as error. But with your PR applied on top of #3127 it is not. And then, after I replace your error: "[[:graph:]]{2,}'"
with the original error: "..+"
, it is correctly highlighted as error again.
I think this "[[:graph:]]{2,}'"
is obviously incorrect (the error regex should not include the single quote, since the single quote is not a part of the string that is highlighted as an error), and happens to somehow work correctly just because, as I said, there is a bug in micro.
runtime/syntax/cpp.yaml
Outdated
@@ -47,9 +47,9 @@ rules: | |||
- constant.string: | |||
start: "'" | |||
end: "'" | |||
skip: "\\\\." | |||
skip: "(\\\\.)|(\\b[1-9][0-9]*'[0-9']*[0-9]|0[0-7]+'[0-7']*[0-7]|0[Xx][0-9A-Fa-f]+'[0-9A-Fa-f']*[0-9A-Fa-f]|0[Bb][01]+'[01']*[01]\\b)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean this to be the final implementation. This is still incomplete: it doesn't include U
and L
cases, floating point etc. (Yes, we need all that here.)
Also I've realized this is slightly incorrect (apologies for that): need to add parentheses between \b
, i.e.: (\\\\.)|(\\b(...)\\b)
to ensure that it has word boundaries in all cases (e.g. that it won't skip 0xAA'BBQQ
, since 0xAA'BB
is not separated from QQ
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I've just realized that in this skip
regex we probably don't need to require the skipped number to contain a single quote, it can be a normal number literal without single quotes, and the result will be the same. IOW the patterns here can be the same as in the constant.number
's regexes, e.g. [1-9][0-9']*[0-9]
instead of [1-9][0-9]*'[0-9']*[0-9]
. The regex then will be simpler (and will work slightly faster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted this skip to be close to what you are saying. Unlike constant.number
, this just needs to skip anything with '
in between numbers and don't need to care about decimal. It should work fine now.
runtime/syntax/cpp.yaml
Outdated
- constant.number: "(\\b([1-9][0-9']*|0[0-7']*|0[Xx][0-9a-fA-F']+|0[Bb][01]+)([Uu]?[Ll][Ll]?|[Ll][Ll]?[Uu]?)?\\b)" | ||
- constant.number: "(\\b([0-9]|0[0-7]|0[Xx][0-9A-Fa-f]|0[Bb][01])([Uu][Ll]?[Ll]?|[Ll][Ll]?[Uu]?)?\\b)" | ||
- constant.number: "(\\b([1-9][0-9']*[0-9]|0[0-7][0-7']*[0-7])([Uu][Ll]?[Ll]?|[Ll][Ll]?[Uu]?)?\\b)" | ||
- constant.number: "(\\b(0[Xx][0-9A-Fa-f][0-9A-Fa-f']*[0-9A-Fa-f]|0[Bb][01][01']*[01])([Uu][Ll]?[Ll]?|[Ll][Ll]?[Uu]?)?\\b)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this split looks a bit messy. I'd instead split it into 4 regexes for the 4 clearly separated cases: decimals, octals, hexadecimals and binaries. E.g. for decimals: (\\b([1-9]|[1-9][0-9']*[0-9])\\b)
and for the rest similarly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitted
runtime/syntax/cpp.yaml
Outdated
# Decimal Floating-point Literals | ||
- constant.number: "(\\b(([0-9']*[.][0-9']+|[0-9']+[.][0-9']*)([Ee][+-]?[0-9']+)?|[0-9']+[Ee][+-]?[0-9']+)[FfLl]?\\b)" | ||
- constant.number: "(\\b(([.][0-9']*[0-9]|[0-9][0-9']*[.]|[0-9][0-9']*[.][0-9']*[0-9])([Ee][+-]?[0-9']+)?|[0-9']+[Ee][+-]?[0-9']+)[FfLl]?\\b)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this still allows invalid literals like 0'.5
and 0.'5
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
I'm thinking that maybe we shouldn't add these changes to So since there is no practical need to add this for C, maybe we should leave C alone, and just let C++ programmers suffer the performance penalty. |
8b62263
to
332e515
Compare
I have just updated the branch to split into all the categories and scenarios for both interger literal and decimal literal. I have only updated for @dmaluka However, I disagree on
Having consistency between these 2 will make it slightly easier to maintain. It is relatively rare to open/edit a file as large as ~40000 lines. Of course, I would not want to have a performance impact but I feel like if we are opening a large file like that, we should only be syntax highlighting a portion of a file (with some clever method) or even no syntax highlighting at all. This is the updated test case btw char testGoodChar = 'a'
char testBadChar = 'ab'
char testBadChar = 'aba'
char testGoodInvis = '\t'
char testBadInvis = '\tn'
int testBinary = 0b0000'0000;
int testHex = 0x0AB0'aA20;
int testNormal = 2;
//Good char if statement
if(testChar[i] == '.')
//Good invisible char if statement
if(testChar[i] == '\t')
//Bad invisible char if statement
if(testChar[i] == '\tn' && 'fs')
const std::unordered_set<char> CharTokens = {'(', ')', ';', '{', '}', '/', '#', ',', '=', '<', '>'};
int test = 123'45;
//Bad char if statement
if(testChar[i] == '.sf')
//Good hex If statement
if(testHex[i] == 0x1234'1234)
float testFloat = 0x1A.0x1A'1B97;
//Good binary If statement
if(testBin[i] == 0b0000'1111)
int d = 42;
int o = 052'05;
int x = 0x2a;
int X = 0X2A;
int b = 0b101010; // C++14
unsigned long long l1 = 18446744073709550592ull; // C++11
unsigned long long l2 = 18'446'744'073'709'550'592llu; // C++14
unsigned long long l3 = 1844'6744'0737'0955'0592uLL; // C++14
unsigned long long l4 = 184467'440737'0'95505'92LLU; // C++14
456'456.
.13248'4456
0x.456'45
58.
4e2
123.456e-67
123.456e-67f
.1E4f
0x10.1p0
0x1p5
0x1e5
3.14'15'92
1.18e-4932l
3.4028234e38f
3.4028234e38
3.4028234e38l After this PR, I probably don't want to touch regex for a while 😂 |
If a C++ header is highlighted using Micro uses a signature to try to detect when it should highlight
Vim does that. As a result, it often highlights incorrectly, since it doesn't know the correct syntax state before the displayed portion of the file in the general case. |
Fair enough, I didn't know that. I thought it just uses file extension to determine it. I can add a few more c++ keywords to the signature to make it more robust.
I still think we shouldn't do regex on the whole file if it is too large. Something like separating file by continuous empty newlines might work but that's for the future. Anyway, I didn't know about the signature part so you are right, it is not a bad idea to just apply this change to I will revert the changes on |
I hope we will merge #3127 soon. If not, we should at least add a TODO comment to not forget to remove this |
@dmaluka
It doesn't work with const std::unordered_set<char> CharTokens = {'(', ')', ';', '{', '}', '/', '#', ',', '=', '<', '>'}; I have no idea why |
70e8826
to
95dcbc0
Compare
@@ -42,7 +42,8 @@ rules: | |||
end: "'" | |||
skip: "\\\\." | |||
rules: | |||
- error: "..+" | |||
# TODO: Revert back to - error: "..+" once #3127 is merged | |||
- error: "[[:graph:]]{2,}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need it in c.yaml
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work with something like
const char charArray[] = {'(', ')', ';', '{', '}', '/', '#', ',', '=', '<', '>'};
right now
runtime/syntax/cpp.yaml
Outdated
@@ -2,7 +2,7 @@ filetype: c++ | |||
|
|||
detect: | |||
filename: "(\\.c(c|pp|xx)$|\\.h(h|pp|xx)?$|\\.ii?$|\\.(def)$)" | |||
signature: "namespace|template|public|protected|private" | |||
signature: "namespace|class|public|protected|private|template|constexpr|noexcept|nullptr|try|throw" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in a separate commit.
Also "try" might be too common word.
Also I believe all this should be guarded with \b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to separate commit
95dcbc0
to
7b50629
Compare
Modifying the syntax highlighting behavior for binary and hex literal single quote separator which is available from C++14 and C23
My test case
Currently:
Note that Red is highlighted as error from the syntax highlighting
New Change: