multi-byte separator characters get split into their byte-components (unicode, UTF-8) #76

Open
Sorontik opened this issue Nov 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@Sorontik

When you use a multi-byte separator character like "·" (middle dot, U+00B7), it gets split: only the first byte is used as the separator character, while the second byte is treated as part of the following column.

The middle dot (a normal single character in Unicode) is represented in UTF-8 by the 2-byte sequence
\xC2 \xB7
and is displayed correctly in NPP when the language is set to "Normal text".
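
For anyone who wants to verify the byte sequence, here is a minimal Python check. It is only an illustration of the encoding and has nothing to do with how NPP or CSVLint handle it internally.

```python
# U+00B7 MIDDLE DOT: one Unicode code point, two bytes in UTF-8.
middle_dot = "\u00b7"

encoded = middle_dot.encode("utf-8")

print(len(middle_dot))   # 1  (a single character)
print(len(encoded))      # 2  (two bytes once encoded)
print(encoded.hex(" "))  # c2 b7
```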

If you select CSVLint and manually enter the middle dot as the column separator, NPP no longer displays the correct glyph but shows the binary replacement blocks for 'xC2' and 'xB7' instead.
If you look closely, you can see that the first byte \xC2 is displayed with the neutral background used for the separator character, while the second byte \xB7 is displayed in the same color as the following column.

This leads me to believe that CSVLint uses only the first byte as the separator and inserts the coloring 'codes' between the two bytes, breaking them apart so that NPP can no longer display them correctly.
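
The suspected behavior can be simulated outside the plugin. The sketch below is hypothetical Python, not CSVLint code: the two helper functions are made up for illustration, and it assumes the file content is UTF-8. It compares splitting on only the first byte of the separator with splitting on the full byte sequence.

```python
def split_on_first_byte(line: bytes, separator: str) -> list:
    # Hypothetical model of the suspected bug: keep only the first
    # byte of the UTF-8 encoded separator.
    first_byte = separator.encode("utf-8")[:1]
    return line.split(first_byte)


def split_on_full_separator(line: bytes, separator: str) -> list:
    # Use the complete UTF-8 byte sequence of the separator.
    return line.split(separator.encode("utf-8"))


line = "a\u00b7b\u00b7c".encode("utf-8")  # "a·b·c" as UTF-8 bytes

broken = split_on_first_byte(line, "\u00b7")
fixed = split_on_full_separator(line, "\u00b7")

print(broken)  # [b'a', b'\xb7b', b'\xb7c']
print(fixed)   # [b'a', b'b', b'c']

# The stray \xB7 at the start of a column is not valid UTF-8 on its
# own, so it can only be shown as a replacement glyph.
print(broken[1].decode("utf-8", errors="replace"))
```

In the first result the \xB7 byte ends up at the start of the next column, which matches the coloring described above.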

@BdR76
Owner

BdR76 commented Dec 2, 2023

This is similar to issue #66, and you're right: it only takes the first byte, which was a hack to accommodate using characters like ¬ (char 172) as separator. I can reproduce your issue by opening a test file and replacing ; with the · (middle dot) character.

[Screenshot: csv_lint_issue_76]

It has to do with how parameters are passed to the plugin, which is through the ScintillaGateway.SetProperty function. The string parameter holding the separator character is passed this way too, and it gets converted using UTF8.GetBytes.
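
To illustrate why taking the first byte is fragile after a UTF-8 conversion, here is a small Python stand-in for the conversion step. It is not the plugin's code, and reading "char 172" as a single-byte (Latin-1) value is an assumption made for this example. Both ¬ and · fit in one byte in Latin-1, but in UTF-8 each becomes two bytes with the same lead byte.

```python
# Compare the single-byte (Latin-1, assumed) and UTF-8 forms of two separators.
separators = {"not sign": "\u00ac", "middle dot": "\u00b7"}

for name, char in separators.items():
    latin1 = char.encode("latin-1")
    utf8 = char.encode("utf-8")
    # Columns: name, code point, Latin-1 byte, UTF-8 bytes, first UTF-8 byte
    print(name, ord(char), latin1.hex(), utf8.hex(" "), utf8[:1].hex())

# First byte of each separator after UTF-8 encoding.
first_bytes = {char: char.encode("utf-8")[:1] for char in separators.values()}
print(first_bytes["\u00ac"] == first_bytes["\u00b7"])  # True: both are 0xC2
```

Under these assumptions, the first UTF-8 byte alone cannot tell the two separators apart, so a fix would probably need to carry the whole byte sequence.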

At the moment I don't know how to fix this without breaking other things. I'll have to look into this more at a later time.

@BdR76 BdR76 added the bug Something isn't working label Dec 2, 2023