-
-
Couldn't load subscription status.
- Fork 530
CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cf02101 to
4f202bb
Compare
|
I have been trying to make a few PRs (pull requests) that could bring Unicode character width support in htop. So you are not the first one that proposed the idea. There are many places in htop codebase that need to be upgraded for Unicode character width support. So your width calculation function would be better not limited to To avoid duplicate effort, maybe you could look at my PR #1642 to see what you can do with the width calculation. |
35279f5 to
8b6f9bc
Compare
|
@Explorer09, thanks a lot for sharing your thoughts and pointing me to your work on related topics. I really appreciate your insights and all the effort you’ve put into this area. For this PR, I’m mainly focused on implementing proper line wrapping at or near the window border. I agree that a broader, unified solution for wide character support, especially an improved control character handling worth looking. However, I think those are best addressed in their own dedicated PRs and discussions, such as what you’re working on in #1642. For now, I’d prefer to keep the scope here limited so we can resolve this specific issue first. Thanks again for your feedback and for helping move things forward! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, having this in one function is still reasonably readable:
static int CommandScreen_scanString(InfoScreen* this, const char* p, size_t total, char* line, bool wide) {
int line_offset = 0;
int line_cols = 0;
int last_spc_offset = -1;
int last_spc_cols = -1;
#ifdef HAVE_LIBNCURSESW
mbstate_t state;
memset(&state, 0, sizeof(state));
#endif
for (size_t i = 0; i < total;) {
assert(line_offset >= 0 && (size_t)line_offset <= total);
unsigned char c = (unsigned char)p[i];
int bytes = 1;
int width = 1;
// Handle character encoding based on mode
if (!wide || c < 0x80) {
line[line_offset] = c;
if (c == ' ') {
last_spc_offset = line_offset;
last_spc_cols = line_cols;
}
}
#ifdef HAVE_LIBNCURSESW
else {
wchar_t wc = 0;
bytes = mbrtowc(&wc, p + i, total - i, &state);
if (bytes != (size_t)-1 && bytes != (size_t)-2) {
width = wcwidth(wc);
width = MAXIMUM(width, 1);
} else {
bytes = 1;
}
assert(line_offset + bytes <= total + 1);
assert(i + bytes <= total);
memcpy(line + line_offset, p + i, bytes);
}
#endif
i += bytes;
line_offset += bytes;
line_cols += width;
// Check if we need to wrap the line
if (line_cols >= COLS) {
int line_size;
if (last_spc_offset >= 0) {
// Wrap at last space
line_size = last_spc_offset;
line_cols -= last_spc_cols;
last_spc_offset = last_spc_cols = -1;
} else if (line_cols > COLS && wide) {
// Current character would exceed COLS, wrap before it
line_size = line_offset - bytes;
line_cols = width;
} else {
// Wrap at current position
line_size = line_offset;
line_cols = 0;
}
line[line_size] = '\0';
InfoScreen_addLine(this, line);
// Move any remaining content to the beginning
line_offset -= line_size;
if (line_offset > 0) {
assert(i >= line_offset);
assert(i <= total);
assert(line_offset <= total + 1);
memcpy(line, p + i - line_offset, line_offset);
}
}
}
return line_offset;
}
static void CommandScreen_scan(InfoScreen* this) {
Panel* panel = this->display;
int idx = Panel_getSelectedIndex(panel);
Panel_prune(panel);
const char* p = Process_getCommand(this->process);
assert(p != NULL);
size_t total = strlen(p);
char line[total + 1];
int line_offset = CommandScreen_scanString(this, p, total, line, CRT_utf8);
assert(line_offset >= 0 && (size_t)line_offset <= total);
if (line_offset > 0) {
line[line_offset] = '\0';
InfoScreen_addLine(this, line);
}
Panel_setSelected(panel, MAXIMUM(idx, 0));
}Note this is mostly a brute-force mashup on your two functions with some restructuring, but unless I missed something this should work as expected.
The else { would for completeness actually be a else if (wide) { but the reasoning for why it's not needed to cover everything, should be easy to see.
A further refinement one could do is making adding a line governed by a callback typedef void (* line_cb_t)(void* ctx, const char* line), with also the final line being handled as part of the string wrapping code.
| const char* p = Process_getCommand(this->process); | ||
| assert(p != NULL); | ||
| size_t total = strlen(p); | ||
| char line[total + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid variable length arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be line[COL + 1], however due to it's possible for a unicode character to have arbitrary combining marks, like diacritics, therefore COL + 1 is not sufficient and would cause segfault. therefore strlen(p) + 1 is probably the only safe option out there.
With that said, I wonder if you mean to use malloc instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strlen(p) is a bad idea as it scans the entire string rather than just stops at the potential line breaking position.
You can reference my code String_mbswidth in #1642 if you need some inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is about the VLA itself. And yes, a malloc is the preferred way.
That way you avoid having any dynamically sized buffers anywhere near the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I just don't like duplicated efforts. I remembered I have done this "calculate string width and length with respect to whitespace line break" before. It was in #1648. I just need to split that out into a dedicated function so that it can be reused in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an infrastructural change needed for htop to fully support Unicode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, maybe we could just wait for String_split to magically appear :P
There is an existing function String_split. I meant more like creating a String_wrap function with similar API.
Jokes aside, I honestly don't have enough spare time right now to implement a String_split that would cover all use cases throughout the entire htop codebase.
If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.
It seems like #1105 ran into similar challenges and ended up stalled, which is unfortunate.
Multibyte support is quite tricky to get right, in particular when dealing with input handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.
Yes, that would be great :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made String_lineBreakWidth() in 2ddc311 that should be able to work with your use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack buffer is still there …
| i += bytes; | ||
| line_offset += bytes; | ||
| line_cols += width; | ||
| if (line_cols < COLS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COLS should ideally be an argument to the string wrap function.
|
@BenBE actually the code had been like that, in the sense that everything was in just one function, it was later split up :-) I just pushed the earlier commits for your reference. |
thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>
thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>
create separate functions for UTF-8/ASCII scanning expand line buffer to handle UTF-8 combining marks To ensure sufficient space to handle UTF-8 code sequences, especially unicode characters with arbitrary combining marks or diacritics, which might take more bytes than the number of visual columns. skip UTF-8 codepoint decoding for ASCII characters add missing include and ifdef guard for CRT_utf8 check mbrtowc before char width calculation adjust the order of include directives thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>
thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>
- add missing include and ifdef guard for CRT_utf8 - adjust the order of include directives - create separate functions for UTF-8/ASCII scanning - check mbrtowc before char width calculation - skip UTF-8 codepoint decoding for ASCII characters - expand line buffer to handle UTF-8 combining marks To ensure sufficient space to handle UTF-8 code sequences, especially unicode characters with arbitrary combining marks or diacritics, which might take more bytes than the number of visual columns. Co-authored-by: BenBE <[email protected]>
In earlier implementation of the command screen, the process command was treated as a sequence of single-byte characters. When there are wide chars, especially UTF-8 byte sequences in the command, it's possible to have part of the multi-byte sequence wrapped to next line.
To fix that we now leverage
mbrtowcandwcwidthto know the byte count and character width for non-ASCII characters, so that we can now wrap near window edge and more importantly at character boundaries.@BenBE Would you mind having a look at this when you get a chance?