Skip to content

Conversation

@ryenus
Copy link
Contributor

@ryenus ryenus commented Jun 22, 2025

In earlier implementation of the command screen, the process command was treated as a sequence of single-byte characters. When there are wide chars, especially UTF-8 byte sequences in the command, it's possible to have part of the multi-byte sequence wrapped to next line.

To fix that we now leverage mbrtowc and wcwidth to know the byte count and character width for non-ASCII characters, so that we can now wrap near window edge and more importantly at character boundaries.

@BenBE Would you mind having a look at this when you get a chance?

@ryenus ryenus changed the title Wrap cmd wide CommandScreen: handle multi-byte UTF-8 code sequences while wrapping Jun 22, 2025
@ryenus ryenus force-pushed the wrap-cmd-wide branch 4 times, most recently from cf02101 to 4f202bb Compare June 22, 2025 15:26
@Explorer09
Copy link
Contributor

I have been trying to make a few PRs (pull requests) that could bring Unicode character width support in htop. So you are not the first one that proposed the idea.

There are many places in htop codebase that need to be upgraded for Unicode character width support. So your width calculation function would be better not limited to CommandScreen.c use.

To avoid duplicate effort, maybe you could look at my PR #1642 to see what you can do with the width calculation.

@ryenus ryenus force-pushed the wrap-cmd-wide branch 2 times, most recently from 35279f5 to 8b6f9bc Compare June 24, 2025 15:44
@ryenus
Copy link
Contributor Author

ryenus commented Jun 25, 2025

@Explorer09, thanks a lot for sharing your thoughts and pointing me to your work on related topics. I really appreciate your insights and all the effort you’ve put into this area.

For this PR, I’m mainly focused on implementing proper line wrapping at or near the window border. I agree that a broader, unified solution for wide character support, especially an improved control character handling worth looking. However, I think those are best addressed in their own dedicated PRs and discussions, such as what you’re working on in #1642. For now, I’d prefer to keep the scope here limited so we can resolve this specific issue first.

Thanks again for your feedback and for helping move things forward!

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, having this in one function is still reasonably readable:

static int CommandScreen_scanString(InfoScreen* this, const char* p, size_t total, char* line, bool wide) {
   int line_offset = 0;
   int line_cols = 0;
   int last_spc_offset = -1;
   int last_spc_cols = -1;

#ifdef HAVE_LIBNCURSESW
   mbstate_t state;
   memset(&state, 0, sizeof(state));
#endif

   for (size_t i = 0; i < total;) {
      assert(line_offset >= 0 && (size_t)line_offset <= total);

      unsigned char c = (unsigned char)p[i];
      int bytes = 1;
      int width = 1;

      // Handle character encoding based on mode
      if (!wide || c < 0x80) {
         line[line_offset] = c;
         if (c == ' ') {
            last_spc_offset = line_offset;
            last_spc_cols = line_cols;
         }
      }
#ifdef HAVE_LIBNCURSESW
      else {
         wchar_t wc = 0;
         bytes = mbrtowc(&wc, p + i, total - i, &state);
         if (bytes != (size_t)-1 && bytes != (size_t)-2) {
            width = wcwidth(wc);
            width = MAXIMUM(width, 1);
         } else {
            bytes = 1;
         }

         assert(line_offset + bytes <= total + 1);
         assert(i + bytes <= total);
         memcpy(line + line_offset, p + i, bytes);
      }
#endif

      i += bytes;
      line_offset += bytes;
      line_cols += width;

      // Check if we need to wrap the line
      if (line_cols >= COLS) {
         int line_size;

         if (last_spc_offset >= 0) {
            // Wrap at last space
            line_size = last_spc_offset;
            line_cols -= last_spc_cols;
            last_spc_offset = last_spc_cols = -1;
         } else if (line_cols > COLS && wide) {
            // Current character would exceed COLS, wrap before it
            line_size = line_offset - bytes;
            line_cols = width;
         } else {
            // Wrap at current position
            line_size = line_offset;
            line_cols = 0;
         }

         line[line_size] = '\0';
         InfoScreen_addLine(this, line);

         // Move any remaining content to the beginning
         line_offset -= line_size;
         if (line_offset > 0) {
            assert(i >= line_offset);
            assert(i <= total);
            assert(line_offset <= total + 1);
            memcpy(line, p + i - line_offset, line_offset);
         }
      }
   }

   return line_offset;
}

static void CommandScreen_scan(InfoScreen* this) {
   Panel* panel = this->display;
   int idx = Panel_getSelectedIndex(panel);
   Panel_prune(panel);

   const char* p = Process_getCommand(this->process);
   assert(p != NULL);
   size_t total = strlen(p);
   char line[total + 1];

   int line_offset = CommandScreen_scanString(this, p, total, line, CRT_utf8);

   assert(line_offset >= 0 && (size_t)line_offset <= total);
   if (line_offset > 0) {
      line[line_offset] = '\0';
      InfoScreen_addLine(this, line);
   }

   Panel_setSelected(panel, MAXIMUM(idx, 0));
}

Note this is mostly a brute-force mashup on your two functions with some restructuring, but unless I missed something this should work as expected.

The else { would for completeness actually be a else if (wide) { but the reasoning for why it's not needed to cover everything, should be easy to see.

A further refinement one could do is making adding a line governed by a callback typedef void (* line_cb_t)(void* ctx, const char* line), with also the final line being handled as part of the string wrapping code.

const char* p = Process_getCommand(this->process);
assert(p != NULL);
size_t total = strlen(p);
char line[total + 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid variable length arrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be line[COL + 1], however due to it's possible for a unicode character to have arbitrary combining marks, like diacritics, therefore COL + 1 is not sufficient and would cause segfault. therefore strlen(p) + 1 is probably the only safe option out there.

With that said, I wonder if you mean to use malloc instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strlen(p) is a bad idea as it scans the entire string rather than just stops at the potential line breaking position.

You can reference my code String_mbswidth in #1642 if you need some inspiration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is about the VLA itself. And yes, a malloc is the preferred way.

That way you avoid having any dynamically sized buffers anywhere near the stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I just don't like duplicated efforts. I remembered I have done this "calculate string width and length with respect to whitespace line break" before. It was in #1648. I just need to split that out into a dedicated function so that it can be reused in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an infrastructural change needed for htop to fully support Unicode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, maybe we could just wait for String_split to magically appear :P

There is an existing function String_split. I meant more like creating a String_wrap function with similar API.

Jokes aside, I honestly don't have enough spare time right now to implement a String_split that would cover all use cases throughout the entire htop codebase.

If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.

It seems like #1105 ran into similar challenges and ended up stalled, which is unfortunate.

Multibyte support is quite tricky to get right, in particular when dealing with input handling.

Copy link
Contributor Author

@ryenus ryenus Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.

Yes, that would be great :-)

Copy link
Contributor

@Explorer09 Explorer09 Jun 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made String_lineBreakWidth() in 2ddc311 that should be able to work with your use case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack buffer is still there …

i += bytes;
line_offset += bytes;
line_cols += width;
if (line_cols < COLS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COLS should ideally be an argument to the string wrap function.

@ryenus
Copy link
Contributor Author

ryenus commented Jun 25, 2025

@BenBE actually the code had been like that, in the sense that everything was in just one function, it was later split up :-) I just pushed the earlier commits for your reference.

ryenus added a commit to ryenus/htop that referenced this pull request Jun 25, 2025
ryenus added a commit to ryenus/htop that referenced this pull request Jul 7, 2025
ryenus added a commit to ryenus/htop that referenced this pull request Jul 27, 2025
create separate functions for UTF-8/ASCII scanning

expand line buffer to handle UTF-8 combining marks

To ensure sufficient space to handle UTF-8 code sequences, especially
unicode characters with arbitrary combining marks or diacritics, which
might take more bytes than the number of visual columns.

skip UTF-8 codepoint decoding for ASCII characters

add missing include and ifdef guard for CRT_utf8

check mbrtowc before char width calculation

adjust the order of include directives

thanks to htop-dev#1726 (comment)

Co-authored-by: BenBE <[email protected]>
ryenus added a commit to ryenus/htop that referenced this pull request Aug 23, 2025
- add missing include and ifdef guard for CRT_utf8
- adjust the order of include directives
- create separate functions for UTF-8/ASCII scanning
- check mbrtowc before char width calculation
- skip UTF-8 codepoint decoding for ASCII characters
- expand line buffer to handle UTF-8 combining marks
  To ensure sufficient space to handle UTF-8 code sequences, especially
  unicode characters with arbitrary combining marks or diacritics, which
  might take more bytes than the number of visual columns.

Co-authored-by: BenBE <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants