CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726

ryenus · 2025-06-22T14:22:58Z

In earlier implementation of the command screen, the process command was treated as a sequence of single-byte characters. When there are wide chars, especially UTF-8 byte sequences in the command, it's possible to have part of the multi-byte sequence wrapped to next line.

To fix that we now leverage mbrtowc and wcwidth to know the byte count and character width for non-ASCII characters, so that we can now wrap near window edge and more importantly at character boundaries.

@BenBE Would you mind having a look at this when you get a chance?

Explorer09 · 2025-06-22T15:43:59Z

I have been trying to make a few PRs (pull requests) that could bring Unicode character width support in htop. So you are not the first one that proposed the idea.

There are many places in htop codebase that need to be upgraded for Unicode character width support. So your width calculation function would be better not limited to CommandScreen.c use.

To avoid duplicate effort, maybe you could look at my PR #1642 to see what you can do with the width calculation.

CommandScreen.c

ryenus · 2025-06-25T03:21:38Z

@Explorer09, thanks a lot for sharing your thoughts and pointing me to your work on related topics. I really appreciate your insights and all the effort you’ve put into this area.

For this PR, I’m mainly focused on implementing proper line wrapping at or near the window border. I agree that a broader, unified solution for wide character support, especially an improved control character handling worth looking. However, I think those are best addressed in their own dedicated PRs and discussions, such as what you’re working on in #1642. For now, I’d prefer to keep the scope here limited so we can resolve this specific issue first.

Thanks again for your feedback and for helping move things forward!

BenBE

I think, having this in one function is still reasonably readable:

static int CommandScreen_scanString(InfoScreen* this, const char* p, size_t total, char* line, bool wide) {
   int line_offset = 0;
   int line_cols = 0;
   int last_spc_offset = -1;
   int last_spc_cols = -1;

#ifdef HAVE_LIBNCURSESW
   mbstate_t state;
   memset(&state, 0, sizeof(state));
#endif

   for (size_t i = 0; i < total;) {
      assert(line_offset >= 0 && (size_t)line_offset <= total);

      unsigned char c = (unsigned char)p[i];
      int bytes = 1;
      int width = 1;

      // Handle character encoding based on mode
      if (!wide || c < 0x80) {
         line[line_offset] = c;
         if (c == ' ') {
            last_spc_offset = line_offset;
            last_spc_cols = line_cols;
         }
      }
#ifdef HAVE_LIBNCURSESW
      else {
         wchar_t wc = 0;
         bytes = mbrtowc(&wc, p + i, total - i, &state);
         if (bytes != (size_t)-1 && bytes != (size_t)-2) {
            width = wcwidth(wc);
            width = MAXIMUM(width, 1);
         } else {
            bytes = 1;
         }

         assert(line_offset + bytes <= total + 1);
         assert(i + bytes <= total);
         memcpy(line + line_offset, p + i, bytes);
      }
#endif

      i += bytes;
      line_offset += bytes;
      line_cols += width;

      // Check if we need to wrap the line
      if (line_cols >= COLS) {
         int line_size;

         if (last_spc_offset >= 0) {
            // Wrap at last space
            line_size = last_spc_offset;
            line_cols -= last_spc_cols;
            last_spc_offset = last_spc_cols = -1;
         } else if (line_cols > COLS && wide) {
            // Current character would exceed COLS, wrap before it
            line_size = line_offset - bytes;
            line_cols = width;
         } else {
            // Wrap at current position
            line_size = line_offset;
            line_cols = 0;
         }

         line[line_size] = '\0';
         InfoScreen_addLine(this, line);

         // Move any remaining content to the beginning
         line_offset -= line_size;
         if (line_offset > 0) {
            assert(i >= line_offset);
            assert(i <= total);
            assert(line_offset <= total + 1);
            memcpy(line, p + i - line_offset, line_offset);
         }
      }
   }

   return line_offset;
}

static void CommandScreen_scan(InfoScreen* this) {
   Panel* panel = this->display;
   int idx = Panel_getSelectedIndex(panel);
   Panel_prune(panel);

   const char* p = Process_getCommand(this->process);
   assert(p != NULL);
   size_t total = strlen(p);
   char line[total + 1];

   int line_offset = CommandScreen_scanString(this, p, total, line, CRT_utf8);

   assert(line_offset >= 0 && (size_t)line_offset <= total);
   if (line_offset > 0) {
      line[line_offset] = '\0';
      InfoScreen_addLine(this, line);
   }

   Panel_setSelected(panel, MAXIMUM(idx, 0));
}

Note this is mostly a brute-force mashup on your two functions with some restructuring, but unless I missed something this should work as expected.

The else { would for completeness actually be a else if (wide) { but the reasoning for why it's not needed to cover everything, should be easy to see.

A further refinement one could do is making adding a line governed by a callback typedef void (* line_cb_t)(void* ctx, const char* line), with also the final line being handled as part of the string wrapping code.

CommandScreen.c

BenBE · 2025-06-25T09:01:25Z

CommandScreen.c

+   const char* p = Process_getCommand(this->process);
+   assert(p != NULL);
+   size_t total = strlen(p);
+   char line[total + 1];


Avoid variable length arrays.

This used to be line[COL + 1], however due to it's possible for a unicode character to have arbitrary combining marks, like diacritics, therefore COL + 1 is not sufficient and would cause segfault. therefore strlen(p) + 1 is probably the only safe option out there.

With that said, I wonder if you mean to use malloc instead?

The strlen(p) is a bad idea as it scans the entire string rather than just stops at the potential line breaking position.

You can reference my code String_mbswidth in #1642 if you need some inspiration.

The comment is about the VLA itself. And yes, a malloc is the preferred way.

That way you avoid having any dynamically sized buffers anywhere near the stack.

Note: I just don't like duplicated efforts. I remembered I have done this "calculate string width and length with respect to whitespace line break" before. It was in #1648. I just need to split that out into a dedicated function so that it can be reused in this PR.

There's an infrastructural change needed for htop to fully support Unicode.

Haha, maybe we could just wait for String_split to magically appear :P

There is an existing function String_split. I meant more like creating a String_wrap function with similar API.

Jokes aside, I honestly don't have enough spare time right now to implement a String_split that would cover all use cases throughout the entire htop codebase.

If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.

It seems like #1105 ran into similar challenges and ended up stalled, which is unfortunate.

Multibyte support is quite tricky to get right, in particular when dealing with input handling.

If a first version only covers the cases used on CommandScreen.c that's totally fine. @Explorer09 can take up the work from there if you like.

Yes, that would be great :-)

I made String_lineBreakWidth() in 2ddc311 that should be able to work with your use case.

The stack buffer is still there …

BenBE · 2025-06-25T09:03:13Z

CommandScreen.c

+      i += bytes;
+      line_offset += bytes;
+      line_cols += width;
+      if (line_cols < COLS) {


COLS should ideally be an argument to the string wrap function.

ryenus · 2025-06-25T09:42:50Z

@BenBE actually the code had been like that, in the sense that everything was in just one function, it was later split up :-) I just pushed the earlier commits for your reference.

thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

create separate functions for UTF-8/ASCII scanning expand line buffer to handle UTF-8 combining marks To ensure sufficient space to handle UTF-8 code sequences, especially unicode characters with arbitrary combining marks or diacritics, which might take more bytes than the number of visual columns. skip UTF-8 codepoint decoding for ASCII characters add missing include and ifdef guard for CRT_utf8 check mbrtowc before char width calculation adjust the order of include directives thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

- add missing include and ifdef guard for CRT_utf8 - adjust the order of include directives - create separate functions for UTF-8/ASCII scanning - check mbrtowc before char width calculation - skip UTF-8 codepoint decoding for ASCII characters - expand line buffer to handle UTF-8 combining marks To ensure sufficient space to handle UTF-8 code sequences, especially unicode characters with arbitrary combining marks or diacritics, which might take more bytes than the number of visual columns. Co-authored-by: BenBE <[email protected]>

ryenus changed the title ~~Wrap cmd wide~~ CommandScreen: handle multi-byte UTF-8 code sequences while wrapping Jun 22, 2025

ryenus force-pushed the wrap-cmd-wide branch 4 times, most recently from cf02101 to 4f202bb Compare June 22, 2025 15:26

Explorer09 reviewed Jun 22, 2025

View reviewed changes

CommandScreen.c Show resolved Hide resolved

Explorer09 reviewed Jun 22, 2025

View reviewed changes

CommandScreen.c Outdated Show resolved Hide resolved

ryenus force-pushed the wrap-cmd-wide branch from 4f202bb to 8dc7e12 Compare June 23, 2025 12:04

Explorer09 reviewed Jun 23, 2025

View reviewed changes

CommandScreen.c Outdated Show resolved Hide resolved

ryenus force-pushed the wrap-cmd-wide branch 2 times, most recently from 35279f5 to 8b6f9bc Compare June 24, 2025 15:44

BenBE reviewed Jun 25, 2025

View reviewed changes

ryenus force-pushed the wrap-cmd-wide branch from 8b6f9bc to 35279f5 Compare June 25, 2025 09:39

ryenus added a commit to ryenus/htop that referenced this pull request Jun 25, 2025

adjust the order of include directives

6485dcb

thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

ryenus added a commit to ryenus/htop that referenced this pull request Jul 7, 2025

adjust the order of include directives

f902045

thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

ryenus force-pushed the wrap-cmd-wide branch from 6485dcb to f902045 Compare July 7, 2025 00:05

ryenus force-pushed the wrap-cmd-wide branch from f902045 to 9275f27 Compare August 23, 2025 05:36

ryenus added a commit to ryenus/htop that referenced this pull request Aug 23, 2025

adjust the order of include directives

9275f27

thanks to htop-dev#1726 (comment) Co-authored-by: BenBE <[email protected]>

ryenus force-pushed the wrap-cmd-wide branch from 9275f27 to 46c3d55 Compare August 23, 2025 06:26

Uh oh!

Uh oh!

CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726

Are you sure you want to change the base?

CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726

Uh oh!

Conversation

ryenus commented Jun 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Explorer09 commented Jun 22, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryenus commented Jun 25, 2025

Uh oh!

BenBE left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryenus Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Explorer09 Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryenus commented Jun 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ryenus commented Jun 22, 2025 •

edited

Loading

BenBE left a comment •

edited

Loading

ryenus Jun 26, 2025 •

edited

Loading

Explorer09 Jun 28, 2025 •

edited

Loading