Unicode "Grapheme Break" support (user-perceived character breaks, with libunistring) #1603

Explorer09 · 2025-02-08T21:32:56Z

This issue is to discuss a feature that would depend on an external library.

htop often needs to break (truncate) file paths, command lines, and other strings that may contain arbitrary Unicode text.

In the discussions of #462 and #854, about shortening a working directory path for display, I think it became apparent that we need to support the Unicode algorithm for user-perceived character boundaries.

In Unicode, this is defined in UAX 29 section 3:
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
with a normative database:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt

This is not to be confused with the Unicode line breaking algorithm (UAX 14).

What are the "Grapheme Cluster Boundaries" in Unicode and why are they important for htop?

When htop needs to truncate a file path or a command line argument string, it is supposed to remove "characters" from the string until the string fits the given terminal column width. However, Unicode characters (code points) do not always have a one-to-one relationship to the "characters" that a user would perceive as one unit.
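To make the mismatch concrete, here is a minimal sketch that counts code points in a UTF-8 string. The function name utf8_count_code_points is made up for this illustration and is not part of htop; the sketch also assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (10xxxxxx). Assumes the input is valid UTF-8. */
static size_t utf8_count_code_points(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, "e" followed by U+0301 (COMBINING ACUTE ACCENT), written in C as "e\xCC\x81", counts as two code points although a user sees a single accented letter. Likewise one flag emoji is two regional indicator code points.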

Special cases can include:

Combining character sequences
Character sequences with ZWJ (Zero Width Joiner, U+200D), commonly found in emoji sequences.
Regional indicator symbols (a.k.a. flag emojis). These regional indicator characters come in pairs, and a break cannot happen within a pair.
Languages that use a syllable as the minimum grapheme unit. This is the least of htop's concerns, but I mention it for completeness.

I don't think people would like a series of United Kingdom flag emojis (🇬🇧🇬🇧🇬🇧🇬🇧) suddenly transforming into Bulgaria flag emojis plus a letter B (🇧🇬🇧🇬🇧🇬🇧) when characters are cut at the wrong location. This is just an example.
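The flag case can be sketched without any database, since the regional indicator range (U+1F1E6 to U+1F1FF) is fixed. The following is only an illustration of the pairing rule, not the full UAX 29 algorithm; the names is_regional_indicator and ri_safe_cut are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF. */
static bool is_regional_indicator(uint32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Hypothetical helper: given a proposed cut position in cps[0..len),
 * return the nearest position <= cut that does not split a flag pair.
 * Only the regional indicator pairing is handled here; every other
 * grapheme cluster rule is ignored. */
static size_t ri_safe_cut(const uint32_t *cps, size_t len, size_t cut) {
    assert(cps != NULL && cut <= len);
    if (cut == 0 || cut == len || !is_regional_indicator(cps[cut]))
        return cut;
    /* Count the regional indicators immediately before the cut. */
    size_t run = 0;
    while (run < cut && is_regional_indicator(cps[cut - 1 - run]))
        run++;
    /* An odd run means the cut sits in the middle of a pair. */
    return (run % 2 == 1) ? cut - 1 : cut;
}
```

With four United Kingdom flags stored as eight code points (G, B, G, B, ...), a proposed cut at index 3 is moved back to index 2, while a cut at index 4 is left alone.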

The grapheme cluster boundary algorithm requires a database, and I don't want to embed the whole database into htop. What I am suggesting is to incorporate libunistring, a library from GNU that has the database we need, specifically its unigbrk.h APIs.

The license for libunistring is "LGPLv3+ or GPLv2+" (dual-licensed), so it is compatible with htop's license (GPLv2+).


BenBE · 2025-02-08T21:41:37Z

Is using the library doable in a way that makes runtime binding to this library feasible?
Or would this have to be a compile-time configure option?

Explorer09 · 2025-02-08T21:55:04Z

@BenBE It's likely a configure option, with the possibility of statically linking it. The issue I see is that the library's soname version is not as stable as I would expect. Example:

https://packages.debian.org/bookworm/libunistring2
https://packages.debian.org/trixie/libunistring5

BenBE · 2025-02-08T22:56:31Z

How stable is its API? Or will we be chasing API change after API change for this lib?

Explorer09 · 2025-02-08T23:06:16Z

@BenBE The API I am considering right now is uc_graphemeclusterbreak_property(), that is, simply retrieving the "Grapheme_Cluster_Break" property from the database. The other functions can't work directly with htop's RichString, so I would likely need to build wrappers for them. (They can work with Unicode strings stored in uint8_t, uint16_t or uint32_t arrays, but they do not support the custom iterator that I need for RichString.)
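A rough sketch of what an iterator-based wrapper could look like follows. Everything in it is hypothetical: CodePointIter, PairBreakFn, count_clusters and the array-backed demo iterator are made up for illustration and say nothing about how RichString is actually laid out. The predicate demo_is_break is a stand-in that only knows about U+0300 to U+036F combining marks; a real predicate would presumably be built on the property returned by uc_graphemeclusterbreak_property(). Note that a purely pairwise predicate cannot express the regional indicator pairing, which needs extra state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ITER_END UINT32_MAX /* sentinel returned at end of input */

/* Hypothetical iterator: yields one code point per call. */
typedef struct {
    uint32_t (*next)(void *state);
    void *state;
} CodePointIter;

/* Hypothetical predicate: is there a break between prev and cur? */
typedef bool (*PairBreakFn)(uint32_t prev, uint32_t cur);

/* Count clusters by asking the predicate at each adjacent pair. */
static size_t count_clusters(CodePointIter it, PairBreakFn is_break) {
    assert(it.next != NULL && is_break != NULL);
    size_t clusters = 0;
    uint32_t prev = ITER_END;
    for (uint32_t cur = it.next(it.state); cur != ITER_END;
         cur = it.next(it.state)) {
        if (prev == ITER_END || is_break(prev, cur))
            clusters++;
        prev = cur;
    }
    return clusters;
}

/* Demo iterator over a plain array of code points. */
typedef struct {
    const uint32_t *cps;
    size_t len;
    size_t pos;
} ArrayState;

static uint32_t array_next(void *state) {
    ArrayState *a = (ArrayState *)state;
    return a->pos < a->len ? a->cps[a->pos++] : ITER_END;
}

/* Stand-in predicate: never break before U+0300..U+036F. */
static bool demo_is_break(uint32_t prev, uint32_t cur) {
    (void)prev;
    return !(cur >= 0x0300 && cur <= 0x036F);
}
```

Feeding the sequence 'e', U+0301, 'a' through count_clusters() with the demo predicate gives two clusters, because the combining accent stays attached to the 'e'.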

rubyFeedback · 2025-02-19T23:35:18Z

Compiling libunistring (https://ftp.gnu.org/gnu/libunistring/libunistring-1.3.tar.xz) works fine for me, never had an issue, so this option would not affect me negatively (I think).

I do, however, have one question:

Would it be possible to retain htop's old behaviour, e.g. via --disable-libunistring at configure time? Being flexible may be useful here, in case a user does not want or need libunistring or wishes to retain the old behaviour of htop. In other words, would libunistring be optional or become mandatory? (As said, I don't mind either way, but I think it may be useful to declare this up front.)

Explorer09 · 2025-02-20T06:19:21Z

@rubyFeedback I'm personally also reluctant to introduce an additional dependency to htop, so yes, I expect the dependency can be turned off with a configure option as you said.
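On the C side, an optional dependency could be guarded roughly like this. This is only a sketch: the macro HAVE_LIBUNISTRING and the function next_boundary are hypothetical names, and the fallback branch simply steps over one UTF-8 code point, which is my stand-in for "the old behaviour" rather than a description of what htop does today. The library branch assumes u8_grapheme_next() from unigbrk.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef HAVE_LIBUNISTRING /* hypothetical macro set by configure */
#include <unigbrk.h>
#endif

/* Hypothetical helper: return the next position in [s, end) at which
 * the string may be cut. */
static const uint8_t *next_boundary(const uint8_t *s, const uint8_t *end) {
    assert(s != NULL && s <= end);
    if (s == end)
        return end;
#ifdef HAVE_LIBUNISTRING
    /* With the library: advance over one whole grapheme cluster. */
    return u8_grapheme_next(s, end);
#else
    /* Without it: advance over one code point, i.e. skip the lead byte
     * and any continuation bytes (10xxxxxx) that follow. */
    s++;
    while (s < end && (*s & 0xC0) == 0x80)
        s++;
    return s;
#endif
}
```

Built without the macro, the bytes 0x65 0xCC 0x81 ("e" plus a combining acute accent) yield a boundary after the first byte, so the accent can still be separated from its base letter. That is the trade-off of disabling the library.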

Currently, I'm trying other libraries with similar functionality, for example libgrapheme, as I discovered some issues with libunistring that look like bugs. I'll post a summary of the issues I found in a later post.

BenBE added the labels enhancement, build system 🔧, and dependencies on Feb 8, 2025
