Unicode "Grapheme Break" support (user-perceived character breaks, with libunistring) #1603

Open
Explorer09 opened this issue Feb 8, 2025 · 6 comments
Labels
build system 🔧: Affects the build system rather than the user experience · dependencies: Pull requests that update a dependency file · enhancement: Extension or improvement to existing feature

Comments

@Explorer09
Contributor

This issue is to discuss a feature that would depend on an external library.

htop often needs to break (truncate) file paths, command lines, and other strings that may contain arbitrary Unicode text.

In the discussions of #462 and #854, about shortening a working-directory path for display, it became apparent to me that we need to support the Unicode algorithm for user-perceived character boundaries.

In Unicode, this is defined in UAX 29 section 3:
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
with a normative database:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt

This is not to be confused with the Unicode line breaking algorithm (UAX 14).

What are the "Grapheme Cluster Boundaries" in Unicode and why are they important for htop?

When htop needs to truncate a file path or a command line argument string, it is supposed to remove "characters" from the string until the string fits the given terminal column width. However, Unicode characters (code points) do not always map one-to-one to the "characters" that a user perceives as one unit.

Special cases can include:

  • Combining character sequences
  • Character sequences with ZWJ (Zero Width Joiner, U+200D), commonly found in emoji sequences.
  • Regional indicator symbols (which pair up to form flag emojis). These regional indicator characters should come in pairs, and a break must not happen within a pair.
  • Languages that use a syllable as the minimum grapheme unit. This is the least of htop's concerns, but I mention it for completeness.

I don't think people would like a series of United Kingdom flag emojis (🇬🇧🇬🇧🇬🇧🇬🇧) suddenly transform into Bulgaria flag emojis plus a letter B (🇧🇬🇧🇬🇧🇬🇧) when characters are cut at a wrong location. This is just an example.
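The flag case can be illustrated with a minimal sketch that handles only regional-indicator pairs, not the full UAX 29 rule set. The names `decode_utf8`, `is_regional_indicator` and `ri_safe_prefix` are hypothetical and do not exist in htop; the input is assumed to be well-formed UTF-8, and the budget is counted in code points rather than terminal columns to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s (len bytes available). Returns the number
 * of bytes consumed. Assumes well-formed input; real code must validate. */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp) {
    assert(len > 0);
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    size_t n = (s[0] >= 0xF0) ? 4 : (s[0] >= 0xE0) ? 3 : 2;
    if (n > len) { *cp = 0xFFFD; return len; }  /* truncated tail */
    uint32_t c = s[0] & (0xFFu >> (n + 1));
    for (size_t i = 1; i < n; i++)
        c = (c << 6) | (s[i] & 0x3Fu);
    *cp = c;
    return n;
}

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF. */
static int is_regional_indicator(uint32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Byte length of the longest prefix of s that holds at most max_cp code
 * points and never ends between the two halves of a flag pair.
 * Hypothetical helper: only the regional-indicator rule is implemented. */
size_t ri_safe_prefix(const unsigned char *s, size_t len, size_t max_cp) {
    size_t pos = 0, count = 0;
    while (pos < len) {
        uint32_t cp;
        size_t bytes = decode_utf8(s + pos, len - pos, &cp);
        size_t cps = 1;
        if (is_regional_indicator(cp) && pos + bytes < len) {
            uint32_t next;
            size_t m = decode_utf8(s + pos + bytes, len - pos - bytes, &next);
            if (is_regional_indicator(next)) {
                bytes += m;  /* keep the pair together as one unit */
                cps = 2;
            }
        }
        if (count + cps > max_cp)
            break;
        pos += bytes;
        count += cps;
    }
    return pos;
}
```

With two UK flags (four code points, 16 bytes) and a budget of three code points, this sketch keeps only the first flag instead of leaving a dangling regional indicator behind. Trimming from the start of the string needs the same pairing logic, because dropping a single leading indicator re-pairs everything after it.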

The grapheme cluster boundary algorithm requires a database, and I don't want to embed the whole database into htop. What I am suggesting is to incorporate libunistring, a library from GNU that has the database we need, specifically the unigbrk.h APIs.

The license for libunistring is "LGPLv3+ or GPLv2+" (dual-licensed), so it is compatible with htop's license (GPLv2+).

@BenBE
Member

BenBE commented Feb 8, 2025

Can the library be used in a way that makes runtime binding to it feasible?
Or would this have to be a compile-time configure option?

@BenBE added the enhancement, build system 🔧 and dependencies labels Feb 8, 2025
@Explorer09
Contributor Author

@BenBE It's likely a configure option, with the possibility of statically linking it. The issue I see is that the library's soname version is not as stable as I would expect. For example, Debian bookworm ships libunistring2 while trixie ships libunistring5:

https://packages.debian.org/bookworm/libunistring2
https://packages.debian.org/trixie/libunistring5

@BenBE
Member

BenBE commented Feb 8, 2025

How stable is its API? Or will we be chasing API change after API change for this lib?

@Explorer09
Contributor Author

@BenBE The API I am considering right now is uc_graphemeclusterbreak_property(), that is, simply retrieving the "Grapheme_Cluster_Break" property from the database. The other functions can't work directly with htop's RichString, so I would likely need to build wrappers for them. (They can work with Unicode strings stored in uint8_t, uint16_t or uint32_t arrays, but they do not support the custom iterator that I need for RichString.)
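To show what "property lookup plus a custom iterator" could look like, here is a rough sketch that applies a small subset of the UAX 29 rules (GB9 for Extend and ZWJ, GB12/GB13 for regional-indicator pairs, GB999 otherwise) one code point at a time. Everything in it is hypothetical: `toy_gcb_property` is a tiny stand-in table so the example compiles without any library (a real build would call uc_graphemeclusterbreak_property() instead), and `NextCodePoint`, `ArrayIter` and `count_clusters` are illustrative names, not RichString or libunistring APIs. Rules that need more data, such as GB11 for emoji ZWJ sequences or the Hangul rules, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy property classes; a real table has many more values. */
typedef enum { TOY_OTHER, TOY_EXTEND, TOY_ZWJ, TOY_RI } ToyGcb;

/* Stand-in for a database lookup. Covers only a few ranges. */
static ToyGcb toy_gcb_property(uint32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return TOY_EXTEND;  /* combining marks */
    if (cp == 0x200D) return TOY_ZWJ;
    if (cp >= 0x1F1E6 && cp <= 0x1F1FF) return TOY_RI;
    return TOY_OTHER;
}

typedef struct {
    bool has_prev;    /* false until the first code point was fed */
    unsigned ri_run;  /* consecutive regional indicators seen so far */
} BreakState;

/* Feed the next code point; returns true if a boundary lies before it. */
static bool break_before(BreakState *st, uint32_t cp) {
    ToyGcb cur = toy_gcb_property(cp);
    bool brk;
    if (!st->has_prev)
        brk = true;                                  /* GB1: start of text */
    else if (cur == TOY_EXTEND || cur == TOY_ZWJ)
        brk = false;                                 /* GB9 */
    else if (cur == TOY_RI && st->ri_run % 2 == 1)
        brk = false;                                 /* GB12/GB13 */
    else
        brk = true;                                  /* GB999 */
    st->ri_run = (cur == TOY_RI) ? st->ri_run + 1 : 0;
    st->has_prev = true;
    return brk;
}

/* Generic source of code points, so any storage can be walked. */
typedef bool (*NextCodePoint)(void *ctx, uint32_t *cp);

size_t count_clusters(NextCodePoint next, void *ctx) {
    assert(next != NULL);
    BreakState st = { false, 0 };
    size_t clusters = 0;
    uint32_t cp;
    while (next(ctx, &cp))
        if (break_before(&st, cp))
            clusters++;
    return clusters;
}

/* Example iterator over a plain uint32_t array. */
typedef struct { const uint32_t *cps; size_t len, pos; } ArrayIter;

bool array_next(void *ctx, uint32_t *cp) {
    ArrayIter *it = ctx;
    if (it->pos >= it->len) return false;
    *cp = it->cps[it->pos++];
    return true;
}
```

Because the state machine only ever sees one code point at a time, the same `break_before` logic could in principle sit behind an iterator over any buffer layout; only the callback would change.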

@rubyFeedback

Compiling libunistring (https://ftp.gnu.org/gnu/libunistring/libunistring-1.3.tar.xz) works fine for me, never had an issue, so this option would not affect me negatively (I think).

I do, however, have one question:

  • Would it be possible to retain htop's old behaviour, e.g. via --disable-libunistring at configure time? Being flexible may be useful here, in case a user does not want or need libunistring or wishes to retain the old behaviour of htop. In other words, would libunistring be optional or become mandatory? (As said, I don't mind either way, but I think it may be useful to declare this up front.)

@Explorer09
Contributor Author

@rubyFeedback I'm personally also reluctant to introduce an additional dependency to htop, so yes, I expect the dependency can be turned off with a configure option as you said.
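As a rough sketch of how an optional dependency could look in the source, the library call can sit behind a compile-time guard. `HAVE_LIBUNISTRING` is a hypothetical macro that a configure check would define, and `next_char_boundary` is an illustrative name. The fallback branch is a plain UTF-8 code-point step used here as a stand-in; it is not necessarily what htop does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef HAVE_LIBUNISTRING   /* hypothetical: set by a configure check */
#include <unigbrk.h>
#endif

/* Returns a pointer just past the first "character" in [s, end).
 * With libunistring this is a grapheme cluster; without it, the
 * fallback advances by a single UTF-8 encoded code point. */
const uint8_t *next_char_boundary(const uint8_t *s, const uint8_t *end) {
    assert(s != NULL && end != NULL);
    if (s >= end)
        return end;
#ifdef HAVE_LIBUNISTRING
    return u8_grapheme_next(s, end);
#else
    s++;
    /* Skip UTF-8 continuation bytes (10xxxxxx). */
    while (s < end && (*s & 0xC0) == 0x80)
        s++;
    return s;
#endif
}
```

In the fallback build, "e" followed by a combining acute accent is treated as two separate steps, which is exactly the kind of split the library-backed branch is meant to avoid.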

Currently, I'm trying other libraries with similar functionality, for example libgrapheme, because I discovered some issues with libunistring that look like bugs. I'll post a summary of the issues I found in a later comment.
