Unicode "Grapheme Break" support (user-perceived character breaks, with libunistring) #1603
Comments
Is using the library doable in a way that makes runtime binding to this library feasible? |
@BenBE It's likely a https://packages.debian.org/bookworm/libunistring2 |
How stable is its API? Or will we be chasing API change after API change for this lib? |
@BenBE The API I was considering right now is |
Compiling libunistring (https://ftp.gnu.org/gnu/libunistring/libunistring-1.3.tar.xz) works fine for me and I have never had an issue with it, so this option would not affect me negatively (I think). I do, however, have one question:
|
@rubyFeedback I'm personally also reluctant to introduce an additional dependency to htop, so yes, I expect the dependency can be turned off with a configure option as you said. Currently, I'm trying other libraries with similar functionalities, for example libgrapheme, as I discovered some issues with libunistring that look like bugs. I'll post a summary of the issues I found in a later post. |
This is a discussion of a feature that would use a library.
htop often needs to break up file paths or command lines, as well as other strings that may contain arbitrary Unicode text.
In discussions of #462 and #854, about shortening a path of a working directory for display, I think it becomes apparent that we need to support the Unicode algorithm of user-perceived character boundaries.
In Unicode, this is defined in UAX 29 section 3:
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
with a normative database:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
This is not to be confused with the Unicode line breaking algorithm (UAX 14).
What are the "Grapheme Cluster Boundaries" in Unicode and why are they important for htop?
When htop needs to truncate a file path or a command line argument string, it is supposed to remove "characters" from the string until the string fits the given terminal column width. However, Unicode characters (code points) do not always map one-to-one to the "characters" that a user would perceive as a single unit.
Special cases can include:
I don't think people would like a series of United Kingdom flag emojis (🇬🇧🇬🇧🇬🇧🇬🇧) suddenly transform into Bulgaria flag emojis plus a letter B (🇧🇬🇧🇬🇧🇬🇧) when characters are cut at a wrong location. This is just an example.
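To make the flag example concrete, here is a minimal sketch in C of why a cut has to land between pairs of Regional Indicator code points (U+1F1E6..U+1F1FF). It is only an illustration: it handles this one pairing rule and nothing else from UAX 29, it assumes valid UTF-8 input, and the helper names (`decode_utf8`, `is_regional_indicator`, `offset_after_units`) are hypothetical, not htop or libunistring functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s. Assumes the input is valid, complete UTF-8.
   Stores the code point in *cp and returns the number of bytes consumed. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Regional Indicator symbols: two of them in a row form one flag. */
static int is_regional_indicator(uint32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Return the byte offset reached after skipping `units` display units from
   the start of str, where a unit is either a pair of Regional Indicators
   or a single other code point. Cutting at the returned offset never splits
   a flag in half. This is NOT a full grapheme cluster implementation. */
static size_t offset_after_units(const char *str, size_t len, size_t units) {
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    size_t pos = 0;
    while (pos < len && units > 0) {
        uint32_t cp;
        pos += decode_utf8(s + pos, &cp);
        if (is_regional_indicator(cp) && pos < len) {
            uint32_t next;
            size_t n = decode_utf8(s + pos, &next);
            if (is_regional_indicator(next))
                pos += n; /* keep both halves of the flag together */
        }
        units--;
    }
    return pos;
}
```

With four UK flags encoded as `"\xF0\x9F\x87\xAC\xF0\x9F\x87\xA7"` repeated four times (32 bytes), `offset_after_units(s, 32, 1)` gives 8, so dropping the first unit removes a whole flag. Cutting after the first code point (byte offset 4) instead is what leaves the B, G, B, G... sequence that renders as Bulgarian flags.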
The grapheme cluster boundary algorithm requires a database, and I don't want to embed the whole database into htop. What I am suggesting is to incorporate libunistring, a library from GNU that has the database we need, specifically the unigbrk.h APIs.
The license for libunistring is "LGPLv3+ or GPLv2+" (dual-licensed), so it is compatible with htop's license (GPLv2+).