-
-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve compatability of unicode grapheme width #4223
Comments
Can you share the codepoints from the problematic sequences? That will make it easier for me to look into this. |
FWIW iterm2 has similar behavior to wezterm: have you tried helix and kakoune on that terminal? I'm open to removing the Line 1008 in 99c9613
|
Note that kitty's current behavior with ZWJ is wrong. It's on my TODO list to implement, see kovidgoyal/kitty#3810 at which point it will render in two cells as well. IMO, rendering in two cells for extended ZWJ sequences is the correct behavior. I have not implemented this yet, partially because it's a fair bit of work in kitty internals and partially because as you point out, most terminal emulators don't implement it and almost no terminal programs implement it, so implementing it would break many things. |
I'm leaning towards expanding the unicode version escape sequences (https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html) to allow applications to select the behavior they desire; details need designing. There's a cell structure used by wezterm that builds on a tiny string type that uses a single bit to represent the column width; that will take a bit of effort to adjust for this concept. |
I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state. When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth() |
Yeah this is really the crux of the issue. I agree that for ZJW emojis specifically width 2 makes sense (not so much for other graphemes tough IMO) but really as an app developer I can't take the specifics of various emulators into account. It would not be ideal to essentially test for know TERM_PROGRAMM values at startup to figure out what width to use (even if I did implemennt that, such a solution is pretty brittle over ssh). It's already a bit of a pain that kitty defaults to respecting emoji presentation/unicode 14, my code supports that but I have to figure out of if the emulator does and ultimately expose a user setting for that. I think the escape sequences approach is pretty nice because atleast it's much more robust regarding ssh and similar. If all terminals that supported more extensive unicode handling (emoji presentation, zwj emojis,...) allowed me to opt out by sending a escape sequences on startup that would atleast prevent my tui from breaking depending on the emulator it's running on. Ideally there would be some mechanism to go the other way too (the tui can figure out what the emulator supports). Otherwise I would just keep my tui at the lowest common denominator (for example alacritty has no plans to support zwj emojis or even emoji presentation AFAIK). Terminfo would seem like a good candidate but that also has the ssh problem so that is a bit brittle but atleast an improvement over querying TERM_PROGRAM. |
On Tue, Aug 29, 2023 at 11:42:01PM -0700, Pascal Kuthe wrote:
> I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.
>
> When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()
Yeah this is really the crux of the issue. I agree that for ZJW emojis specifically width 2 makes sense (not so much for other graphemes tough IMO) but really as an app developer I can't take the specifics of various emulators into account. It would not be ideal to essentially test for know TERM_PROGRAMM values at startup to figure out what width to use (even if I did implemennt that, such a solution is pretty brittle over ssh). It's already a bit of a pain that kitty defaults to respecting emoji presentation/unicode 14, my code supports that but I have to figure out of if the emulator does and ultimately expose a user setting for that.
As an app developer your job is to write code that does the right width
calculation. If this causes problems with X terminal emulator, that is a
bug in that terminal emulator. Please don't try to do terminal emulator
capability detection and workarounds, that way lies insanity. When it
comes to width you will have to try to workaround not just terminal
emulator differences, but also terminal multiplexer ones. It's simply
not doable robustly. Do the right thing, tell users who have problems
with X emulator to report the issue to that emulator.
In this case the right thing is pretty clear -- two cells. So everybody
should converge on that. If XYZ terminal emulator doesn't want to do
that, its not your job to accommodate that wrong decision. Wezterm is
doing the right thing here, kitty is not, and I say this as the creator
of kitty :) So use two as the width for such sequences in your code.
@wez: I say you dont need to do anything, you are already doing the
right thing, keep doing it.
|
Looking into this more, it's a PITA to undo the grapheme clustering work, so I've decided not to do it. wezterm now reports "permanently enabled" to a DECRQM query from an application that wants to reason about the grapheme clustering behavior. wezterm, contour and (soon: there's a PR open) foot implement this query. Some terminals support setting this parameter at runtime. You can read more about this sequence over in https://github.com/contour-terminal/terminal-unicode-core |
This isn't currently possible, but #4223 will introduce that possibility. This commit causes width > 2 to "overflow" into the heap-stored variant. Sequences that are long enough to produce a column width > 2 are likely to already be long enough in bytes to overlow into the heap-stored variant anyway.
I mentioned this on matrix. I personally consider this a bug but it doesn't really fit the the template and depending on your point of view it could also be viewed as a feature request.
While trying to fix the way helix render multi codepoint graphemes (currently truncating afte rthe first char) I noticed that my fix breaks rendering on wezterm. The reason is that wezterm changes the traditional cell width that would be calculated by wcchar (which would be 5 for for a mean_heart_woman emoji for example) by capping the width of a single grapheme to two.
This breaks existing applications (like kakune) and puts me as an application author in a rough spot where I can either use traditional width calculation and get the correct result on most emulators (kitty, alacrity, xterm, gnome terminal, ...) or use the width calculation used by wezterm and get the correct result on wezterm.
Using the wrong definition of width results in horrible visual artifacts (random highlights, disappearing and ghost characters,...).
Traditionally terminal emulators have simply computed the width of unicode cosepoints individually (merging zero width codepoints) without any grapheme segmentation. I personally feel that this is correct because the unicode standard only defines width on a codepoint level (with the exception of emoji representation which affects two codepoints at once buy that still kind of works because the variant selectors is zero width so its atleast a single cell).
The unicode standard does not define any notion of width for graphemes as far as I am aware (basically saying its font dependent). That makes a lot if sense because some graphemes can get really crazy (able to cover multiple lines, get super wide etc.). The width limit of 2 works well for emojis but not for other graphemes (to be fair other graphemes are kind of nieche anyway).
The kitty emulator handles this by keeping the cell width definition unchanged.The unicode segmentation is then simply handled at the font level so multi grapheme emojis still get merged but they simply endup with some trailing empty space.
Would you perhaps consider adoption a similar mechanism for compatibility? If you prefer the current behaviour perphase some kind of escape code based negotiation could be possible like there is for unicode version support?
Right now I am in a bit of an awkward spot where I as an application developer habe to choose which emulator I want to support properly.
The text was updated successfully, but these errors were encountered: