cursor location seems incorrect following zero-width-joined emoji #3810
Comments
Yeah it's on my TODO list to add the ~1400 such sequences from https://unicode.org/emoji/charts/emoji-zwj-sequences.html . If you wish to add support for them, PRs are welcome. More general and complete grapheme clustering: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries |
yep, my assessment is similar. i'm looking at trying to implement unicode faithfully in this regard, at which point i'll start seeing which terminals match it. |
Yes, that's another item for my endless TODO list. It's never been worth I would be willing to donate kitty's gen-wcwidth.py to such a project, |
Sorry, I don't know the technical details, but I have an issue. Step to reproduce:
Debug Details
|
This seems related to my issue, so I'm posting here. I just switched to gnome-terminal which has inconsistent rendering of emojis. At first I was pleased to see that kitty renders them properly, but then I noticed that the cursor positioning is problematic. As it stands, this is actually less usable than gnome-terminal. Screencast.from.13-03-22.16.07.38.mp4 |
That will be because whatever terminal program you are running is using a different width calculation from what kitty uses, which is based on the Unicode standard. And unless your emoji is using zwj it is completely unrelated, so post elsewhere. |
@latipun7: No, this is caused by zsh not supporting ZWJ. If you run |
@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need. If there's interest, I'd be happy to provide suggestions for how it could be implemented in C |
On Mon, Jun 13, 2022 at 08:56:55AM -0700, Max Sharnoff wrote:
> @kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.
> If there's interest, I'd be happy to provide suggestions for how it could be implemented in C
Sure, I am always happy to discuss design ideas. By this sort of thing
do you mean wcswidth() in general or looking up emoji combining
sequences in particular. For zwj+emoji support in kitty one needs basically:
1) Adding zwj+emoji support to wcswidth_step() which basically tells you how the width
of a string changes when you add a codepoint to it.
2) Changing kitty's cell data structure to support infinite length
codepoint strings. This will likely be an auxiliary hash mapping shorts
to heap allocated codepoint arrays. The shorts will be stored per cell.
This will reduce cell memory usage by 4 bytes at the cost of making
looking up the text in a cell more expensive (which is fortunately not a
frequent operation).
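
To make step 2 concrete, here is a minimal sketch of such an auxiliary table, written in Python for brevity. It is only an illustration of the idea described above, not kitty's implementation: the class name `SequenceTable`, its methods, and the choice of id 0 for "no text" are all assumptions made for this example.

```python
# Hypothetical sketch: none of these names come from kitty's source.
class SequenceTable:
    """Interns codepoint sequences and hands out small integer ids."""

    MAX_ID = 0xFFFF  # ids must fit in an unsigned 16-bit "short"

    def __init__(self):
        self._ids = {}          # tuple of codepoints -> id
        self._sequences = [()]  # id -> tuple of codepoints; id 0 means "empty"

    def intern(self, codepoints):
        """Return the id for a sequence, allocating a new one if needed."""
        key = tuple(codepoints)
        if not key:
            return 0
        existing = self._ids.get(key)
        if existing is not None:
            return existing
        new_id = len(self._sequences)
        if new_id > self.MAX_ID:
            raise OverflowError("sequence table is full")
        self._ids[key] = new_id
        self._sequences.append(key)
        return new_id

    def lookup(self, seq_id):
        """Return the codepoints stored for an id (the slower path)."""
        return self._sequences[seq_id]


table = SequenceTable()
woman_scientist = [0x1F469, 0x200D, 0x1F52C]
cell_value = table.intern(woman_scientist)  # this short is what a cell would store
```

Each cell then carries only the short, and the text is recovered with `table.lookup(cell_value)` on the rare occasions it is needed. A real table would also need a way to reclaim ids, which this sketch leaves out.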
|
I was looking just at emoji sequences -- given that they're all supposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column). For adding support: I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing `wcswidth_step()` gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually multiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible). I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empty space in the datatypes may be possible. |
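
A rough sketch of the codepoint trie described in this comment, in Python for brevity. It is a simplification under stated assumptions: each node only records whether the path so far is a listed sequence (the per-node fallback width is omitted), the two sequences are a tiny hand-picked subset of the published list, and every name here (`TrieNode`, `build_trie`, `longest_match`) is hypothetical.

```python
ZWJ = 0x200D


class TrieNode:
    def __init__(self):
        self.children = {}        # codepoint -> TrieNode
        self.is_sequence = False  # True if the path to this node is a listed sequence


def build_trie(sequences):
    root = TrieNode()
    for seq in sequences:
        node = root
        for cp in seq:
            node = node.children.setdefault(cp, TrieNode())
        node.is_sequence = True
    return root


def longest_match(root, codepoints, start=0):
    """Length of the longest listed sequence beginning at `start`, or 0 if none."""
    node = root
    best = 0
    for offset, cp in enumerate(codepoints[start:], 1):
        node = node.children.get(cp)
        if node is None:
            break
        if node.is_sequence:
            best = offset
    return best


# Illustrative subset only; the real list has well over a thousand entries.
SEQUENCES = [
    (0x1F469, ZWJ, 0x1F52C),                # woman scientist
    (0x1F468, ZWJ, 0x1F469, ZWJ, 0x1F467),  # family: man, woman, girl
]
trie = build_trie(SEQUENCES)

complete = [0x1F469, ZWJ, 0x1F52C, ord("A")]
truncated = [0x1F468, ZWJ, 0x1F469]  # most of a longer sequence, missing its end
```

With this data, `longest_match(trie, complete)` reports a 3-codepoint match, while `longest_match(trie, truncated)` reports 0, which is the "missing the end" case: the caller would then have to fall back to treating the codepoints as separate emoji.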
On Mon, Jun 13, 2022 at 02:27:19PM -0700, Max Sharnoff wrote:
> do you mean wcswidth() in general or looking up emoji combining sequences in particular
> I was looking just at emoji sequences -- given that they're all *supposed* to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).
In a terminal context width calculations must be font independent.
> For adding support:
> I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing `wcswidth_step()` gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually multiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).
wcswidth_step() must be able to reduce width to support VS15 (U+FE0E), which
converts emoji to text presentation. So that's not an issue.
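
For illustration, here is a small Python sketch of an incremental width step whose running total can go down as well as up. It is not kitty's `wcswidth_step()`; the names `WidthState`, `width_step` and `running_widths` are invented for this example, and the rules are deliberately crude. It assumes that U+FE0E (text presentation) narrows any preceding wide character, that U+FE0F (emoji presentation) widens any preceding narrow one, and that whatever follows a ZWJ after a wide character is absorbed into it, without checking that the result is a listed sequence.

```python
import unicodedata

ZWJ, VS15, VS16 = 0x200D, 0xFE0E, 0xFE0F


def base_width(cp):
    """Crude per-codepoint width: 0 for marks/format chars, 2 for wide, else 1."""
    ch = chr(cp)
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


class WidthState:
    def __init__(self):
        self.total = 0          # width of everything seen so far
        self.last = 0           # width of the most recent base character
        self.after_zwj = False  # True if the next character should be joined


def width_step(state, cp):
    """Feed one codepoint and return the new total width."""
    if cp == ZWJ:
        state.after_zwj = state.last == 2  # only join onto a wide base
    elif cp == VS15 and state.last == 2:
        state.total -= 1                   # text presentation: width shrinks
        state.last = 1
    elif cp == VS16 and state.last == 1:
        state.total += 1                   # emoji presentation: width grows
        state.last = 2
    elif state.after_zwj:
        state.after_zwj = False            # joined character adds no width
    else:
        w = base_width(cp)
        if w:
            state.total += w
            state.last = w
    return state.total


def running_widths(codepoints):
    state = WidthState()
    return [width_step(state, cp) for cp in codepoints]
```

Feeding U+231A (watch) followed by U+FE0E gives running totals that drop from 2 to 1, and feeding U+1F469, ZWJ, U+1F52C keeps the total at 2 throughout instead of climbing to 4.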
> I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empty space in the datatypes _may_ be possible.
There is no empty space in the struct at the moment, however given the
small number of such sequences one could encode them into the existing
codepoint to mark mapping (kitty maps all combining unicode marks to
numbers; since there are only about 2000 odd of them, they fit in a short).
One could easily add another 2000 marks to this mapping. However, the
question is how future proof this is and whether it's not better to just
allow infinite length sequences now.
|
fwiw @sharnoff notcurses would need pretty much the exact same thing as @kovidgoyal mentions for his |
Describe the bug
Emoji joined via a ZWJ display properly, but the cursor is moved too far. An example:
As you can see, the cursor report indicates that we have moved four -- which is accurate, as we have indeed moved four cursor positions forward. We ought to have moved only two.
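
The arithmetic behind "four instead of two" can be reproduced with a naive per-codepoint width sum. The Python sketch below is only an illustration: the sequence used (woman scientist, U+1F469 U+200D U+1F52C) is picked for the example and is not necessarily the one from this report, and `naive_width` is a hypothetical helper, not anything kitty ships.

```python
import unicodedata


def naive_width(text):
    """Sum per-codepoint widths, ignoring any joining behaviour."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # marks and format characters such as ZWJ take no columns
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


woman_scientist = "\U0001F469\u200D\U0001F52C"
print(naive_width(woman_scientist))  # each emoji counts as 2, the ZWJ as 0
```

Both emoji are wide and the ZWJ is zero-width, so the naive sum is 4 columns, while the joined glyph occupies only 2.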
FWIW, it's nice that kitty actually implements this; it's ahead of most terminals in this regard.
To Reproduce
Steps to reproduce the behavior:
Screenshots
Environment details
Additional context
Happens the same way.