Fix Index Conversion from Text to TextRef #129
Hi @jbdyn,

Thanks for looking into this. Indeed, Python strings are indexed by Unicode code points, while on the Rust side Yrs uses UTF-8. I'm not sure we should do any automatic conversion because, as you said, the indices in the events are UTF-8 based. Maybe we should let the user deal with index conversion manually when dealing with these kinds of characters? For instance, in your example you could do:

    del ytext[0:len(str(ytext).encode())]
    for c, char in enumerate("🌴abcde"):
        ytext.insert(len(str(ytext).encode()), char)

or just:

    del ytext[:]
    for char in "🌴abcde":
        ytext += char

In Yrs an offset_kind can be passed to a
This should be fine when the user knows that the index/slice given to For this, however, having the raw encoded content of Let

    # 1, Unicode -> UTF-8
    j = len(unicode[:i].encode())
    m = len(unicode[i:i+n].encode())
    # 2, UTF-8 -> Unicode
    i = len(utf8[:j].decode())
    n = len(utf8[j:j+m].decode())

For the second part, if I would not have

    # 2, UTF-8 -> Unicode
    utf8 = unicode.encode()
    i = len(utf8[:j].decode())
    n = len(utf8[j:j+m].decode())

on every change. As the bytes are already there in
It seems not to. There is an example for the usage of

    // in Rust
    "Hi ★! to you"
    ----^ index 4
    "Hi 🌴! to you"
    -----^ index 5

In Python strings, however, the

    star = "Hi ★! to you"
    palm = "Hi 🌴! to you"
    assert star[4] == "!"
    assert palm[4] == "!"
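To make the difference concrete, here is a small sketch that computes the position of "!" in both strings under three counting conventions: Unicode code points (what Python indexing uses), UTF-8 bytes, and UTF-16 code units. The helper name index_of is made up for illustration, and the sketch says nothing about which convention the library applies.

```python
def index_of(text: str, char: str) -> dict:
    """Return the position of `char` in `text` under three offset conventions."""
    i = text.index(char)   # Python counts Unicode code points
    prefix = text[:i]
    return {
        "code_points": i,
        "utf8_bytes": len(prefix.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids counting a byte order mark.
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
    }


star = index_of("Hi ★! to you", "!")
palm = index_of("Hi 🌴! to you", "!")
print(star)
print(palm)
```

The star is a single UTF-16 code unit and three UTF-8 bytes, while the palm tree needs a surrogate pair in UTF-16 and four bytes in UTF-8, so only the code point index is the same for both strings.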
Something else crossed my mind: Instead of converting the indices via

    chunks = [
        (0, 1),
        (1, 2),
        (2, 3),
        (3, 7),  # <- 🌴
        (7, 8),
        (8, 9),
        (9, 10),
        (10, 11),
        (11, 12),
        (12, 13),
        (13, 14),
        (14, 15)
    ]

Having that, we could do the following index conversion for single chunks

    j, j_end = chunks[i]
    m = j_end - j
    # (j, m) corresponds to a specific chunk anyway, because the user or user-facing text-editor
    # also only operates in chunks
    i = chunks.index((j, j+m))
    n = 1

This implementation here is highly inefficient on memory, but could be made sparse. Also it does not account for slices including multiple or nested chunks, but this should also be solvable. I can imagine two ways getting and keeping

This approach might seem a bit involved, but I still find it tempting to implement and at least wanted to share the idea.
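One possible way to make the chunk table sparse is to record only the characters that occupy more than one UTF-8 byte and to look them up with a binary search. The class below is a hypothetical sketch built from a static snapshot of the text; it does not show how the table would be kept up to date when the text changes.

```python
from bisect import bisect_left, bisect_right


class SparseOffsets:
    """Hypothetical helper: remembers only the characters wider than one UTF-8 byte."""

    def __init__(self, text: str) -> None:
        self.cp_starts = []   # code point index of each multi-byte character
        self.byte_ends = []   # byte offset just past each multi-byte character
        self.extra = [0]      # extra[k]: surplus bytes from the first k wide characters
        offset = 0
        for i, char in enumerate(text):
            width = len(char.encode("utf-8"))
            offset += width
            if width > 1:
                self.cp_starts.append(i)
                self.byte_ends.append(offset)
                self.extra.append(self.extra[-1] + width - 1)

    def to_utf8(self, i: int) -> int:
        """Code point index -> UTF-8 byte offset."""
        k = bisect_left(self.cp_starts, i)    # wide characters strictly before i
        return i + self.extra[k]

    def to_codepoint(self, j: int) -> int:
        """UTF-8 byte offset (must lie on a character boundary) -> code point index."""
        k = bisect_right(self.byte_ends, j)   # wide characters ending at or before j
        return j - self.extra[k]


offsets = SparseOffsets("Hi 🌴! to you")
start, end = offsets.to_utf8(3), offsets.to_utf8(4)   # byte range of the palm tree
```

For pure ASCII text the table stays empty, and each lookup costs a binary search over the wide characters only.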
Hey @davidbrochart 👋

I played around with some emojis in Text and noticed that insertion is working differently than expected:

🐍 test script

In the Python code, one gives the index for Unicode code points, however [source]

So, I put in some thought to adapt the given index to the UTF-8 encoded string with this PR:

However, I am not sure how to deal with the numbers returned in event.delta upon TextEvents, as they are also based on the UTF-8 encoded form and thereby can be off for the Python string representation. (My use case: keeping Text in sync with the contents of the Textual TextArea widget.)

Should the user deal with that in their own code? Should Text try to give the numbers for the Python string repr? Or should Text be capable of handling rich text as TextRef does: [source]

I also thought about limiting Text to inserted values for which len(val) == len(val.encode()), but this does not feel right to me.
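For reference, a short sketch of what that restriction would allow. The condition len(val) == len(val.encode()) holds exactly when every character is ASCII, since any non-ASCII character takes at least two bytes in UTF-8. The helper name indices_agree is made up for illustration.

```python
def indices_agree(val: str) -> bool:
    """True when code point indices and UTF-8 byte offsets coincide for `val`."""
    return len(val) == len(val.encode("utf-8"))


samples = ["abcde", "", "naïve", "★", "🌴abcde"]
results = {val: indices_agree(val) for val in samples}

# The check gives the same answer as str.isascii() (Python 3.7+).
same_as_isascii = all(indices_agree(val) == val.isascii() for val in samples)
```

The restriction would therefore reject not only emojis but also accented letters and any non-Latin script.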