Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Index Conversion from Text to TextRef #129

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jbdyn
Copy link
Contributor

@jbdyn jbdyn commented Jun 29, 2024

Hey @davidbrochart 👋

I played around with some emojis in Text and noticed that insertion is working different than expected:

🐍 test script
from pycrdt import Doc, Text

## setup

ydoc = Doc()
ytext = Text()
ydoc["text"] = ytext

state = ""            # track state of ytext


def callback(event):
    """Print change record"""
    global state

    new_state = str(event.target)
    delta = str(event.delta)
    print(f"{delta}: '{state}' -> '{new_state}'")

    # update current state
    state = new_state


ytext.observe(callback)


## Manipulate Text

print("Insert and delete single emoji '🌴'")
# works as expected
ytext.insert(0, "🌴")
assert state == "🌴"

# given index is for Unicode code points
# but callback returns length of individual bytes in delta
del ytext[0:1]
assert state == ""

print("\nInsert '🌴abcde' sequentially")
for c, char in enumerate("🌴abcde"):
    ytext.insert(c, char)
assert state == "🌴abcde"
Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}, {'insert': ''}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 4}, {'insert': 'b'}]: '🌴a' -> '🌴ba'
[{'retain': 4}, {'insert': 'c'}]: '🌴ba' -> '🌴cba'
[{'retain': 4}, {'insert': 'd'}]: '🌴cba' -> '🌴dcba'
[{'retain': 5}, {'insert': 'e'}]: '🌴dcba' -> '🌴decba'

In the Python code, one gives the index for Unicode code points, however

TextRef structure internally uses UTF-8 encoding and its length is described in a number of bytes rather than individual characters

[source]

So, I put in some thought to adapt the given index to the UTF-8 encoded string with this PR:

Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 5}, {'insert': 'b'}]: '🌴a' -> '🌴ab'
[{'retain': 6}, {'insert': 'c'}]: '🌴ab' -> '🌴abc'
[{'retain': 7}, {'insert': 'd'}]: '🌴abc' -> '🌴abcd'
[{'retain': 8}, {'insert': 'e'}]: '🌴abcd' -> '🌴abcde'

However, I am not sure how to deal with the numbers returned in event.delta upon TextEvents, as they are also based on the UTF-8 encoded form and thereby can be off for the Python string representation. (My use case: keeping Text in sync with contents of the Textual TextArea widget.)

Should the user deal with that with own code? Should Text try to give the numbers for the Python string repr? Or should Text be capable of handling rich text as TextRef does:

TextRef offers a rich text editing capabilities (it’s not limited to simple text operations). Actions like embedding objects, binaries (eg. images) and formatting attributes are all possible using TextRef.

[source]

I also thought about limiting Text to inserted values for which len(val) == len(val.encode()), but this does not feel right to me.

@jbdyn jbdyn marked this pull request as draft June 29, 2024 05:41
@davidbrochart
Copy link
Collaborator

Hi @jbdyn,

Thanks for looking into this. Indeed Python's string encoding is Unicode and on the Rust side Yrs uses UTF-8. I'm not sure we should do any automatic conversion, because as you said the indices in the events are UTF-8 based. Maybe we should let the user deal with index conversion manually when dealing with these kinds of characters? For instance in your example you could do:

del ytext[0:len(str(ytext).encode())]
for c, char in enumerate("🌴abcde"):
    ytext.insert(len(str(ytext).encode()), char)

or just:

del ytext[:]
for char in "🌴abcde":
    ytext += char

In Yrs an offset_kind can be passed to a Doc, which can be Bytes or Utf16, but I'm not sure it would solve this issue?

@jbdyn
Copy link
Contributor Author

jbdyn commented Jun 29, 2024

Maybe we should let the user deal with index conversion manually when dealing with these kinds of characters?

This should be fine when the user knows that the index/slice given to Text corresponds to UTF-8 encoded content and not to Unicode. But if not, this can become a gotcha as Text seemingly behaves like a Python Unicode string str.

For this, however, having the raw encoded content of TextRef accessible via Python would be quite nice. With that, one does not need to manage an encoded version of a string alongside Text or has to re-encode the returned Unicode string on every insert/delete. My rationale:

Let (i, n) be the index and length, respectively, for Unicode and (j, m) for UTF-8.
The tuple (i, n) is known by the user or used by the user-facing text editor, (j, m) is given in a TextEvent from TextRef.
Let unicode = str(ytext) be the Unicode version of TextRef, the tuples can be converted with

# 1, Unicode -> UTF-8
j = len(unicode[:i].encode())
m = len(unicode[i:i+n].encode())

# 2, UTF-8 -> Unicode
i = len(utf8[:j].decode())
n = len(utf8[j:j+m].decode())

For the second part, if I would not have utf8 directly accessible from Text, I would need to do

# 2, UTF-8 -> Unicode
utf8 = unicode.encode()
i = len(utf8[:j].decode())
n = len(utf8[j:j+m].decode())

on every change. As the bytes are already there in TextRef, the user should not be forced to recompute it, right?

In Yrs an offset_kind can be passed to a Doc, which can be Bytes or Utf16, but I'm not sure it would solve this issue?

It seems not to. There is an example for the usage of OffsetKind. When I replace (3 bytes, single char width) with 🌴 (4 bytes, double char width), the indexing in TextRef::insert does not line up with the indexing in Python as shown here with an !:

// in Rust
"Hi ★! to you"
 ----^ index 4

"Hi 🌴! to you"
 -----^ index 5

In Python strings, however, the ! is in both cases on index 4:

star = "Hi ★! to you"
palm = "Hi 🌴! to you"

assert star[4] == "!"
assert palm[4] == "!"

@jbdyn
Copy link
Contributor Author

jbdyn commented Jun 30, 2024

Something else crossed my mind:

Instead of converting the indices via len(...encode()/decode()) on every change, one could maintain and update an array with chunk ranges. Thereby, the index given to Text is a "chunk index" and less encoding/decoding is needed (just for encoding inserted values, as I see it).
So "Hi 🌴! to you" could be represented as

chunks = [
    (0, 1),
    (1, 2),
    (2, 3),
    (3, 7), # <- 🌴
    (7, 8),
    (8, 9),
    (9, 10),
    (10, 11),
    (11, 12),
    (12, 13),
    (13, 14),
    (14, 15)
]

Having that, we could do the following index conversion for single chunks

j, j_end = chunks[i]
m = j_end - j

# (j, m) corresponds to a specific chunk anyway, because the user or user-facing text-editor
# also only operates in chunks
i = chunks.index((j, j+m))
n = 1

This implementation here is highly inefficient on memory, but could be made sparse. Also it does not account for slices including multiple or nested chunks, but this should also be solvable.

I can imagine two ways getting and keeping chunks in sync with the content of Text/TextRef:

  1. Tree-Sitter might be suitable here as it can perform partial parsing and the queries can be defined by the user so that there is the possibility to have a custom "rich text codec" (How is formatting defined? Where does an image start and end? etc.).
  2. TextRef returns indices for Diffs, which are basically the chunks we are looking for (?).

This approach might seem a bit involved, but I still find it tempting to implement and at least wanted to share the idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants