Conversion to scalar value string #345
Comments
Since a surrogate is defined as being a code point, not a code unit, the very presence of the word "surrogate" in that algorithm automatically implies that the string must first be converted to a code point representation (as per the second normative paragraph in the "strings" section), which itself transforms surrogate (code unit) pairs into the corresponding non-BMP scalar value. Therefore, all remaining surrogates are unpaired surrogates. But I think that's made very clear in the note immediately below that paragraph.
You are right. I followed the link to the definition of surrogate and mixed it up with code unit. Thanks for the explanation. A feeling of uneasiness remains, though, since what Infra calls a surrogate is a surrogate pair in Unicode (chapter 3.8). And a single surrogate (which only exists as a high or low surrogate in Unicode) has a Unicode 32-bit code point, which it does not have in Infra, where it lives just as a code unit. People familiar with Unicode will have many occasions to get lost in the difference between the definitions.
I don't get this right. Infra defines a surrogate as being a code point between U+D800 and U+DFFF. And a code point is a Unicode code point. Which means it is not a character? Because in Unicode a code point is a number https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf#I1.8857 . So it is a number between 0xD800 and 0xDFFF. OK fine. Then it is a code unit, too. And a string is an array of code units. Then why on earth does my example string
not consist of three surrogates that are subject to replacement during conversion to a scalar value string? I don't understand this. PS. @andreubotella responded to this post while I was still editing. Please look at previous versions (available under ...) to fully understand his answer.
Infra's "surrogate" corresponds to Unicode's "surrogate code point" (not to Unicode's "surrogate pair"). Infra doesn't have a definition of "surrogate pair", but the usages in non-normative text correspond to Unicode's definition, which refers to code units. Those uses are somewhat informal, and maybe should be linked to Unicode's definition.
Unicode defines a mapping between numbers ("code points") and meanings (...sorta). The term "character" is often used to refer to those meanings, but since the mapping is one-to-one, you can use "code point" and "character" interchangeably, as we do in Infra:
A code unit is its own specific type of number which is limited to the range between 0 and 0xFFFF, inclusive (this definition of code unit is different from Unicode's, see Unicode 2.5). A code point is its own specific type of number which is limited to the range between 0 and 0x10FFFF, inclusive. An Infra string is defined as a sequence of code units, whether they be valid UTF-16 or not. Additionally, an Infra string can be converted (implicitly, even) into a sequence of code points – a conversion which transforms each code unit to its corresponding code point, except for surrogate code unit pairs. So, a string containing the following code units:
would be converted to a sequence of code points:
You might argue that having U+D800 there is a violation of Unicode, but the web started using Unicode before surrogates were a thing, and changing this behavior might break existing sites. And finally, if you were to transform that string to a scalar value string, you'd start from the code point conversion and replace any remaining surrogate code points:
Edit: The reason you'd start from the code point conversion is that "replace any surrogates with U+FFFD" talks about "surrogates", which Infra defines as code points, and "U+FFFD", which is the notation commonly used to refer to code points. The conversion is implicit, and maybe we should work on that (on #319 perhaps?), but the usage of code point in that algorithm forces that interpretation.
@andreubotella At first I thought this issue should be closed, but I think it should stay open. There are some details to clarify.
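The chain described in this comment (code units, then code points, then scalar values) can be sketched roughly as follows. This is only an illustration, not text from Infra: the function names are made up for this sketch, strings are modeled as plain lists of integers, and the sample code units are an assumption chosen to match the example discussed in this thread (a surrogate pair for U+1F4A9 followed by a lone 0xD800).

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(unit):
    """True for a leading (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trailing (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def to_code_points(code_units):
    """Hypothetical helper: combine surrogate pairs, pass everything else through."""
    code_points = []
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        if is_lead(unit) and i + 1 < len(code_units) and is_trail(code_units[i + 1]):
            # A lead followed by a trail becomes one non-BMP code point.
            combined = 0x10000 + ((unit - 0xD800) << 10) + (code_units[i + 1] - 0xDC00)
            code_points.append(combined)
            i += 2
        else:
            # Any other code unit, including an unpaired surrogate,
            # maps to the code point with the same numeric value.
            code_points.append(unit)
            i += 1
    return code_points


def to_scalar_values(code_units):
    """Hypothetical helper: replace surrogate code points left after pairing."""
    return [
        REPLACEMENT if 0xD800 <= cp <= 0xDFFF else cp
        for cp in to_code_points(code_units)
    ]


units = [0xD83D, 0xDCA9, 0xD800]  # assumed sample input
print([hex(cp) for cp in to_code_points(units)])    # ['0x1f4a9', '0xd800']
print([hex(cp) for cp in to_scalar_values(units)])  # ['0x1f4a9', '0xfffd']
```

Because the replacement step runs on code points, only the surrogate that survived pairing is touched.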
I revisited this issue to analyze what was difficult to understand for me.
No. A number (value in general) is not the same as the memory unit (or abstraction thereof) containing it. This breaks the argument chain. But what holds for me, holds likewise for the standard. If it declares a surrogate as being a value, it should not use it for (code-)units containing them. Mixing this up in one phrase is seen here:
Things like surrogate pair or isolated surrogate do not make any sense given Infra's definition. IMO, not keeping the wording clear here is particularly misleading, because other leading standards such as Unicode or ECMA-262 identify a surrogate with a unit containing a value from the surrogates range; see for example https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#M9.27720.Heading.132.Surrogates.Area. To illustrate how far Unicode differs from Infra in this respect, note that a string containing a surrogate (Infra) is ill-formed in Unicode, while surrogates (Unicode) are common and most often correctly used in UTF-16. I wonder whether it would be better to align the terms consistently with these big documents. Of course, technically, one can establish one's own language, for whatever reasons.
It declares a surrogate as being a code point within a specific range of code points (code points are also signaled using the U+ syntax). And it defines that when a string is discussed in the context of code points it does not contain surrogate pairs. This is further explained in the non-normative text. I think it holds together well. I suppose we might take editorial PRs here, but it really depends on what they would say. I'm personally quite happy with the model we have here as it works really well for strings within the context of the web platform.
In Section 4.6 I read
This is technically correct but, I guess, not what you wanted to express, especially in light of the following note.
Let's look at the example given before, where a string contains 3 surrogates:
The first two combine ordinarily into the character called 'PILE OF POO'. Only the third (U+D800) is not part of a surrogate pair and is thus subject to substitution by U+FFFD, as suggested by Unicode chapter 5.22. The wording chosen in the Infra standard instead demands substituting all 3 characters. Is that really what you want? And to what end?
I suggest changing the cited text to:
To convert a string into a scalar value string, replace any unpaired surrogates with U+FFFD.
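To make the difference between the two readings concrete, here is a rough sketch that works directly on code units. It is an illustration only: both function names are invented for this post, and the sample code units (a pair encoding U+1F4A9 followed by a lone 0xD800) are an assumption matching the example above.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def replace_every_surrogate_unit(units):
    """Literal code-unit reading: every unit in the surrogate range is replaced."""
    return [REPLACEMENT if 0xD800 <= u <= 0xDFFF else u for u in units]


def replace_unpaired_surrogates(units):
    """Proposed reading: well-formed pairs are kept, lone surrogates are replaced."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= unit <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Lead surrogate followed by a trail surrogate: keep the pair intact.
            out.extend([unit, nxt])
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Surrogate without a partner: substitute U+FFFD.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


units = [0xD83D, 0xDCA9, 0xD800]  # assumed sample input
print([hex(u) for u in replace_every_surrogate_unit(units)])  # ['0xfffd', '0xfffd', '0xfffd']
print([hex(u) for u in replace_unpaired_surrogates(units)])   # ['0xd83d', '0xdca9', '0xfffd']
```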
Wolf Lammen
PS I was notified by @andreubotella that I mix up the definition of surrogate with that of Unicode. So read with care!