Description
It has always bothered me that strings have to define reverseind
. But figuring out a correct generic definition for this function has eluded me – until now. I think I've finally figured it out:
reverseind(s,i)
gives the index ins
of the character beginning at bytei
inreverse(s)
.- Then
ncodeunits(s)-i+1
is index of the end of that character ins
andncodeunits(s)-i+2
is the index of the beginning of the next character ins
(or the index right after the end ofs
). - Therefore
prevind(s, ncodeunits(s)-i+2)
is always the index of the character in question ins
. In other words, this is a generic expression forreverseind(s,i)
in terms ofprevind
andncodeunits
.
Edit: I've replaced sizeof
with ncodeunits
as suggested below.
This does actually work out:
julia> s = "∀ x ∃ y"
"∀ x ∃ y"
julia> [reverseind(s,i) for i=1:sizeof(s)]
11-element Array{Int64,1}:
11
10
7
7
7
6
5
4
1
1
1
julia> [prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
11-element Array{Int64,1}:
11
10
7
7
7
6
5
4
1
1
1
The only problem with this is that it requires a generic definition of sizeof(s)
which does not exist, and arguably should not exist for string types that may not be backed by bytes in the usual way. Instead, I would suggest using nextind(s, endof(s))
and giving this some generic function name. This function is something that specific string types may want to overwrite, but that's much easier to do since for typical string types, it's just the storage size of the string.
cc: @stevengj
Activity
oxinabox commentedon Nov 15, 2017
I am now convinced this works, purely empirically.
Fuzzer:
no errors.
stevengj commentedon Nov 15, 2017
sizeof
is wrong here, because it is the size in bytes. e.g. it will fail for a UTF-16 array.What you want is the number of code units, which I think should be
nextind(s,endof(s))-1
.Since we have a
codeunit(s, i)
function, it makes sense to have alengthcodeunits(s)
function or similar that gives the maximum index. Not sure of a good name.StefanKarpinski commentedon Nov 15, 2017
Yes, that's right. So the generic definitions would be:
and you'd have these specific definitions for speed:
I think everything else falls out of the definition of
prevind
which is complex forString
andUTF16String
but just doesi-1
forUTF32String
.stevengj commentedon Nov 15, 2017
I would just define
ncodeunits(s::UTF16String) = length(s.data)
, but yes.StefanKarpinski commentedon Nov 16, 2017
[Not breaking, so removing the
triage
label.]StefanKarpinski commentedon Nov 16, 2017
This relies on the indices into a string being the same as its code unit indices. We haven't formally required that before, but I think that we should – that's how all actual string types we've seen work and it's hard to imagine any other way to do this.
stevengj commentedon Nov 16, 2017
It won't work if we define a
StringIndex
type (#9297), because then the- 1
might not be defined.StefanKarpinski commentedon Nov 16, 2017
I suspect we're not going to move ahead with #9297, but if we do then string types will just have to define
ncodeunits
directly and we can't provide a fallback for them anymore.StefanKarpinski commentedon Nov 20, 2017
I've realized that there's a complication here. The contract of
reverseind
is essentially the identityHowever, there's an assumption baked into this which is a bit of an issue: the type and encoding of
reverse(s)
. Makingreverseind
generic in the way I've proposed assumes that the type and encoding ofreverse(s)
is the same as that ofs
. Until nowreverse(::String)
has returned aRevString
, which has made changing this behavior harder than expected since the generic definition does not work. We can fix this particular issue, but this raises a basic question:reverse(s)
have the same type and encoding ass
? OR...reverse(s)
return aString
– i.e. normalize to standard string type?We can only have a correct generic fallback for
reverseind
for one choice, not both – since they dictate different behaviors forreverseind
. I'm inclined to go with option 1 for a couple of reasons:reverseind
is possible for the same type, under fairly reasonable assumptions that are valid, e.g. for UTF-8 and UTF-16 (and trivially UTF-32).reverseind
betweenString
and a generic encoding isn't possible in the same way (as far as I can tell).stevengj commentedon Nov 20, 2017
@StefanKarpinski, recall that this was discussed in #23612, and in consequence we documented that
reverse(s)
always returns aString
. (That would also argue against a genericreverseind
.)StefanKarpinski commentedon Nov 20, 2017
I think that requiring people to define
reverseind
for custom string types is pretty unfortunate. It's a complicated and very weird function, yet string types don't work correctly without defining it. That's pretty bad. The other approach that's possible is to double down onRevString
and always havereverse(s)
returnRevString(s)
.16 remaining items