Skip to content

make reverseind generic #24613

Closed
Closed
@StefanKarpinski

Description

@StefanKarpinski
Member

It has always bothered me that strings have to define reverseind. But figuring out a correct generic definition for this function has eluded me – until now. I think I've finally figured it out:

  • reverseind(s,i) gives the index in s of the character beginning at byte i in reverse(s).
  • Then ncodeunits(s)-i+1 is index of the end of that character in s and ncodeunits(s)-i+2 is the index of the beginning of the next character in s (or the index right after the end of s).
  • Therefore prevind(s, ncodeunits(s)-i+2) is always the index of the character in question in s. In other words, this is a generic expression for reverseind(s,i) in terms of prevind and ncodeunits.

Edit: I've replaced sizeof with ncodeunits as suggested below.

This does actually work out:

julia> s = "∀ x ∃ y"
"∀ x ∃ y"

julia> [reverseind(s,i) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1

julia> [prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1

The only problem with this is that it requires a generic definition of sizeof(s) which does not exist, and arguably should not exist for string types that may not be backed by bytes in the usual way. Instead, I would suggest using nextind(s, endof(s)) and giving this some generic function name. This function is something that specific string types may want to overwrite, but that's much easier to do since for typical string types, it's just the storage size of the string.

cc: @stevengj

Activity

oxinabox

oxinabox commented on Nov 15, 2017

@oxinabox
Contributor

I am now convinced this works, purely empirically.

Fuzzer:


julia> newcheck(s)=[prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
newcheck (generic function with 1 method)

julia> oldcheck(s)=[reverseind(s,i) for i=1:sizeof(s)]
oldcheck (generic function with 1 method)

julia> for t in [join(rand(Char, rand(1:100))) for _ in 1:10^5]
	Base.Test.@test(newcheck(t) == oldcheck(t))
end

no errors.

stevengj

stevengj commented on Nov 15, 2017

@stevengj
Member

sizeof is wrong here, because it is the size in bytes. e.g. it will fail for a UTF-16 array.

What you want is the number of code units, which I think should be nextind(s,endof(s))-1.

Since we have a codeunit(s, i) function, it makes sense to have a lengthcodeunits(s) function or similar that gives the maximum index. Not sure of a good name.

StefanKarpinski

StefanKarpinski commented on Nov 15, 2017

@StefanKarpinski
SponsorMemberAuthor

Yes, that's right. So the generic definitions would be:

reverseind(s::AbstractString, i::Int) = prevind(s, ncodeunits(s)-i+2)
ncodeunits(s::AbstractString) = nextind(s, endof(s))-1

and you'd have these specific definitions for speed:

ncodeunits(s::String) = sizeof(s)
ncodeunits(s::UTF16String) = sizeof(s) >> 1
ncodeunits(s::UTF32String) = sizeof(s) >> 2

I think everything else falls out of the definition of prevind which is complex for String and UTF16String but just does i-1 for UTF32String.

stevengj

stevengj commented on Nov 15, 2017

@stevengj
Member

I would just define ncodeunits(s::UTF16String) = length(s.data), but yes.

StefanKarpinski

StefanKarpinski commented on Nov 16, 2017

@StefanKarpinski
SponsorMemberAuthor

[Not breaking, so removing the triage label.]

StefanKarpinski

StefanKarpinski commented on Nov 16, 2017

@StefanKarpinski
SponsorMemberAuthor

This relies on the indices into a string being the same as its code unit indices. We haven't formally required that before, but I think that we should – that's how all actual string types we've seen work and it's hard to imagine any other way to do this.

stevengj

stevengj commented on Nov 16, 2017

@stevengj
Member

It won't work if we define a StringIndex type (#9297), because then the - 1 might not be defined.

StefanKarpinski

StefanKarpinski commented on Nov 16, 2017

@StefanKarpinski
SponsorMemberAuthor

I suspect we're not going to move ahead with #9297, but if we do then string types will just have to define ncodeunits directly and we can't provide a fallback for them anymore.

StefanKarpinski

StefanKarpinski commented on Nov 20, 2017

@StefanKarpinski
SponsorMemberAuthor

I've realized that there's a complication here. The contract of reverseind is essentially the identity

s[reverseind(s,i)] == reverse(s)[i]

However, there's an assumption baked into this which is a bit of an issue: the type and encoding of reverse(s). Making reverseind generic in the way I've proposed assumes that the type and encoding of reverse(s) is the same as that of s. Until now reverse(::String) has returned a RevString, which has made changing this behavior harder than expected since the generic definition does not work. We can fix this particular issue, but this raises a basic question:

  1. Should reverse(s) have the same type and encoding as s? OR...
  2. Should reverse(s) return a String – i.e. normalize to standard string type?

We can only have a correct generic fallback for reverseind for one choice, not both – since they dictate different behaviors for reverseind. I'm inclined to go with option 1 for a couple of reasons:

  • If someone is working with a specific encoding, it's likely for a reason and we should respect that unless they specifically request changing encodings by converting types.
  • An efficient definition of reverseind is possible for the same type, under fairly reasonable assumptions that are valid, e.g. for UTF-8 and UTF-16 (and trivially UTF-32).
  • An efficient generic definition of reverseind between String and a generic encoding isn't possible in the same way (as far as I can tell).
stevengj

stevengj commented on Nov 20, 2017

@stevengj
Member

@StefanKarpinski, recall that this was discussed in #23612, and in consequence we documented that reverse(s) always returns a String. (That would also argue against a generic reverseind.)

StefanKarpinski

StefanKarpinski commented on Nov 20, 2017

@StefanKarpinski
SponsorMemberAuthor

I think that requiring people to define reverseind for custom string types is pretty unfortunate. It's a complicated and very weird function, yet string types don't work correctly without defining it. That's pretty bad. The other approach that's possible is to double down on RevString and always have reverse(s) return RevString(s).

16 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @StefanKarpinski@stevengj@oxinabox

      Issue actions

        make reverseind generic · Issue #24613 · JuliaLang/julia