Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle UTF-8 code points in StdString #381

Merged
merged 14 commits into from
Oct 21, 2023
Merged

Conversation

omus
Copy link
Contributor

@omus omus commented Oct 18, 2023

Update StdString Julia string interface to handle UTF-8 code points. For example on CxxWrap 0.14:

julia> using CxxWrap

julia> str = "¿\0ÿ"
"¿\0ÿ"

julia> codeunits(str)
5-element Base.CodeUnits{UInt8, String}:
 0xc2
 0xbf
 0x00
 0xc3
 0xbf

julia> std_str = StdString(str, ncodeunits(str))
"¿"

julia> collect(std_str)
5-element Vector{Char}:
 'Â': Unicode U+00C2 (category Lu: Letter, uppercase)
 '¿': Unicode U+00BF (category Po: Punctuation, other)
 '\0': ASCII/Unicode U+0000 (category Cc: Other, control)
 'Ã': Unicode U+00C3 (category Lu: Letter, uppercase)
 '¿': Unicode U+00BF (category Po: Punctuation, other)

julia> codeunits(std_str)
5-element Base.CodeUnits{UInt8, CxxWrap.StdLib.StdStringAllocated}:
 0xc2
 0xbf
 0x00
 0xc3
 0xbf

julia> String(codeunits(std_str))
"¿\0ÿ"

julia> convert(String, convert(StdString, "¿\0ÿ"))
"¿"

This PR includes the following changes:

@omus
Copy link
Contributor Author

omus commented Oct 18, 2023

Looking into the CI failures

@barche
Copy link
Collaborator

barche commented Oct 18, 2023

Thanks, to me this looks good, I'm CC'ing @kleinschmidt since he recently contributed to the string code as well.

Copy link
Contributor

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any major objections to this; it may be surprising to users that you can't construct a StdString from a String with null bytes but uh that's also probably a pretty small set of users so I think this is okay

I do think we shoudl be documenting this a bit more carefully, since it's been a footgun for us quite a few times recently!

src/StdLib.jl Show resolved Hide resolved
@omus
Copy link
Contributor Author

omus commented Oct 20, 2023

I don't have any major objections to this; it may be surprising to users that you can't construct a StdString from a String with null bytes but uh that's also probably a pretty small set of users so I think this is okay

I do think we shoudl be documenting this a bit more carefully, since it's been a footgun for us quite a few times recently!

I completely agree. I'll add a docstring to StdString explaining this.

Additionally, I'll validate this PR's changes against the Ray.jl repo which currently implements a custom conversion function to losslessly convert StdStrings.

@omus
Copy link
Contributor Author

omus commented Oct 20, 2023

I took another look into Cstring and discovered that we did have the option of constructing null-terminated strings separately from the StdString(::String) constructor (I was original misled that Base.cconvert(Cstring, "foo")::String). Users of CxxWrap should no longer be surprized when converting between StdString and Strings which contain null-characters. For those who want the c-string constructor behaviour you can now use StdString(::Cstring)/StdString(::Base.CodeUnits)/StdString(::Vector{UInt8}).

Copy link
Contributor

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is in good shape. For the sake of completeness, could you add a short description to the README as well?

src/StdLib.jl Outdated Show resolved Hide resolved
@omus
Copy link
Contributor Author

omus commented Oct 20, 2023

Validating these changes against Ray.jl works well and revealed two things:

  1. The Julia string interface methods don't work with ConstCxxRef{StdString} without dereferencing
  2. Serializing StdString fails when trying to deserialize.

I don't think either of these need to be addressed as part of this PR. Examples for those who like them:

julia> using CxxWrap, Serialization

julia> collect(ConstCxxRef(StdString("foo")))
ERROR: MethodError: no method matching size(::CxxWrap.StdLib.StdStringDereferenced)

Closest candidates are:
  size(::Union{LinearAlgebra.Adjoint{T, var"#s972"}, LinearAlgebra.Transpose{T, var"#s972"}} where {T, var"#s972"<:(AbstractVector)})
   @ LinearAlgebra ~/Development/Julia/aarch64/1.9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/adjtrans.jl:296
  size(::Union{LinearAlgebra.Adjoint{T, var"#s972"}, LinearAlgebra.Transpose{T, var"#s972"}} where {T, var"#s972"<:(AbstractMatrix)})
   @ LinearAlgebra ~/Development/Julia/aarch64/1.9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/adjtrans.jl:297
  size(::Union{LinearAlgebra.QR, LinearAlgebra.QRCompactWY, LinearAlgebra.QRPivoted})
   @ LinearAlgebra ~/Development/Julia/aarch64/1.9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/qr.jl:582
  ...

julia> x = sprint(serialize, StdString("foo"))
"7JL\x14\x04\0\0\x005\x10\x01\x12StdStringAllocated\x1f\v\xf4}Iz\xf0\x891\xae*Zʗ<\xa4\x15\x1f\x01\aCxxWrap\x01\x06StdLibD4\x10T\x1fN\x9bD\x01\0\0\0\x10\x01\aNothing\x1fN\x9bD\0\0\0\0\0\0\0\0"

julia> deserialize(IOBuffer(x))
Error showing value of type CxxWrap.StdLib.StdStringAllocated:
ERROR: C++ object of type NSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE was deleted

@omus
Copy link
Contributor Author

omus commented Oct 20, 2023

Should be good to go from my end.

@barche
Copy link
Collaborator

barche commented Oct 21, 2023

I have never tested serialization/deserialization with any C++ type, so that is a whole other can of worms. Thanks for these changes to you both!

@barche barche merged commit 81864a2 into JuliaInterop:main Oct 21, 2023
9 checks passed
@omus omus deleted the cv/stdstring-utf8 branch October 23, 2023 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants