Skip to content

String constructor is bad #16713

Closed
Closed
@TotalVerb

Description

@TotalVerb
Contributor

The String constructor here leaves a lot to be desired:

"""
    String(v::Vector{UInt8})

Wrap a vector of bytes encoding string data as UTF-8 in a `String` object.
The resulting `String` object takes ownership of the array.
"""
String(s::Vector{UInt8}) =
    ccall(:jl_pchar_to_string, Ref{String}, (Ptr{UInt8},Int), s, length(s))
  • It makes a copy, so it's very slow. At least in Julia 0.3, no copy was made.
  • The behaviour is different from the docstring. The String object doesn't take ownership of anything.
  • It's unsafe. String([0x80]); is allowed, even though is is an invalid string.
  • It overwrites the String inner constructor, making the inner constructor useless and emitting a warning at build time.

Obviously not all of these can be fixed (points 1 and 3 would move it in different directions), but I think it ought to pick a direction:

  • String should be safe, and check its input (along with making a copy). The method is updated to check validity of UTF8. The docstring is updated. An unsafe_array_to_string replaces the current behaviour, and additionally does not make a copy. The inner constructor is deleted.
  • String should not be safe, and does not check input, and does not make a copy. The method is deleted and the docstring moved to the inner constructor.

Activity

ararslan

ararslan commented on Jun 2, 2016

@ararslan
Member

Ref #16107 (general direction for String) and #16590 (recent developments)

stevengj

stevengj commented on Jun 2, 2016

@stevengj
Member

I think there was a conscious decision by @StefanKarpinski to allow invalid UTF-8 data, so that String containers could be used to pass through arbitrary binary data. This is common to many UTF-8 implementations, including Python's. (The decoding algorithm to extract chars/codepoints must still check for invalid data, of course. There have been a lot of proposals for how to decode such data.)

stevengj

stevengj commented on Jun 2, 2016

@stevengj
Member

See also #16470.

stevengj

stevengj commented on Jun 3, 2016

@stevengj
Member

I think @StefanKarpinski may have accidentally changed the String(::Vector{UInt8}) semantics in #16453. When he renamed the bytestring(::Vector{UInt8}) function to String(::Vector{UInt8}), he overrode the inner String constructor (which doesn't make a copy) from #16212.

Probably bytestring(data::Vector{UInt8}) should have been deprecated to String(copy(data)).

added a commit that references this issue on Jun 3, 2016
92ad69c
added 2 commits that reference this issue on Jun 4, 2016
650978b
77631ce
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @simonbyrne@stevengj@ararslan@TotalVerb

        Issue actions

          `String` constructor is bad · Issue #16713 · JuliaLang/julia