Closed
Description
The String
constructor here leaves a lot to be desired:
"""
String(v::Vector{UInt8})
Wrap a vector of bytes encoding string data as UTF-8 in a `String` object.
The resulting `String` object takes ownership of the array.
"""
String(s::Vector{UInt8}) =
ccall(:jl_pchar_to_string, Ref{String}, (Ptr{UInt8},Int), s, length(s))
- It makes a copy, so it's very slow. At least in Julia 0.3, no copy was made.
- The behaviour is different from the docstring. The
String
object doesn't take ownership of anything. - It's unsafe.
String([0x80]);
is allowed, even though is is an invalid string. - It overwrites the
String
inner constructor, making the inner constructor useless and emitting a warning at build time.
Obviously not all of these can be fixed (points 1 and 3 would move it in different directions), but I think it ought to pick a direction:
String
should be safe, and check its input (along with making a copy). The method is updated to check validity of UTF8. The docstring is updated. Anunsafe_array_to_string
replaces the current behaviour, and additionally does not make a copy. The inner constructor is deleted.String
should not be safe, and does not check input, and does not make a copy. The method is deleted and the docstring moved to the inner constructor.
Metadata
Metadata
Assignees
Labels
Type
Projects
Milestone
Relationships
Development
No branches or pull requests
Activity
Parser
module JuliaIO/JSON.jl#140ararslan commentedon Jun 2, 2016
Ref #16107 (general direction for
String
) and #16590 (recent developments)stevengj commentedon Jun 2, 2016
I think there was a conscious decision by @StefanKarpinski to allow invalid UTF-8 data, so that
String
containers could be used to pass through arbitrary binary data. This is common to many UTF-8 implementations, including Python's. (The decoding algorithm to extract chars/codepoints must still check for invalid data, of course. There have been a lot of proposals for how to decode such data.)stevengj commentedon Jun 2, 2016
See also #16470.
stevengj commentedon Jun 3, 2016
I think @StefanKarpinski may have accidentally changed the
String(::Vector{UInt8})
semantics in #16453. When he renamed thebytestring(::Vector{UInt8})
function toString(::Vector{UInt8})
, he overrode the innerString
constructor (which doesn't make a copy) from #16212.Probably
bytestring(data::Vector{UInt8})
should have been deprecated toString(copy(data))
.rename pointer-based string and array constructors to unsafe_string, …
rename pointer-based string and array constructors to unsafe_string, …
rename pointer-based string and array constructors to unsafe_string, …