Serialization #60
Here's a little benchmark using Julia 0.5.0, NullableArrays 0.1.0, CategoricalArrays 0.1.2:

```julia
using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);

open("arr.jlser", "w") do io serialize(io, arr) end
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 3-element Array{Int64,1}:
#  266664515
#   38641606
#   40000939
```

It looks strange that categorical is that big.

---
Here are the deserialization results. Unfortunately, I wasn't able to install ….

```julia
@time open("arr.jlser", "r") do io deserialize(io) end
# 58.859399 seconds (211.15 M allocations: 8.504 GB, 21.28% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 82.845217 seconds (211.11 M allocations: 8.502 GB, 43.56% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 39.956336 seconds (111.14 M allocations: 4.936 GB, 55.29% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 23.724424 seconds (111.11 M allocations: 4.935 GB, 25.49% gc time)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.225701 seconds (106.56 k allocations: 42.491 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.092174 seconds (1.97 k allocations: 38.244 MB)
```

---
Interesting. I would also compare with ….

The filesize for the categorical array (40000939) is exactly what I would expect: 4 bytes per element plus some negligible overhead. It's hard to be more efficient than that. I'm not sure how ….

As for possible optimizations, we could rebuild ….

---
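As a quick sanity check on the "4 bytes per element" estimate (assuming the `UInt32` reference codes that CategoricalArrays uses by default), the expected payload for 10^7 elements can be computed directly:

```julia
# Back-of-the-envelope check: 10^7 elements stored as UInt32 reference
# codes (4 bytes each) should give a ~40 MB payload, matching the
# observed 40000939-byte file up to pool/metadata overhead.
n = 10^7
payload = n * sizeof(UInt32)
println(payload)  # 40000000
```

The difference between this figure and the observed file size (under 1 KB) is the level pool plus serializer framing, which is indeed negligible at this scale.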
Ah, right, categorical array uses …. Here are the results for …:

```julia
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
#   40000012
#  266663942
#   38640383
#   40000939

@time open("arr2.jlser", "r") do io deserialize(io) end
# 39.064836 seconds (110.00 M allocations: 5.290 GB, 47.89% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end
# 36.619610 seconds (110.00 M allocations: 5.290 GB, 43.79% gc time)
```

I guess there is room for improvement in Base. I tried to use serialization mechanisms to load/save moderately sized data frames (~5.5 GB gzipped), and loading is terribly slow.

---
Yeah, I guess we could use the same logic as in ….

There should be only about 11% of nulls in the vector (one of the nine sampled values is `Nullable()`), so that can explain the difference compared with …. Also, it's really weird that deserializing ….

---
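One bulk-oriented approach along those lines could look like the sketch below. The helper names `save_nullable`/`load_nullable` are purely illustrative (not part of NullableArrays or Base): the idea is to serialize the values and the null mask as two plain vectors in one shot each, avoiding the per-element work that makes element-wise (de)serialization slow.

```julia
using Serialization  # stdlib in Julia >= 0.7; serialize/deserialize lived in Base at the time

# Hypothetical helpers: write a nullable string vector as two bulk
# fields instead of element by element.
function save_nullable(io::IO, values::Vector{String}, isnull::Vector{Bool})
    serialize(io, values)   # one bulk write for the values
    serialize(io, isnull)   # one bulk write for the null mask
end

function load_nullable(io::IO)
    values = deserialize(io)
    isnull = deserialize(io)
    return values, isnull
end

# Round trip through an in-memory buffer:
buf = IOBuffer()
save_nullable(buf, ["A", "B", "C"], [false, true, false])
seekstart(buf)
values, isnull = load_nullable(buf)
```

Two bulk (de)serializations of homogeneous vectors should be much cheaper than 10^7 individual `Nullable` reads, which is consistent with the timing gap seen above.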
OTOH, since serialization tries to restore the references, it's better to keep the serialization format as close to the type as possible. Gzipping should take care of it automatically:

```julia
using GZip  # gzopen comes from the GZip.jl package

gzopen("arr.jlser.gz", "w") do io serialize(io, arr) end
gzopen("arr2.jlser.gz", "w") do io serialize(io, arr2) end
gzopen("null_arr.jlser.gz", "w") do io serialize(io, NullableArray(arr)) end
gzopen("null_cat_arr.jlser.gz", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr2.jlser.gz", "arr.jlser.gz", "null_arr.jlser.gz", "null_cat_arr.jlser.gz"])
# 4-element Array{Int64,1}:
#  5801136
#  8686152
#  7389933
#  6254812
```

The contents of ….

In hex: ….

---
Of course, that's because your strings are only one character long, so 1 byte each, plus apparently 3 bytes of overhead. With longer level names, the story would be quite different!

---
Relevant upstream issue: JuliaLang/julia#18633

---
The representation of ….

---
Had not tested v0.6, but ….

```julia
using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
#   40000012
#  266670491
#   38643340
#   40000939

@time open("arr.jlser", "r") do io deserialize(io) end;
# 83.365617 seconds (127.84 M allocations: 4.879 GB, 7.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 94.405233 seconds (127.78 M allocations: 4.876 GB, 16.81% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 16.028068 seconds (57.82 M allocations: 1.758 GB, 39.62% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 12.801466 seconds (57.78 M allocations: 1.756 GB, 26.86% gc time)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.271359 seconds (106.03 k allocations: 42.476 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.034357 seconds (1.73 k allocations: 38.231 MB)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 11.650630 seconds (50.00 M allocations: 1.714 GB, 14.06% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 12.030219 seconds (50.00 M allocations: 1.714 GB, 14.30% gc time)
```

---
Finally, v0.6 using the official nightly. Everything is improved, although, with the exception of categorical arrays, there's still considerable time/memory overhead.

```julia
arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
#  100000012
#  320003354
#   91977889
#   40001083

@time open("arr.jlser", "r") do io deserialize(io) end;
# 21.946035 seconds (89.01 M allocations: 3.474 GiB, 15.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 26.444456 seconds (88.89 M allocations: 3.469 GiB, 26.98% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 7.818007 seconds (48.89 M allocations: 1.094 GiB, 35.74% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 5.232938 seconds (48.89 M allocations: 1.094 GiB, 13.68% gc time)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.148508 seconds (33.76 k allocations: 39.701 MiB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
# 0.031035 seconds (2.04 k allocations: 38.254 MiB)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 5.172547 seconds (40.00 M allocations: 991.886 MiB, 15.52% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 4.753201 seconds (40.00 M allocations: 991.886 MiB, 12.20% gc time)
```

---
Much better indeed. It could be useful to post the relevant timings on the Julia issue too. I'm not sure what other software could be used to compare serialization performance. Maybe Feather and HDF5?

---
Also R's ….

---
Was the performance/storage efficiency of `[Nullable]CategoricalArray` serialization checked? Would it make sense to override `serialize()`/`deserialize()` for `CategoryPool` (`invindex` and especially `valindex` fields could be reconstructed from `index`)?
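A minimal sketch of what such an override could look like, using a toy pool type rather than the real `CategoryPool` (the `ToyPool` type and its fields are invented for illustration; the `AbstractSerializer`/`serialize_type` hooks are the standard custom-serialization extension pattern):

```julia
using Serialization  # stdlib in Julia >= 0.7; serialize/deserialize lived in Base at the time

# Toy stand-in for a categorical pool: `invindex` is fully derivable
# from `index`, so only `index` needs to go into the stream.
struct ToyPool
    index::Vector{String}          # code => level value
    invindex::Dict{String,UInt32}  # level value => code (derived)
end

ToyPool(index::Vector{String}) =
    ToyPool(index, Dict(v => UInt32(i) for (i, v) in enumerate(index)))

function Serialization.serialize(s::AbstractSerializer, p::ToyPool)
    Serialization.serialize_type(s, ToyPool)
    serialize(s, p.index)  # store only the non-derived field
end

function Serialization.deserialize(s::AbstractSerializer, ::Type{ToyPool})
    ToyPool(deserialize(s))  # rebuild invindex on load
end

# Round trip through an in-memory buffer:
buf = IOBuffer()
serialize(buf, ToyPool(["A", "B", "C"]))
seekstart(buf)
pool = deserialize(buf)
```

This trades a little reconstruction work at load time for a smaller stream and no need to keep derived fields consistent in the serialized format.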