Sparse categorical arrays? #374
Comments
I have no particular contributions to offer, but I am interested in learning something about the subject and I would like to ask a few questions.
Essentially I have thousands of columns with categorical entries and lots (> 90%) of …

julia> using DataFramesMeta, SparseArrays;
julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 0
3 │ 0
4 │ 0
5 │ 1
julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
deleteat!(::BitVector, ::Any) at bitarray.jl:981
...
Stacktrace:
[1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
@ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
[2] delete!(df::DataFrame, inds::Vector{Int64})
@ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
[3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
@ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
[4] top-level scope
@ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681
This is unrelated to DataFrames.jl. The problem is that your column type does not define deleteat!.
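(A possible workaround, sketched here rather than taken from the thread: the non-mutating subset builds a new data frame by indexing the column, so it never calls deleteat! and works with a SparseVector-backed column; converting the column to a dense Vector first is another option if in-place filtering is required.)

using DataFramesMeta, SparseArrays

df = DataFrame(x = sparse([1, 0, 0, 0, 1]))

# Non-mutating: indexes the column instead of calling deleteat! on it.
kept = @subset df :x .== 1

# In-place alternative: densify the column first, then filter in place.
df.x = Vector(df.x)
@subset! df :x .== 1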
But I guess there must be a good reason why deleteat! is not implemented for sparse vectors.
CategoricalArrays are already "sparse" in the sense that they don't store the full values, only one integer reference per element into a pool of levels. So I don't think this is needed.
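(For context, a rough illustration of what "sparse" means here, not taken from the thread: a CategoricalVector stores each distinct value once in its pool and keeps one integer reference per element, so the savings come from the pool, not from skipping elements.)

using CategoricalArrays

x = vcat(fill("none", 98), ["a", "b"])    # 98% of entries share one value

cv = categorical(x)                       # reference type defaults to UInt32
cv8 = categorical(x; compress = true)     # smallest reference type that fits (UInt8 here)

levels(cv)        # the three distinct values, each stored once in the pool
length(cv.refs)   # still one reference per element (refs is an internal field)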
But they do take memory linear in their size, while for sparse vectors the memory taken is proportional to the number of non-zero entries. If the data is very sparse, there is a very big difference between these two scenarios. Still, for the reason that @pdeffebach raises, this is probably not a top priority. However, your measure of memory footprint is incorrect. You should do:
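(The snippet that followed is not preserved in this thread. As a hedged sketch of one way to make the comparison, Base.summarysize measures the whole object, including the pool of a categorical vector or the stored indices of a sparse one:)

using CategoricalArrays, SparseArrays

n = 10^6
x = zeros(Int, n)
x[1:1000] .= 1            # only 0.1% of the entries are non-zero

cv = categorical(x)       # one reference integer per element, plus a tiny pool
sv = sparse(x)            # stores only the indices and values of the non-zeros

Base.summarysize(cv)      # grows linearly with n
Base.summarysize(sv)      # grows with the number of non-zero entries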
Yes, I agree with both of you. I just thought it would be a nice-to-have in my case, since categorical arrays still store one integer per value no matter what (the category's reference code). With …
I would file an issue in Base, as well, about resizing sparse arrays. See what they think.
I think it would not be that hard. You probably would need to specify what value you want to be treated as the "zero" of the sparse array. However, I would propose waiting for @nalimilan to comment on how he sees adding it to CategoricalArrays.jl from the general perspective of package maintenance.
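(To make the "which value is the zero" point concrete, here is a hypothetical minimal sketch; the type name, fields, and constructor are invented for illustration and are not part of CategoricalArrays.jl or SparseArrays:)

using SparseArrays

# Hypothetical type: reference codes stored sparsely, with structural zeros
# standing for one designated default level.
struct SparseCategoricalVector{T} <: AbstractVector{T}
    refs::SparseVector{Int,Int}  # 0 entries represent `zerolevel`
    levels::Vector{T}            # distinct non-default values, stored once
    zerolevel::T                 # the value treated as the sparse "zero"
end

Base.size(v::SparseCategoricalVector) = size(v.refs)
Base.getindex(v::SparseCategoricalVector, i::Int) =
    (r = v.refs[i]; r == 0 ? v.zerolevel : v.levels[r])

function SparseCategoricalVector(x::AbstractVector{T}, zerolevel::T) where {T}
    lvls = sort!(unique(filter(!=(zerolevel), x)))
    code = Dict(l => i for (i, l) in enumerate(lvls))
    refs = sparse([get(code, xi, 0) for xi in x])
    return SparseCategoricalVector{T}(refs, lvls, zerolevel)
end

# The dominant value costs no per-element storage:
v = SparseCategoricalVector(["none", "a", "none", "none", "b"], "none")
v[1], v[2]    # ("none", "a")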
Yes, let's start with @pdeffebach's suggestion to shake things up in the stdlib.
In theory it shouldn't be too hard to support sparse array-backed CategoricalArrays. You're welcome to experiment with this if that would be useful for you, though I cannot promise I will accept the change if it turns out to increase the package's complexity too much. As you can see, the number of type parameters is already quite large... (One possible advantage of supporting any array type for …) Out of curiosity, could you explain in what kind of situation you have this kind of data?
My data points were generated by many semi-independent sub-systems, and not every sub-system logs the same columns, which means most columns are very sparse.
Hey there,
Thanks for this really nice package! In order to further decrease the memory footprint of my dataframes, I think a sparse version of CategoricalArray might be useful. I'm not really sure how to implement this yet, but I'm willing to take a look. Would you welcome a PR in this direction? Do you have any tips?