feature request: allow skipmissing column types #3398
Comments
I understand your concern and share it. There is a wide difference between "production code" and "data discovery" workflows. What you ask for is already doable with metadata. However, I think a better solution is to have a set of functions that provide an alternative set of behaviors. This is what https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/missing/#Functions-which-skip-missing-values does. The question, though, is how to reach agreement across the package ecosystem on how to approach it.
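As a rough illustration of the metadata route mentioned above, here is a minimal sketch. It assumes DataFrames.jl 1.4 or later (for `colmetadata!` and `colmetadata`). The metadata key `"skipmissing"` and the helpers `tag_skipmissing!` and `reduce_column` are hypothetical names made up for this example; DataFrames.jl attaches no special meaning to them.

```julia
using DataFrames, Statistics

# Hypothetical metadata key; DataFrames.jl does not interpret it in any way.
const SKIP_KEY = "skipmissing"

# Hypothetical helper: mark a column so that `reduce_column` skips its missing values.
tag_skipmissing!(df::AbstractDataFrame, col) =
    colmetadata!(df, col, SKIP_KEY, true; style=:note)

# Hypothetical helper: apply `f` to a column, wrapping it in `skipmissing`
# only when the column carries the tag.
function reduce_column(f, df::AbstractDataFrame, col)
    skip = colmetadata(df, col, SKIP_KEY, false)
    return skip ? f(skipmissing(df[!, col])) : f(df[!, col])
end

df = DataFrame(col2 = [6, 2, 31, 23, missing, missing])
untagged = reduce_column(mean, df, :col2)   # missing propagates as usual
tag_skipmissing!(df, :col2)
tagged = reduce_column(mean, df, :col2)     # missing values are skipped
```

The tag only takes effect in code that checks for it, which is why an ecosystem-wide convention would still be needed.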
From https://discourse.julialang.org/t/why-are-missing-values-not-ignored-by-default/106756/115?u=mkitti it does not appear that hard to do. I'm not clear if this should be part of DataFrames.jl though.
julia> using CSV, DataFrames, Statistics
julia> struct SkipMissingDataFrame
           parent::DataFrame
end
julia> Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)
julia> Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) = skipmissing(Base.getproperty(parent(smdf), sym))
julia> write("blah.csv","""
"col1", "col2"
"5", "6"
"1", "2"
"30", "31"
"22", "23"
"NA"
"50"
""")
65
julia> df = CSV.read("blah.csv", DataFrame; silencewarnings=true);
julia> smdf = SkipMissingDataFrame(df)
SkipMissingDataFrame(6×2 DataFrame
 Row │ col1     col2
     │ String3  Int64?
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing
   6 │ 50       missing )
julia> smdf.col2 |> mean
15.5
julia> smdf.col2 |> x->Iterators.filter(>(10),x) |> mean
27.0
This example is indeed simple, but as soon as you want to support operations on data frames, you have to reimplement the whole DataFrames.jl API. It's doable, but it takes quite some code, and it also creates new issues. I tend to think that this would be better handled with improved macros in DataFramesMeta. See also #2314.
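To make the "reimplement the API" point concrete, here is a small sketch that extends the wrapper from the earlier comment with a few forwarding methods. Which methods to forward is an assumption made for illustration; a usable wrapper would need many more (indexing, `select`, `combine`, `groupby`, and so on).

```julia
using DataFrames, Statistics

# Wrapper from the earlier comment, repeated so the sketch is self-contained.
struct SkipMissingDataFrame
    parent::DataFrame
end
Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)
Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) =
    skipmissing(getproperty(parent(smdf), sym))

# Every additional operation needs its own forwarding method.
Base.propertynames(smdf::SkipMissingDataFrame) = propertynames(parent(smdf))
Base.size(smdf::SkipMissingDataFrame) = size(parent(smdf))
Base.names(smdf::SkipMissingDataFrame) = names(parent(smdf))

smdf = SkipMissingDataFrame(DataFrame(col2 = [6, 2, 31, 23, missing, missing]))
col2_mean = mean(smdf.col2)   # column access goes through `skipmissing`
```

Even these three one-line forwards only cover inspection; anything that returns a new data frame would also have to decide whether to re-wrap its result.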
I understand the rationale for the very elaborate missing logic, in that it forces the user to be explicit about how to handle missing values and potentially avoids sneaky statistical bugs.
However, for "quick and dirty" tasks, just trying to make sense of some data, it quickly becomes cumbersome to constantly be wrapping things in skipmissing or dropmissing etc.
I would love some way to tag columns (or the whole table) as skipmissing'ed so that all future transformations will automatically insert a skipmissing, maybe like transform(df, All() .=> skipmissing) or skipmissing!(df) or such.
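For context, the explicit wrapping that the request wants to avoid looks roughly like this with the existing DataFrames.jl API. The data and the helper name `skipmean` are made up for illustration.

```julia
using DataFrames, Statistics

df = DataFrame(col1 = [1, 2, missing, 6], col2 = [6, 2, 31, missing])

# Explicit form for a single column: compose the reducer with `skipmissing`.
one = combine(df, :col2 => mean ∘ skipmissing => :col2_mean)

# Hypothetical helper so the same wrapping can be broadcast over all columns.
skipmean(x) = mean(skipmissing(x))
all_means = combine(df, names(df) .=> skipmean .=> names(df))
```

Here `one.col2_mean` holds the mean of the three non-missing values in `col2`, and `all_means` has one row with a skip-missing mean per column.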