Add spreadmissings, the backend for column-wise @passmissing #276

Open
wants to merge 1 commit into
base: master
from

Conversation

pdeffebach
Collaborator

@bkamins

This is the initial implementation of spreadmissings, which began as a PR in Missings.jl. It's fortunately much simpler than the original implementation since we don't have to work with non-vector inputs.

I'm actually kind of worried about the utility of this function. It works very well with cor and other functions that need paired observations. But it creates dangerous behavior for things like mean.

A user might write f(x, y) = x .- mean(y). If they do spreadmissings(f)(x, y) the mean(y) part of the function will actually be biased, examining only subsets of y where x is also non-missing.
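A minimal sketch of that bias in plain Julia, using made-up data and only the Statistics standard library. The `keep` mask is an illustrative stand-in for what spreadmissings would do to the inputs, not the PR's code:

```julia
using Statistics

# Made-up data: x has a missing value, y is fully observed.
x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

# Rows where both x and y are non-missing (the "complete cases").
keep = .!ismissing.(x) .& .!ismissing.(y)

# The mean the user probably expects: computed over all of y.
full_mean = mean(y)            # 25.0

# The mean f would actually see after the wrapper drops row 2.
subset_mean = mean(y[keep])    # (10 + 30 + 40) / 3 ≈ 26.67
```

The two means differ even though y itself has no missing values, which is the silent change in meaning described above.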

We need a much more heavy-handed name, like nonmissingpairs or something.

Maybe this doesn't need to get merged before 1.0 considering that this API won't have the same name as @passmissing.

See discourse question here also.

@pdeffebach
Collaborator Author

Okay I've thought about this more.

I think we should still put this in 1.0. We should call it @matchnonmissing, and we should also call the backend function @matchnonmissing.

It's important that people be able to do cor(:x, :y) in DataFramesMeta 1.0.

A more coherent story about automatically creating views which omit missing values, like skipmissings, will have to wait.

nonmissingmask .&= .!ismissing.(v)
end

nonmissinginds = findall(nonmissingmask)
Member

DataFrames._findall will be more efficient

Comment on lines +293 to +299
firstindices = eachindex(first(vecs))
if !all(x -> eachindex(x) == firstindices, vecs)
err_str = "Indices of arguments do not match. Indices are " *
join(string.(eachindex.(vecs)), ", ", " and ") * "."

throw(ArgumentError(err_str))
end
Member

Assuming this is an internal function I think we do not need it. DataFrames.jl enforces 1-based indexing.


nonmissingmask = fill(true, length(vecs[1]))
for v in vecs
nonmissingmask .&= .!ismissing.(v)
Member


Given a function `f`, wraps `f` so that a `view` selecting non-missing
values of `AbstractVector` argument is applied before calling `f`. After `f`
is called, if the result is an `AbstractVector` of the appropriate length,
Member

define "appropriate length"

Member

also comment exactly what happens otherwise.


return out
else
return res
Member

I am not sure about this design. The "appropriate length" part seems fragile. I understand the intention, but it seems incorrect somehow.

I feel that the best design would be to have two separate functions:

  • one leaving the result "as is"
  • the other explicitly requiring the vector to be of matching length

Collaborator Author

Yes, another complication is how the spreading works. DataFrames.jl has some fairly complicated rules regarding spreading scalar values across columns.

Let's say you do

@rtransform @passmissing z = mean(:y)

this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

Member

My intention is to resolve such things with JuliaData/DataFrames.jl#2794.
The challenge here, though, is that you most likely want to combine operations having different rows that are skipped in a single call.

Member

Indeed if the result is an AbstractVector it should be required to have a length equal to the number of non-missing entries in the input. Relying on the exact size to choose the behavior is brittle (you could end up having the expected number of elements just by chance, and then your code would break). But relying on the type seems OK, it's very similar to broadcasting.

Let's say you do

@rtransform @passmissing z = mean(:y)

this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

Well it would also be quite common for users to wish to fill all rows with the mean, including those with missing values. That happens for example if you want to create a group-level variable in a multi-level model.

@@ -283,6 +283,101 @@ macro byrow(args...)
end


struct SpreadMissings{F} <: Function
Member

what is the rationale behind SpreadMissings name? It would be more natural for me to call it something like "complete cases" (or just SkipMissings).

Collaborator Author

It was originally named for the "spreading" that happens when the shorter vector with the non-missing indices is spread out to match the length of the inputs. But I agree that "complete cases" is better.

Member

The problem with the name SkipMissings is that this type is quite different from SkipMissing due to the "spreading" behavior. Actually, this function can be defined as a vectorized passmissing when the wrapped function returns a vector; but when the function returns a scalar it leaves it as a scalar. This behavior is what seems to make the most sense for users: an alternative would be to "spread" scalars so that missing is used in positions which are missing in the input, but that doesn't sound super useful.

So it's hard to find a good name due to this hybrid behavior. I'd almost call it ArrayPassMissing. Anyway for now it's internal.

But what matters is the API that DataFramesMeta (and probably DataFrames at some point) exposes to users. The approach adopted here is to make @passmissing the go-to solution to handle missing values, which would work for mean(:x), cor(:x, :y), scale(:x), etc. It's appealing as most users wouldn't have to care about the subtle differences between passmissing and skipmissing and their potential variants (mean(skipmissing(:x)) would be an alternative to @passmissing mean(:x) which makes a single pass over the data, but most users probably don't care a lot).

Another approach would be to require users to choose what they want to do, but it would be much more annoying to use:

  • @skipmissing would use skipmissing∘f for single arguments, or a new variant of skipmissing for multiple arguments which passes a view of their non-missing entries; both would leave the returned value as-is
  • @passmissing would use passmissing(f) for non-AbstractArray inputs, or spreadmissing(f) for AbstractArray inputs, and would require the output to be an AbstractArray with its length equal to the number of non-missing entries in the input (or possibly broadcast a scalar result to non-missing positions)

@bkamins
Member

bkamins commented Jul 31, 2021

In summary: I still think that what to do here is a hard design choice.

Why do you think we must have it for the 1.0 release? I feel that this is something we could add later (adding it in a 1.1 release will be non-breaking).

@pdeffebach
Collaborator Author

Maybe this does not need to be in the 1.0 release. The main motivation at the moment is cor, which as you know causes lots of slowdown in the H2O benchmarks. I don't know if I've seen any other obvious motivations, but it does seem bad that you can't take the correlation between two vectors which contain missing values at the moment.

Side note, this infrastructure might also be useful for something like the following:

@transform df :z = @if :y .> :x begin :y .- mean(:y) end

which has also been requested.

But these are APIs that we haven't worked out 100% yet. And we should maybe push for 1.0 before deciding them.

@bkamins
Member

bkamins commented Aug 1, 2021

I agree we should solve it, but I assume that whatever we do will not break the other parts of DataFramesMeta.jl. Right?

@pdeffebach
Collaborator Author

No, it does not require deprecating anything. @completecases will be a good 1.X feature.

@bkamins
Member

bkamins commented Aug 1, 2021

But maybe it's more logical to spread it across only non-missing rows.

I was thinking about it.

I think @completecases should relate to what is passed to the function and not how its output is handled.

Observations:

  1. In R na.rm is an argument of a function and the same thing happens - scalar would be spread over all rows.
  2. If someone wants spreading over non-missing rows then in 95% of cases it is a @byrow operation anyway.

The case :x .- mean(:x) is indeed annoying, but I think it should be handled in the classical way, by :x .- mean(skipmissing(:x)).
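For reference, a small sketch of that classical form on a plain vector (made-up data, Statistics standard library only). Inside a DataFramesMeta.jl macro, :x would take the place of x:

```julia
using Statistics

# Made-up column with one missing entry.
x = [1.0, missing, 3.0]

# The mean is taken over the non-missing values only (here 2.0),
# while the subtraction still runs over every row, so the missing
# entry propagates to the result.
centered = x .- mean(skipmissing(x))
# centered is [-1.0, missing, 1.0]
```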

@pdeffebach
Collaborator Author

You might be right, but I'm not 100% sure. Maybe a good compromise would be to require some explicit flags for spreading vector output.

It's definitely more important to spread the result in some sort of future @if operation.

But it seems like this doesn't need to be in 1.0 so we can table this discussion for now.

@pdeffebach
Collaborator Author

I just ran into this issue today. We should definitely fix this, and it should be easier now that you can do transformations with subsets.

@bkamins
Member

bkamins commented Oct 10, 2021

Given the JuliaData/DataFrames.jl#2794 PR we now have if df is a DataFrame:

transform!(dropmissing(df, [:x, :y], view=true), [:x, :y] => cor) |> parent

and if I understand the design of spreadmissings correctly the equivalent would be:

transform!([:x, :y] => completecases(cor))

(I use completecases as a name as I think this is a proper name for what is proposed here)

The limitations of completecases in comparison to going through a view are that:

  • it does not handle AsTable inputs;
  • it does not handle multicolumn outputs;
  • it works correctly with select and transform but does not work correctly with combine.

But I think that is OK, as these are the most common use cases and I think we should optimize for them.

For the case of combine maybe we should have a definition:

completecases(fun::Function, keeprows::Bool=true)

which:

  • when keeprows=true (the default) and a vector is returned makes sure its length is EXACTLY the required length, and then spreads it; if a non-vector is returned it is left "as is"
  • when keeprows=false we do not touch vectors
  • makes sure that only the form of src => completecases(fun) => single_col is accepted (i.e. that multiple columns are not accepted as target column name)
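The first two rules above could be sketched roughly as follows in plain Julia. This is only an illustration under assumptions: `wrap_completecases` is a hypothetical name chosen to avoid clashing with the existing `completecases`, it only accepts vector inputs, and the third rule (a single target column) is not covered because it can only be checked inside the DataFrames.jl transformation machinery.

```julia
# Hypothetical sketch; `wrap_completecases` is an illustrative name, not an existing API.
function wrap_completecases(f, keeprows::Bool=true)
    return function (vecs::AbstractVector...)
        # Rows where every input vector is non-missing.
        mask = trues(length(first(vecs)))
        for v in vecs
            mask .&= .!ismissing.(v)
        end
        inds = findall(mask)

        # Call f on views restricted to the complete rows.
        res = f(map(v -> view(v, inds), vecs)...)

        # Non-vector results, or keeprows=false: return untouched.
        (keeprows && res isa AbstractVector) || return res

        # keeprows=true: the vector must match the number of complete rows exactly.
        length(res) == length(inds) || throw(ArgumentError(
            "result has length $(length(res)) but $(length(inds)) rows are complete"))

        # Spread the result back to the input length, with missing in dropped rows.
        out = Vector{Union{Missing, eltype(res)}}(missing, length(mask))
        out[inds] = res
        return out
    end
end

# Made-up data: rows 1 and 4 are the complete cases.
x = [1.0, missing, 3.0, 4.0]
y = [2.0, 5.0, missing, 1.0]

total = wrap_completecases((a, b) -> sum(a .* b))(x, y)     # scalar, left as is
added = wrap_completecases((a, b) -> a .+ b)(x, y)          # spread back to 4 rows
short = wrap_completecases((a, b) -> a .+ b, false)(x, y)   # 2 elements, untouched
```

With keeprows=true a function that returns a vector of any other length raises an error instead of being silently accepted, which is the point of requiring the length to match exactly.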

If we agreed to this then maybe such a completecases wrapper could even be added to DataFrames.jl, as it already defines completecases on a data frame. The benefit of a completecases wrapper in DataFrames.jl would be that we could cleanly handle both [:x, :y] and AsTable([:x, :y]) kinds of input column specifiers, as the transformation minilanguage interpreter would then be aware of this wrapper.

@nalimilan
Member

* it works correctly with `select` and `transform` but does not work correctly with `combine`

What do you mean by this?

If we agreed to this then maybe such a completecases wrapper could even be added to DataFrames.jl, as it already defines completecases on a data frame. The benefit of a completecases wrapper in DataFrames.jl would be that we could cleanly handle both [:x, :y] and AsTable([:x, :y]) kinds of input column specifiers, as the transformation minilanguage interpreter would then be aware of this wrapper.

I would find it annoying to have a function specific to DataFrames for this, as it would mean that either we still wouldn't provide any solution outside of DataFrames, or that users would have to know two different functions. Maybe we could have a single function which is defined outside of DataFrames, but that can also handle AsTable in the context of DataFrames?

Also, while I agree that completecases is a somewhat nice name, currently that function returns a logical index, which is quite different. Not sure whether that's a real problem or not.

@pdeffebach
Collaborator Author

I think completecases is a good name. But for simplicity I think it makes more sense to keep this behind the @passmissing macro-flag and not export any particular name.

transform!(dropmissing(df, [:x, :y], view=true), [:x, :y] => cor) |> parent

I definitely think this is the right direction, and might be the right implementation. However, I think there are two issues to work through:

  1. We want to make a copy at some point, this will mutate df directly. This is fine for @transform!.
  2. This will preserve values of an existing column, which is not what we want in this case, I think.
julia> foo(x::Real, y::Real) = x + y
foo (generic function with 1 method)

julia> df = DataFrame(y = [1, 2], x = [missing, 5])
2×2 DataFrame
 Row │ y      x
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2        5

julia> @transform df @passmissing :y = foo.(:x, :y)

should return

df.y = [missing, 7]

That is, we overwrite with missing.

Implementing this feature via transformations on sub-dataframes means we don't have to worry about any of the Tables.jl-related stuff, which is nice. Or at least less of it.

@bkamins
Member

bkamins commented Oct 10, 2021

@pdeffebach - where is the definition of @passmissing - as I cannot locate it. In particular - will it make sure that:

  • it is called only in select or transform context? (I find it slightly problematic in combine context but we can discuss this)
  • the result will be stored in a single column?

If the answer to both questions is yes then I think the proposed design is OK.

(all comments by @pdeffebach to the views approach are good and show that this is not a sufficient approach 😄)

@nalimilan - if @pdeffebach wants to define @passmissing it will be DataFrames.jl specific. Simply the rules proposed are tied very tightly to the fact that the output of the transformation is stored in the data frame.
If the answers to my questions above are yes then we do not need to add it to DataFrames.jl. The issue is mostly that the generic function that is not DataFrames.jl aware will not be able to do the part "fill the skipped rows with missing" as it is impossible to accomplish in general. Assume that we have a generic table that the function processes and produces another generic table (new table). There is no generic way to add rows to "new table" in places when we dropped rows for processing.

@pdeffebach
Collaborator Author

pdeffebach commented Oct 11, 2021

@passmissing is a DataFramesMeta.jl-specific macro-flag. It wraps ByRow functions in passmissing to ensure missing propagation, i.e.

@rtransform df :y_float = parse(Float64, :y)

Currently this does not work for column-wise transformations (@transform instead of @rtransform). I think this functionality would be @passmissing for column-wise transformations.

  • it is called only in select or transform context? (I find it slightly problematic in combine context but we can discuss this)

It should be able to be called everywhere, I think, including @combine. Say a function is defined as

collapse(x::AbstractVector{Float64}, y::AbstractVector{Float64})

calling @combine df :z = collapse(:x, :y) will fail if :x or :y contain missings. But doing

@combine df @passmissing :z = collapse(:x, :y)

should enable users to get around this problem.

  • the result will be stored in a single column?

No, I think we should also support @astable. But we don't have @astable working with @passmissing row-wise yet, so I don't know exactly how it will work.

@nalimilan
Member

Yes in this PR we're talking about a DataFrames-specific API, but I think it would be good to keep in mind the broader picture to avoid inconsistencies with the rest of the ecosystem once we add a function to Missings.jl -- just like currently @passmissing parallels passmissing.

If the answers to my questions above are yes then we do not need to add it to DataFrames.jl. The issue is mostly that the generic function that is not DataFrames.jl aware will not be able to do the part "fill the skipped rows with missing" as it is impossible to accomplish in general. Assume that we have a generic table that the function processes and produces another generic table (new table). There is no generic way to add rows to "new table" in places when we dropped rows for processing.

@bkamins Right. But as I proposed above we could override the behavior of a more general function (say, completecases, spreadmissing or passmissings) in the particular context of select/transform/etc. That would be quite ad hoc, but there's no risk of ambiguities as the AsTable destination is specific to DataFrames anyway.

Regarding combine, what problem do you see?

@bkamins
Member

bkamins commented Oct 11, 2021

Regarding combine the problem is e.g.:

@combine df @passmissing quantile(:x, 0.0:0.1:1.0)

will try to expand the 11-element vector returned by quantile to nrow(df) rows if, after skipping missing values, :x has exactly 11 rows; otherwise it will fail. In practice the user just wanted to get 11 rows of data in every case.
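A small sketch of the mismatch on a plain vector, with made-up data and only the Statistics standard library:

```julia
using Statistics

# Made-up column: 5 rows, 4 of them non-missing.
x = [1.0, missing, 3.0, 4.0, 5.0]

# quantile returns one value per requested probability,
# no matter how many observations went in.
q = quantile(collect(skipmissing(x)), 0.0:0.1:1.0)

n_quantiles = length(q)              # 11
n_complete = count(!ismissing, x)    # 4
```

Since 11 != 4, a rule that requires the result length to equal the number of non-missing rows would reject this result, and it would only pass by coincidence when exactly 11 rows happen to be non-missing.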

@pdeffebach
Collaborator Author

But this is an issue with the current implementation, right? I think we can get around this with some combination of the following

  1. @select and @transform do some combination of view for subsetting, then copy, then transform! and overwriting existing columns.

  2. @combine just does the view and copy, but calls @combine directly.

@bkamins
Member

bkamins commented Oct 12, 2021

I am talking about the design, not about the implementation (which I think is a secondary issue and can surely be resolved). What I am asking for is clear rules for how the operation is performed when called in @select vs @combine context. And if the behavior is different, it is crucial to clearly explain this to the users.

@nalimilan
Member

Yes for quantile with multiple quantiles the standard approach won't work (neither with combine nor with select actually). We could decide to use a different behavior with combine than with select and transform (i.e. do not "spread" the result), which is what @pdeffebach proposes IIUC. That would make sense given that select can be used instead when a vector with one value for each input row is returned. The drawback is of course that the behavior would be more complex to explain.

Anyway maybe we could disallow using @passmissing with combine when a vector is returned as it shouldn't be super common (the most common case is returning a single scalar or a vector with one entry per input row).

@bkamins
Member

bkamins commented Oct 12, 2021

Anyway maybe we could disallow

I think it is a good approach. Given this is an experimental feature I would limit the behavior to the cases we are sure are common and we are clear how they should work.

3 participants