Add `spreadmissings`, the backend for column-wise `@passmissing` #276

pdeffebach · 2021-07-31T16:11:07Z

This is the initial implementation of spreadmissings, which began as a PR in Missings.jl. It's fortunately much simpler than the original implementation since we don't have to work with non-vector inputs.

I'm actually kind of worried about the utility of this function. It works very well with cor and other functions that operate on pairs of observations. But it creates dangerous behavior for things like mean.

A user might write f(x, y) = x .- mean(y). If they do spreadmissings(f)(x, y) the mean(y) part of the function will actually be biased, examining only subsets of y where x is also non-missing.
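To make the concern concrete, here is a minimal sketch in plain Julia. The helper `complete_case_views` is hypothetical and only stands in for what `spreadmissings` does to its inputs; it is not the code from this PR.

```julia
using Statistics

# Hypothetical helper: keep only the positions where every vector is non-missing.
function complete_case_views(vecs::AbstractVector...)
    mask = trues(length(first(vecs)))
    for v in vecs
        mask .&= .!ismissing.(v)
    end
    return map(v -> view(v, mask), vecs)
end

f(x, y) = x .- mean(y)

x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

xv, yv = complete_case_views(x, y)

mean(y)     # 25.0, the mean over all of y
mean(yv)    # mean over only the rows where x is also present
f(xv, yv)   # centered on the subset mean, not on mean(y)
```

The second row of `y` is dropped even though `y` itself has no missing values, which is exactly the silent bias described above.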

We need a much more heavy-handed name, like nonmissingpairs or something.

Maybe this doesn't need to get merged before 1.0 considering that this API won't have the same name as @passmissing.

See discourse question here also.

pdeffebach · 2021-07-31T18:34:22Z

Okay I've thought about this more.

I think we should still put this in 1.0. We should call it @matchnonmissing, and we should also call the backend function @matchnonmissing.

It's important that people be able to do cor(:x, :y) in DataFramesMeta 1.0.

A more coherent story about automatically creating views which omit missing values, like skipmissings, will have to wait.

bkamins · 2021-07-31T19:35:56Z

src/macros.jl

+            nonmissingmask .&= .!ismissing.(v)
+        end
+
+        nonmissinginds = findall(nonmissingmask)


DataFrames._findall will be more efficient

bkamins · 2021-07-31T19:37:20Z

src/macros.jl

+        firstindices = eachindex(first(vecs))
+        if !all(x -> eachindex(x) == firstindices, vecs)
+            err_str = "Indices of arguments do not match. Indices are " *
+                      join(string.eachindex.(vecs), ", ", " and ") * "."
+
+            throw(ArgumentError(err_str))
+        end


Assuming this is an internal function I think we do not need it. DataFrames.jl enforces 1-based indexing.

bkamins · 2021-07-31T19:39:25Z

src/macros.jl

+
+        nonmissingmask = fill(true, length(vecs[1]))
+        for v in vecs
+            nonmissingmask .&= .!ismissing.(v)


this is inefficient. See https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/abstractdataframe.jl#L758 for an efficient implementation.
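The linked DataFrames.jl code is not reproduced here. As a rough sketch of the kind of shortcut available, assuming 1-based vectors of equal length, one can use a `BitVector` and skip any vector whose element type cannot hold `missing`. The name `nonmissing_mask` is made up for illustration.

```julia
# Hypothetical helper, not the DataFrames.jl implementation.
function nonmissing_mask(vecs::AbstractVector...)
    n = length(first(vecs))
    mask = trues(n)
    for v in vecs
        # A vector whose element type cannot hold `missing` never changes the mask.
        Missing <: eltype(v) || continue
        for i in 1:n
            mask[i] &= !ismissing(v[i])
        end
    end
    return mask
end

mask = nonmissing_mask([1, missing, 3], [1.0, 2.0, 3.0], ["a", "b", missing])
```

Here the second vector is skipped entirely because its element type is `Float64`.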

bkamins · 2021-07-31T19:41:04Z

src/macros.jl

+
+Given a function `f`, wraps `f` so that a `view` selecting non-missing
+values of `AbstractVector` argument is applied before calling `f`. After `f`
+is called, if the result is a `AbstractVector` of the appropriate length,


define "appropriate length"

also comment exactly what happens otherwise.

bkamins · 2021-07-31T19:45:14Z

src/macros.jl

+
+            return out
+        else
+            return res


I am not sure about this design. The "appropriate length" part seems fragile. I understand the intention, but it seems incorrect somehow.

I feel that the best design would be to have two separate functions:

* one leaving the result "as is";
* the other explicitly requiring the vector to be of matching length.

Yes, another complication is how the spreading works. DataFrames.jl has some fairly complicated rules regarding spreading scalar values across columns.

Let's say you do

@rtransform @passmissing z = mean(:y)

this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

My intention is to resolve such things with JuliaData/DataFrames.jl#2794.
The challenge here, though, is that you most likely want to combine operations having different rows that are skipped in a single call.

Indeed if the result is an AbstractVector it should be required to have a length equal to the number of non-missing entries in the input. Relying on the exact size to choose the behavior is brittle (you could end up having the expected number of elements just by chance, and then your code would break). But relying on the type seems OK, it's very similar to broadcasting.
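A sketch of that type-based rule, under the assumption that all inputs are 1-based vectors of equal length. The name `call_on_complete_cases` is hypothetical and this is not the implementation in the PR.

```julia
using Statistics

# Hypothetical wrapper illustrating the type-based rule; names are not from the PR.
function call_on_complete_cases(f, vecs::AbstractVector...)
    mask = trues(length(first(vecs)))
    for v in vecs
        mask .&= .!ismissing.(v)
    end
    res = f(map(v -> view(v, mask), vecs)...)

    # Non-vector results (e.g. a scalar from `cor`) are returned untouched.
    res isa AbstractVector || return res

    # Vector results must have one entry per complete case; anything else is an error.
    n = count(mask)
    length(res) == n ||
        throw(ArgumentError("result has length $(length(res)), expected $n"))

    # Spread the result back to the input length, with `missing` in skipped rows.
    out = Vector{Union{Missing, eltype(res)}}(missing, length(mask))
    out[mask] = res
    return out
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 9.0, 8.0]

r = call_on_complete_cases(cor, x, y)                 # scalar, left as is
s = call_on_complete_cases((a, b) -> a .+ b, x, y)    # spread back over all rows
```

The branch is chosen by the type of the result, and a vector of the wrong length is an error rather than a silent fallback.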

> Let's say you do
>
> @rtransform @passmissing z = mean(:y)
>
> this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

Well it would also be quite common for users to wish to fill all rows with the mean, including those with missing values. That happens for example if you want to create a group-level variable in a multi-level model.

bkamins · 2021-07-31T19:46:49Z

src/macros.jl

@@ -283,6 +283,101 @@ macro byrow(args...)
 end


+struct SpreadMissings{F} <: Function


what is the rationale behind SpreadMissings name? It would be more natural for me to call it something like "complete cases" (or just SkipMissings).

It was originally named for the "spreading" that happens when the shorter vector with the non-missing indices is spread out to match the length of the inputs. But I agree that "complete cases" is better.

The problem with the name SkipMissings is that this type is quite different from SkipMissing due to the "spreading" behavior. Actually, this function can be defined as a vectorized passmissing when the wrapped function returns a vector; but when the function returns a scalar it leaves it as a scalar. This behavior is what seems to make the most sense for users: an alternative would be to "spread" scalars so that missing is used in positions which are missing in the input, but that doesn't sound super useful.

So it's hard to find a good name due to this hybrid behavior. I'd almost call it ArrayPassMissing. Anyway for now it's internal.

But what matters is the API that DataFramesMeta (and probably DataFrames at some point) exposes to users. The approach adopted here is to make @passmissing the go-to solution to handle missing values, which would work for mean(:x), cor(:x, :y), scale(:x), etc. It's appealing as most users wouldn't have to care about the subtle differences between passmissing and skipmissing and their potential variants (mean(skipmissing(:x)) would be an alternative to @passmissing mean(:x) which makes a single pass over the data, but most users probably don't care a lot).

Another approach would be to require users to choose what they want to do, but it would be much more annoying to use:

* @skipmissing would use skipmissing∘f for single arguments, or a new variant of skipmissing for multiple arguments which passes a view of their non-missing entries; both would leave the returned value as-is

* @passmissing would use passmissing(f) for non-AbstractArray inputs, or spreadmissing(f) for AbstractArray inputs, and would require the output to be an AbstractArray with its length equal to the number of non-missing entries in the input (or possibly broadcast a scalar result to non-missing positions)

bkamins · 2021-07-31T19:49:22Z

In summary: I still think it is a hard design choice what to do here.

Why do you think we must have it for the 1.0 release? I feel that this is something we could add later (adding it in a 1.1 release will be non-breaking).

pdeffebach · 2021-07-31T20:30:30Z

Maybe this does not need to be in the 1.0 release. The main motivation at the moment is cor, which as you know causes lots of slowdown in the H2O benchmarks. I don't know if I've seen any other obvious motivations, but it does seem bad that you can't take the correlation between two vectors which contain missing values at the moment.

Side note, this infrastructure might also be useful for something like the following:

@transform df :z = @if :y .> :x begin :y .- mean(:y) end

which has also been requested.

But these are APIs that we haven't worked out 100% yet. And we should maybe push for 1.0 before deciding them.

bkamins · 2021-08-01T06:43:01Z

I agree we should solve it, but I assume that whatever we do will not break the other parts of DataFramesMeta.jl. Right?

pdeffebach · 2021-08-01T11:55:26Z

No, it does not require deprecating anything. @completecases will be a good 1.X feature.

bkamins · 2021-08-01T12:25:34Z

> But maybe it's more logical to spread it across only non-missing rows.

I was thinking about it.

I think @completecases should relate to what is passed to the function and not how its output is handled.

Observations:

* In R na.rm is an argument of a function and the same thing happens: a scalar would be spread over all rows.
* If someone wants spreading over non-missing rows then in 95% of cases it is a @byrow operation anyway.

The case :x .- mean(:x) is indeed annoying, but I think it should be handled in the classical way, with :x .- mean(skipmissing(:x)).
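Outside of any macro, on a plain vector, the classical approach looks like this (a small illustration, with made-up data):

```julia
using Statistics

x = [1.0, missing, 3.0, 5.0]

# Mean over the non-missing entries only.
m = mean(skipmissing(x))

# Broadcasting keeps `missing` wherever x is missing.
centered = x .- m
```

The mean uses every non-missing value of `x`, and the subtraction propagates `missing` row by row without any extra machinery.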

pdeffebach · 2021-08-01T12:45:50Z

You might be right, but I'm not 100% sure. Maybe a good compromise would be to require some explicit flags for spreading vector output.

It's definitely more important to spread the result in some sort of future @if operation.

But it seems like this doesn't need to be in 1.0 so we can table this discussion for now.

pdeffebach · 2021-09-19T19:10:58Z

I just ran into this issue today. We should definitely fix this, and it should be easier now that you can do transformations with subsets.

bkamins · 2021-10-10T19:50:22Z

Given the JuliaData/DataFrames.jl#2794 PR we now have if df is a DataFrame:

transform!(dropmissing(df, [:x, :y], view=true), [:x, :y] => cor) |> parent

and if I understand the design of spreadmissings correctly the equivalent would be:

transform!(df, [:x, :y] => completecases(cor))

(I use completecases as a name as I think this is a proper name for what is proposed here)

A limitation of completecases in comparison to going through a view is that:

* it does not handle AsTable inputs;
* it does not handle multicolumn outputs;
* it works correctly with select and transform but does not work correctly with combine.

But I think it is OK as these are the most common use cases and I think we should optimize for them.

For the case of combine maybe we should have a definition:

completecases(fun::Function, keeprows::Bool=true)

which:

* when keeprows=true (the default) and a vector is returned makes sure its length is EXACTLY the required length, and then spreads it; if a non-vector is returned it is left "as is"
* when keeprows=false we do not touch vectors
* makes sure that only the form of src => completecases(fun) => single_col is accepted (i.e. that multiple columns are not accepted as target column name)

If we agreed to this then maybe even such a completecases wrapper could be added to DataFrames.jl, as it already defines completecases on a data frame. The benefit of a completecases wrapper in DataFrames.jl would be that we would be able to cleanly handle both [:x, :y] and AsTable([:x, :y]) kinds of input column specifiers, as the transformation minilanguage interpreter would then be aware of this wrapper.

nalimilan · 2021-10-10T20:42:12Z

> * it works correctly with `select` and `transform` but does not work correctly with `combine`

What do you mean by this?

> If we agreed to this then maybe even such completecases wrapper could be added to DataFrames.jl as it already defines completecases on data frame. The benefit of a completecases wrapper in DataFrames.jl would be that then we would be able to cleanly handle both [:x, :y] and also AsTable([:x, :y]) kinds of input column specifiers as transformation minilanguage interpreter would be then aware of this wrapper.

I would find it annoying to have a function specific to DataFrames for this, as it would mean that either we still wouldn't provide any solution outside of DataFrames, or that users would have to know two different functions. Maybe we could have a single function which is defined outside of DataFrames, but that can also handle AsTable in the context of DataFrames?

Also while I agree that completecases is a somewhat nice name, currently that function returns a local index, which is quite different. Not sure whether that's a real problem or not.

pdeffebach · 2021-10-10T20:53:17Z

I think completecases is a good name. But for simplicity I think it makes more sense to keep this behind the @passmissing macro-flag and not export any particular name.

> transform!(dropmissing(df, [:x, :y], view=true), [:x, :y] => cor) |> parent

I definitely think this is the right direction, and might be the right implementation. However I think there are two issues to work through:

* We want to make a copy at some point; this will mutate df directly. This is fine for @transform!.
* This will preserve values of an existing column, which is not what we want in this case, I think.

julia> foo(x::Real, y::Real) = x + y
foo (generic function with 1 method)

julia> df = DataFrame(y = [1, 2], x = [missing, 5])
2×2 DataFrame
 Row │ y      x
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2        5

julia> @transform df @passmissing :y = foo.(:x, :y)

should return

df.y = [missing, 7]

Implementing this feature via transformations on sub-dataframes means we don't have to worry about any of the Tables.jl-related stuff, which is nice. Or at least less of it.

That is, we overwrite with missing.
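The intended result can be reproduced on plain vectors. In the sketch below, `passmissing_sketch` is a hypothetical stand-in for `passmissing` from Missings.jl, written out so the example needs no packages; it is not how @passmissing is implemented.

```julia
foo(x::Real, y::Real) = x + y

# Stand-in for Missings.passmissing: return `missing` if any argument is missing.
passmissing_sketch(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

x = [missing, 5]
y = [1, 2]

# `foo.(x, y)` alone would throw a MethodError on the first row.
result = passmissing_sketch(foo).(x, y)   # [missing, 7]
```

The first row becomes `missing` rather than keeping the old value of `y`, which is the overwrite behavior described above.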

bkamins · 2021-10-10T22:03:47Z

@pdeffebach - where is the definition of @passmissing - as I cannot locate it. In particular - will it make sure that:

* it is called only in select or transform context? (I find it slightly problematic in combine context but we can discuss this)
* the result will be stored in a single column?

If the answer to both questions is yes then I think the proposed design is OK.

(all comments by @pdeffebach to the views approach are good and show that this is not a sufficient approach 😄)

@nalimilan - if @pdeffebach wants to define @passmissing it will be DataFrames.jl specific. Simply the rules proposed are tied very tightly to the fact that the output of the transformation is stored in the data frame.
If the answers to my questions above are yes then we do not need to add it to DataFrames.jl. The issue is mostly that the generic function that is not DataFrames.jl aware will not be able to do the part "fill the skipped rows with missing" as it is impossible to accomplish in general. Assume that we have a generic table that the function processes and produces another generic table (new table). There is no generic way to add rows to "new table" in places when we dropped rows for processing.

pdeffebach · 2021-10-11T12:23:15Z

@passmissing is a DataFramesMeta.jl-specific macro-flag. It wraps ByRow functions in passmissing to ensure missing propagation, i.e.

@rtransform df @passmissing :y_float = parse(Float64, :y)

currently this does not work for column-wise transformation (@transform instead of @rtransform). I think this functionality would be @passmissing for column-wise transformations.

> it is called only in select or transform context? (I find it slightly problematic in combine context but we can discuss this)

It should be able to be called everywhere, I think, including @combine. Say a function is defined as

collapse(x::AbstractVector{Float64}, y::AbstractVector{Float64})

calling @combine df :z = collapse(:x, :y) will fail if :x or :y contain missings. But doing

@combine df @passmissing :z = collapse(:x, :y)

should enable users to get around this problem.

> the result will be stored in a single column?

No, I think we should also support @astable. But we don't have @astable working with @passmissing row-wise yet, so I don't know exactly how it will work.

nalimilan · 2021-10-11T14:04:13Z

Yes in this PR we're talking about a DataFrames-specific API, but I think it would be good to keep in mind the broader picture to avoid inconsistencies with the rest of the ecosystem once we add a function to Missings.jl -- just like currently @passmissing parallels passmissing.

> If the answers to my questions above are yes then we do not need to add it to DataFrames.jl. The issue is mostly that the generic function that is not DataFrames.jl aware will not be able to do the part "fill the skipped rows with missing" as it is impossible to accomplish in general. Assume that we have a generic table that the function processes and produces another generic table (new table). There is no generic way to add rows to "new table" in places when we dropped rows for processing.

@bkamins Right. But as I proposed above we could override the behavior of a more general function (say, completecases, spreadmissing or passmissings) in the particular context of select/transform/etc. That would be quite ad hoc, but there's no risk of ambiguities as the AsTable destination is specific to DataFrames anyway.

Regarding combine, what problem do you see?

bkamins · 2021-10-11T14:16:49Z

Regarding combine the problem is e.g.:

@combine df @passmissing quantile(:x, 0.0:0.1:1.0)

will try to expand the 11-element vector returned by quantile to nrow(df) rows if, after skipping missing values, :x has exactly 11 rows; otherwise it will fail. In practice the user just wanted to get 11 rows of data, always.
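The mismatch is easy to see on a plain vector (illustrative data, not from the PR):

```julia
using Statistics

# 22 rows, 20 of them non-missing.
x = [collect(1.0:20.0); missing; missing]

q = quantile(skipmissing(x), 0.0:0.1:1.0)

n_quantiles = length(q)             # 11, regardless of the data
n_complete = count(!ismissing, x)   # 20
```

A rule that spreads any vector whose length equals the number of complete cases would reject this result, and would silently spread it whenever the data happened to have exactly 11 non-missing rows.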

pdeffebach · 2021-10-11T21:59:36Z

But this is an issue with the current implementation, right? I think we can get around this with some combination of the following

* @select and @transform do some combination of view for subsetting, then copy, then transform! and overwriting existing columns.
* @combine just does the view and copy, but calls @combine directly.

bkamins · 2021-10-12T10:44:38Z

I am talking about the design, not about the implementation (which I think is a secondary issue and can surely be resolved). What I am asking for is clear rules for how the operation is performed when called in a @select vs. a @combine context. And if the behavior is different, it is crucial to clearly explain this to the users.

nalimilan · 2021-10-12T12:28:46Z

Yes for quantile with multiple quantiles the standard approach won't work (neither with combine nor with select actually). We could decide to use a different behavior with combine than with select and transform (i.e. do not "spread" the result), which is what @pdeffebach proposes IIUC. That would make sense given that select can be used instead when a vector with one value for each input row is returned. The drawback is of course that the behavior would be more complex to explain.

Anyway maybe we could disallow using @passmissing with combine when a vector is returned as it shouldn't be super common (the most common case is returning a single scalar or a vector with one entry per input row).

bkamins · 2021-10-12T17:44:19Z

> Anyway maybe we could disallow

I think it is a good approach. Given this is an experimental feature I would limit the behavior to the cases we are sure are common and we are clear how they should work.

initial commit

758b1d6

bkamins reviewed Jul 31, 2021


This was referenced Sep 23, 2021

Another attempt at an astable flag #298

Merged

Missing values and weighting JuliaStats/Statistics.jl#88

Open
