Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spreadmissings, the backend for column-wise @passmissing #276

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions src/macros.jl
Original file line number Diff line number Diff line change
Expand Up @@ -283,6 +283,101 @@ macro byrow(args...)
end


struct SpreadMissings{F} <: Function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the rationale behind SpreadMissings name? It would be more natural for me to call it something like "complete cases" (or just SkipMissings).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally named for the "spreading" that happens when the shorter vector with the non-missing indices is spread out to match the length of the inputs. But I agree that "complete cases" is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the name SkipMissings is that this type is quite different from SkipMissing due to the "spreading" behavior. Actually, this function can be defined as a vectorized passmissing when the wrapped function returns a vector; but when the function returns a scalar it leaves it as a scalar. This behavior is what seems to make the most sense for users: an alternative would be to "spread" scalars so that missing is used in positions which are missing in the input, but that doesn't sound super useful.

So it's hard to find a good name due to this hybrid behavior. I'd almost call it ArrayPassMissing. Anyway for now it's internal.

But what matters is the API that DataFramesMeta (and probably DataFrames at some point) exposes to users. The approach adopted here is to make @passmissing the go-to solution to handle missing values, which would work for mean(:x), cor(:x, :y), scale(:x), etc. It's appealing as most users wouldn't have to care about the subtle differences between passmissing and skipmissing and their potential variants (mean(skipmissing(:x)) would be an alternative to @passmissing mean(:x) which makes a single pass over the data, but most users probably don't care a lot).

Another approach would be to require users to choose what they want to do, but it would be much more annoying to use:

  • @skipmissing would use skipmissing∘f for single arguments, or a new variant of skipmissing for multiple arguments which passes a view of their non-missing entries; both would leave the returned value as-is
  • @passmissing would use passmissing(f) for AbstractArray inputs, or spreadmissing(f) for AbstractArray inputs, and would require the output to be an AbstractArray with its length equal to the number of non-missing entries in the input (or possibly broadcast a scalar result to non-missing positions)

f::F
end

function (f::SpreadMissings{F})(vecs::AbstractVector...) where {F}
if any(x -> x isa AbstractVector{>:Missing}, vecs)

firstindices = eachindex(first(vecs))
if !all(x -> eachindex(x) == firstindices, vecs)
err_str = "Indices of arguments do not match. Indices are " *
join(string.eachindex.(vecs), ", ", " and ") * "."

throw(ArgumentError(err_str))
end
Comment on lines +293 to +299
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this is an internal function I think we do not need it. DataFrames.jl enforces 1-based indexing.


nonmissingmask = fill(true, length(vecs[1]))
for v in vecs
nonmissingmask .&= .!ismissing.(v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end

nonmissinginds = findall(nonmissingmask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrames._findall will be more efficient


vecs_counter = 1
newargs = ntuple(i -> view(vecs[i], nonmissinginds), length(vecs))

res = f.f(newargs...)

if res isa AbstractVector && length(res) == length(nonmissinginds)
out = similar(res, Union{eltype(res), Missing}, length(vecs[1]))
fill!(out, missing)
out[nonmissinginds] .= res

return out
else
return res
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this design. The "appropriate length" part seems fragile. I understand the intention, but it seems incorrect somehow.

I feel that the best design would be to have two separate functions:

  • one leaving the result "as is"
  • the other explicitly requiring the vector to be of matching length

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, another complication is how the spreading works. DataFrames.jl has some fairly complicated rules regarding spreading scalar values across columns.

Let's say you do

@rtransform @passmissing z = mean(:y)

this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to resolve such things with JuliaData/DataFrames.jl#2794.
The challenge here, though, is that you most likely want to combine operations having different rows that are skipped in a single call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed if the result is an AbstractVector it should be required to have a length equal to the number of non-missing entries in the input. Relying on the exact size to choose the behavior is brittle (you could end up having the expected number of elements just by chance, and then your code would break). But relying on the type seems OK, it's very similar to broadcasting.

Let's say you do

@rtransform @passmissing z = mean(:y)

this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.

Well it would also be quite common for users to wish to fill all rows with the mean, including those with missing values. That happens for example if you want to create a group-level variable in a multi-level model.

end
else
return f.f(vecs...)
end
end

"""
spreadmissings(f)

Given a function `f`, wraps `f` so that a `view` selecting non-missing
values of `AbstractVector` argument is applied before calling `f`. After `f`
is called, if the result is a `AbstractVector` of the appropriate length,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define "appropriate length"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also comment exactly what happens otherwise.

is is "spread" across the length of the original inputs, imputing `missing` where
indices of inputs contain `missing` values.

Consider `spreadmissing(f)(x1, x2)`, where `x1` and `x2` are defined as

```
x = [1, missing, 3, 4]
y = [5, 6, 7, missing]
```

and the function

```
no_missing(x::Int, y::Int) = x + y
f(x::AbstractVector, y::AbstractVector) = no_missing.(last(x), y)
```

Then the indices for which both `x` and `y` are non-missing are `[1, 2]`.
`spreadmissings(f)` then calls `f(view(x, [1, 3]), view(y, [1, 3]))`. This will return
`[8, 10]`.

Finally, `spreadmissings(f)` will "spread" the vector `[7, 8]` across the non-missing
indices of the inputs, filling in `missing`, otherwise. As a result we have

```
julia> DataFramesMeta.spreadmissings(f)(x, y)
4-element Vector{Union{Missing, Int64}}:
8
missing
10
missing
```

If the result of `f(view()...)` is not a vector, the it is returned as-is.

### Examples

```
julia> x = [1, 2, 3, 4, missing, 6]; y = [11, 5, missing, 8, 15, 12];

julia> DataFramesMeta.spreadmissings(cor)(x, y)
0.3803061508332227
```


"""
spreadmissings(f) = SpreadMissings{Core.Typeof(f)}(f)

##############################################################################
##
## @with
Expand Down