Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve flatten (slightly breaking) #2767

Closed
bkamins opened this issue May 17, 2021 · 10 comments · Fixed by #3283
Closed

Improve flatten (slightly breaking) #2767

bkamins opened this issue May 17, 2021 · 10 comments · Fixed by #3283
Labels
breaking The proposed change is breaking. decision
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented May 17, 2021

Users find it confusing that strings are iterable, see https://discourse.julialang.org/t/flatten-in-case-column-contains-string-and-array/61284/5. They are not iterable in broadcasting. On the other hand we should not follow the full semantics of Base.broadcastable as it will then eg. disallow Dicts (this was the original reason why I went for iterator interface).

I think the way to go is to create a list of special cases. Here is a collection of standard "suspects that I think we should not flatten by default:

  • AbstractString
  • Pair
  • Markdown.MD

the question is if we want them to be left "as is" or throw an error. If we leave them "as is" then probably we should also consider what other types we want to leave "as is" (they currently error):

  • Symbol
  • Missing
  • Nothing

(I have listed only the most common cases)

The general question in the first place is: do we find it useful to complicate the API of flatten. My fist instinct is to throw an error on the cases we do not want to allow being flattened as this will be simpler.

@bkamins bkamins added decision breaking The proposed change is breaking. labels May 17, 2021
@bkamins bkamins added this to the 1.x milestone May 17, 2021
@bkamins
Copy link
Member Author

bkamins commented May 17, 2021

As proposed in #2766 we could:

  • error by default in the cases I have listed
  • add a kwarg that would leave "as is" all values that do not support length or form my list

@pdeffebach
Copy link
Contributor

I am much more concerned about supporting missing than AbstractString. Though I understand both are important, I think the probability of having a mixture of Vector{String} and String is unlikely.

@bkamins
Copy link
Member Author

bkamins commented May 17, 2021

Yeah - that is why I have listed all options to consider 😄. I also think missing will be much more common.

@sprmnt21 - thank you for raising this issue as it really is relevant.

@knuesel
Copy link
Contributor

knuesel commented May 17, 2021

Maybe missing should be handled orthogonally to this issue, with an allowmissing keyword?

I think errors on scalars (for example symbols) are unexpected: since numbers (the most common scalars) work, I would expect that every value is treated either as scalar or as collection, and errors would only occur when there is a "real" issue like missing.

We could have either a blacklist (AbstractString, Pair, ...) or a whitelist (AbstractArray, AbstractDict, Tuple, Generator...). In either case, it seems a bit arbitrary to have a fixed list, so maybe we should also have a keyword argument for the user to supply their own list? For example the user could call flatten(df, :a, scalars=[AbstractString]) to solve the issue in the Discourse thread.

@pdeffebach
Copy link
Contributor

pdeffebach commented May 17, 2021

I think we should take no methods for length seriously and not do a white-list. I'm not super sure about keyword arguments. This seems rate enough that people can do cleaning on their own before, i.e. put all Strings in [x].

Do we know if this problem came about "naturally" or was it trying to work through the docs? because the post on discourse certainly seems a bit contrived.

@sprmnt21
Copy link

sprmnt21 commented May 17, 2021

an example of origin, I don't know how "natural", of this situation is the dataframe originating from the following expressions

                  using DataFrames
        list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","5666TB:GHLA","789789TSB:ASLT","2324R:CDCL","79067P:OPRA","79067R:OPRA"])
        list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
        
        function pairing(s1,s2)
                      if length(s1)==length(s2)
                          idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
                          return Set([s1[idx],s2[idx]])
                      else
                          return Set([])
                      end
                  end
                  list = leftjoin(list1,list2, on = :ID)
                  nr=nrow(list)
                  list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list.ID[i],list.ID[j]) for j in 1:nr]),i)] for i in 1:nr]
                  list.pair1=[coalesce(list.Pair[t[1]],list.Pair[t[2]])   for t in list.twin ]
                  list.pair_1=[coalesce(list.pair1[r],unique(list.ID[list.twin[r]])) for r in 1:nr]
                  sort(list,:Pair)

which was an attempt to resolve this problem

but, I also believe, that it is quite artificial and unlikely.

@knuesel
Copy link
Contributor

knuesel commented May 18, 2021

Good point about taking no length seriously.

I also haven't seen a real-life case... I guess flattening a heterogeneous column is uncommon. So maybe not worth adding a scalars or similar keyword. On the other hand, cleaning the data by adding wrappers would probably have much worse performance?

@ianqsong
Copy link

ianqsong commented May 19, 2021

Following the hint of Base.broadcastable, which uses axes, it leads to the implementation below.
Note that cell x would be flattened according to length(axes(x)[1]), i.e., only the outer most layer (bracket [ ] for vector, ( ) for tuple) would be removed (may be called flatten in depth level 1). Eg. the column with two items

[["aa", ["bb"]], "cc"]

is flattened to three items ["aa", ["bb"], "cc"].

function flatten_depth1(df::AbstractDataFrame,
                 cols::Union{ColumnIndex, MultiColumnIndex})
    _check_consistency(df)

    idxcols = index(df)[cols]
    isempty(idxcols) && return copy(df)
    col1 = first(idxcols)
    # lengths = length.(df[!, col1])
    flen(x) = try length(axes(x)[1]) catch; 1 end
    lengths = flen.(df[!, col1])
    if length(idxcols) > 1
        for col in idxcols[2:end]
            v = df[!, col]
            if any(x -> flen(x[1]) != x[2], zip(v, lengths))
                r = findfirst(x -> x != 0, length.(v) .- lengths)
                colnames = _names(df)
                throw(ArgumentError("Lengths of iterables stored in columns :$(colnames[col1]) " *
                                    "and :$(colnames[col]) are not the same in row $r"))
            end
	end
        sort!(idxcols)
    end

    new_df = similar(df[!, Not(cols)], sum(lengths))
    for name in _names(new_df)
        repeat_lengths!(new_df[!, name], df[!, name], lengths)
    end
    # length(idxcols) > 1 && sort!(idxcols)
    for col in idxcols
        # col_to_flatten = df[!, col]
        # flattened_col = col_to_flatten isa AbstractVector{<:AbstractVector} ?
        #     reduce(vcat, [i for i in j, j in df[!, col]]) :
        #     collect(Iterators.flatten(col_to_flatten))

        # [["aa", ["bb"]], "cc", ("dd", "ee")] flattened to ["aa", ["bb"], "cc", "dd", "ee"]
        flattened_col = mapreduce(vcat, df[!, col], lengths) do i,l
            l > 1 ? [j for j in i] : ifelse(isa(i, AbstractVector), i, [i])
        end
        insertcols!(new_df, col, _names(df)[col] => flattened_col)
    end

    return new_df
end

@bkamins bkamins modified the milestones: 1.x, 1.5 Aug 15, 2022
@bkamins bkamins mentioned this issue Aug 15, 2022
@bkamins
Copy link
Member Author

bkamins commented Oct 14, 2022

I have thought about it and would add scalar kwarg (we could discuss the name). It would accept a type and values of this type would be repeated as many times as needed (aka broadcasting). By default no type would be treated as a scalar. But for example if you pass scalar=Missing then missings would be expanded. If you pass scalar=Union{Missing, AbstractString} then missings or strings would be expanded.

What do you think?

CC @nalimilan

@nalimilan
Copy link
Member

I kind of regret we don't treat strings as scalars, but now that we do I agree that allowing to pass types to treat as scalars via an argument is a reasonable solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants