Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tables that are tuples #276

Open
ablaom opened this issue Mar 3, 2022 · 10 comments
Open

Tables that are tuples #276

ablaom opened this issue Mar 3, 2022 · 10 comments

Comments

@ablaom
Copy link
Contributor

ablaom commented Mar 3, 2022

I have encountered two thorny issues which have arisen because Tables.jl natively implements its interface for certain kinds of tuples. The conflicts arising are not quite the same, which suggests that even if workarounds are found, the underlying problem may reappear elsewhere, so I thought it worth inviting comment here. One issue is in a package I maintain but the other is not. The issues are here and here.

In Tables.jl the following is both a table and a tuple of tables:

X = ((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

This leads to ambiguity in the issues mentioned.

@darsnack @OkonSamuel

@bkamins
Copy link
Member

bkamins commented Mar 4, 2022

How do you propose to resolve it? In general, under current rules both are valid. Some data structure, conceptually, can be both a table and a collection of tables. What I mean is that you can construct an object that declares itself to be a table, but at the same time it can store tables inside - there is no contradiction.

In general e.g. a vector of NamedTuple is a table, not only a Tuple of them:

julia> X = [(a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7])]
2-element Vector{NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}:
 (a = [1, 3], b = [2, 3])
 (a = [2, 5], b = [4, 7])

julia> Tables.istable(X)
true

In this particular case the root cause of the problem is this definition I think:

function isiterabletable(x::T) where {T}
    isiterable(x) || return false

    if Base.IteratorEltype(x)==Base.HasEltype()
        et = Base.eltype(x)
        if et === Union{}
            return false
        elseif et <: NamedTuple
            return true
        elseif et===Any
            return missing
        else
            return false
        end
    else
        return missing
    end
end

which special cases NamedTuple eltype.

So, essentially - what you propose is to make Tables.istable return false if it is a vector/tuple of NamedTuple?

One more comment about Tables.istable:

Note that not all valid tables will return true, since
it's possible to satisfy the Tables.jl interface at "run-time",
e.g. a Generator of NamedTuples iterates NamedTuples,
which satisfies the AbstractRow interface, but there's no static way of
knowing that the generator is a table.

That is. Even if Tables.istable returns false it does not say that the object is not a table. The contract is one-way only. If true returned then it is a table; if false is returned then it does not tell you anything.

@ablaom
Copy link
Contributor Author

ablaom commented Mar 6, 2022

@bkamins Thanks indeed for your thoughts.

I didn't quite understand this comment:

In general e.g. a vector of NamedTuple is a table, not only a Tuple of them:

The object you mention is not a tuple. And this example does not cause problems in the cases I mentioned.

I don't have a concrete proposal, unless declaring all unnamed tuples to be non-tables is realistic option.

In the first issue, the case is made that tuples in Julia have a very special meaning, as they are the grouping type for the return value of "multi-valued" functions. There was a feeling that implementing Tables for Base tuple types is therefore a kind of piracy, I guess. They are chaining multiple transformers and want the option of processing multiple data containers at once. Here "multiple" is detected using tuple. An alternative is the use of varargs, but this breaks the nice (and natural) pipeline syntax they are using.

It would be nice to know if there are (unnamed) tuples, that are also tables, in common use.

@bkamins
Copy link
Member

bkamins commented Mar 7, 2022

The object you mention is not a tuple.

That is a point of my comment. Tuple is not treated in any special way by Tables.jl. It is a collection, just like a vector is a collection. What Tables.jl does is checking eltype of that collection. If eltype is a NamedTuple then the collection is a table.

Examples:

> t1 = ((a=1,b=2),)
((a = 1, b = 2),)

julia> eltype(t1)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}

julia> Tables.istable(t1)
true

julia> t2 = [(a=1,b=2)]
1-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)

julia> eltype(t2)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}

julia> Tables.istable(t2)
true

In summary - type of the container currently does not matter for Tables.jl. Only the type of elements stored in it matter.


Given your comments maybe you want to propose another rule:

Never treat Tuple to be a table and make Tables.istable return false for Tuple no matter what they store.

This could be acceptable, as it is not very likely that someone stores tables in tuples.
It would be mildly breaking, but maybe @quinnj could accept it.

Is introduction of such a rule something you would propose then?

@ablaom
Copy link
Contributor Author

ablaom commented Mar 7, 2022

@bkamins Thanks for clarifying the comment!

Given your comments maybe you want to propose another rule:

Never treat Tuple to be a table and make Tables.istable return false for Tuple no matter what they store.

I believe this would indeed resolve both the issues mentioned. Thanks for considering it!

@darsnack Do you agree this is a worthwhile proposal from the point-of-view of JuliaML/MLUtils.jl#61 ?

@nalimilan
Copy link
Member

It seems problematic to decide that tuples should be special-cased and never be considered as tables. What would be the rationale for this in Tables.jl? AFAICT this is a problem at JuliaAI/ScientificTypes.jl#182 because ScientificTypes treats tuples in a special way, but another package could have a similar request about vectors (and indeed it's not uncommon to have ambiguities between tables and other types such as vectors). Why not just add that restriction in ScientificTypes only?

@ablaom
Copy link
Contributor Author

ablaom commented Mar 13, 2022

Why not just add that restriction in ScientificTypes only?

@nalimilan Thanks for looking into this.

What would be the rationale for this in Tables.jl?

The rational is I don't see how either issue can be resolved outside of a Tables.jl restriction, as unnatural as that might seem from the point-of-view of the package considered in isolation.

How do you mean "add restriction" in ScientificTypes, exactly? We can add something to documentation, but any user trying to to get ScientficTypes to play with tuple-tables will get unexpected behaviour. In ScientificTypes a basic and very natural requirement is that the scitype of a tuple is the Tuple of the scitypes. I don't see how we can also require scitype(X) = Table{..} when isstable(X)==true if X might also be a tuple without problems.

What would be your proposal to resolve the issue in MLUtils.jl?

@nalimilan
Copy link
Member

I just mean that MLUtils could disallow using tuples as tables, i.e. throwing an error when getting one or treating them as non-table objects. This is basically the same thing as you propose doing in Tables.jl, but rather than disallowing this everywhere, only do so in MLUtils (where they are problematic).

@ablaom
Copy link
Contributor Author

ablaom commented Mar 15, 2022

@nalimilan Thanks. I will suggest, then, that those two packages treat all tuples as non-tabular objects, and document the fact.

@bkamins
Copy link
Member

bkamins commented Mar 15, 2022

@ablaom - I think (I am not sure - just guessing) that what @nalimilan wants to say is that in some contexts Tuples as tables can make a lot of sense. For example currently the following tuple is a table:

julia> ((a=1,b=2), (a=3, b=4))
((a = 1, b = 2), (a = 3, b = 4))

Then if you have such a table and want to perform some manipulations on it the output is also Tuple of NamedTuple. The point is that such operations are non-allocating. Which means that if user has millions of small tables that need to be processed this is the best way such operation can be expressed as:

  1. we avoid malloc
  2. we avoid GC (which in such contexts often kills performance)

Of course this is probably not a realistic scenario for ML applications but it is plausible for some transactional system that needs to response to queries in near real time (and wants to e.g. avoid GC hiccups).

Having said that, maybe there is a work-around for this use case that does not require Tuples to get the same effect.

@bkamins
Copy link
Member

bkamins commented Mar 15, 2022

@quinnj - how do you see this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants