Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start an idea of what an "in memory" requirement would look like #278

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Mar 12, 2022

This has been requested/discussed a number of times; here's one idea of
what this could look like. Basically the same as Tables.rows, but
Tables.indexablerows would require an "indexable" object of rows to be
returned instead of just an iterator. Indexable is a little vague; to be
most useful, we should probably require the return object to be
AbstractVector since we get lots of fancy indexing/useful behavior
that way. The bare minimum indexing interface is just getindex,
firstindex, and lastindex, but it seems like people would then just
be wanting to do x[[i, j, k]] like operations and have to implement
their own. So I'm inclined to make the requirement that you have to
return an AbstractVector of rows.

This has been requested/discussed a number of times; here's one idea of
what this could look like. Basically the same as `Tables.rows`, but
`Tables.indexablerows` would require an "indexable" object of rows to be
returned instead of just an iterator. Indexable is a little vague; to be
most useful, we should probably require the return object to be
`AbstractVector` since we get lots of fancy indexing/useful behavior
that way. The bare minimum indexing interface is just `getindex`,
`firstindex`, and `lastindex`, but it seems like people would then just
be wanting to do `x[[i, j, k]]` like operations and have to implement
their own. So I'm inclined to make the requirement that you have to
return an `AbstractVector` of rows.
@quinnj
Copy link
Member Author

quinnj commented Mar 12, 2022

Couple of additional thoughts:

  • I've changed RowIterator in this PR to subtype AbstractVector{ColumnsRow{T}}, which means that all column tables would "just work" automatically for Tables.indexablerows as is defined in this PR
  • Currently, if the result of Tables.rows isn't an AbstractVector, then we throw an error in the fallback definition. I wonder if it would be more useful to just call Tables.rowtable instead? Like, I'm curious what potential users of Tables.indexablerows would do if the input ends up not being indexable; do they want to surface that error back to the user so throwing is fine? Or would they rather just "coerce" the input to an indexable rows object via Tables.rowtable.

@jariji
Copy link

jariji commented Mar 12, 2022

Collecting a rowtable from an iterator can be a costly operation and I'd want to know when I'm doing it, so I'd rather get an error.

@bkamins
Copy link
Member

bkamins commented Mar 12, 2022

As commented on Discourse I think we should require Tables.indexablerows to support AbstractVector API but it does not have to be AbstractVector. I would say in the docs that it is strongly recommended but not required (like with Tables.istable - it is recommended but not required).

What I mean is that if you subtype AbstractVector then it is easy to create such a type (there is a short and well defined list of methods you need to define). If you decide not to subtype AbstractVector then you need to implement many methods from Base and I think most of the package maintainers will have a very hard time to do it properly and keep consistency with Base Julia when it evolves.

I would also require firstindex to return 1 (i.e. I think it is safer to require 1-based indexing)

However, if @juliohm and @ablaom will find it acceptable to just require AbstractVector I think it would simplify things.

CC @nalimilan

@nalimilan
Copy link
Member

Thanks for working on this @quinnj! I agree with @bkamins that it doesn't seem necessary/useful to require returned objects to subtype AbstractVector, as you're unlikely to dispatch on that: implementing the interface is enough. This is more in line with the fact that the result of Tables.columns is not required to subtype AbstractColumns. And in the Discourse discussion, having to subtype AbstractVector was mentioned as problematic by @juliohm.

Currently, if the result of Tables.rows isn't an AbstractVector, then we throw an error in the fallback definition. I wonder if it would be more useful to just call Tables.rowtable instead? Like, I'm curious what potential users of Tables.indexablerows would do if the input ends up not being indexable; do they want to surface that error back to the user so throwing is fine? Or would they rather just "coerce" the input to an indexable rows object via Tables.rowtable.

Given that currently the Tables.columns fallback already makes a copy if necessary to return a suitable object, I'd say that for consistency indexablerows should do the same. But maybe a function to check whether a table is known to support indexing would also be useful, similar to rowaccess/columnaccess? Then if you don't want to make a copy you could check this trait first. It would be worth mentioning in the docstring whether types are expected to make a copy of the data if needed or whether they'd better error in that case, as this is indeed a hard decision to make, in particular for table types pointing to potentially large out of memory datasets.

@juliohm
Copy link

juliohm commented Mar 13, 2022

Thank you all for putting this proposal together. I would like to double check with you if I am understanding things correctly.

Consider the following abstract Meshes.Data type that I rely on in Meshes.jl to represent tables over geospatial domains:

https://github.com/JuliaGeometry/Meshes.jl/blob/master/src/traits/data.jl

This abstract type implements the Tables.jl interface and the main idea there is the following: if a user of the ecosystem has data that can be mapped to a geospatial domain like a mesh, image, geographic map, he/she can simply inherit from the abstract Meshes.Data type and implement two functions: domain and values. The parent type will have fallbacks for plot recipes, geospatial APIs, etc.

On the other hand, when I call Tables.rows(data::Data) I return a Meshes.DataRows object. From what I understood, the proposal consists of requesting Meshes.DataRows <: AbstractVector, is that correct? If that is the plan, I think it should be fine.

@bkamins
Copy link
Member

bkamins commented Mar 13, 2022

From what I understood, the proposal consists of requesting Meshes.DataRows <: AbstractVector, is that correct?

At least this was my original proposal. Your table does not have to be AbstractVector, but the "row view" of it should and can be AbstractVector as they most likely anyway need to be different types. Still you can make Meshes.DataRows to also be a Tables.jl table if you want (most likely you do want it).

@quinnj
Copy link
Member Author

quinnj commented Mar 13, 2022

implementing the interface is enough

@nalimilan, @bkamins, the problem IMO is that I think it's going to be too hard/complex to expect users to correctly implement all the possible indexing operations. Maybe I'm wrong here and it's not actually too hard, but from a brief glance, they'd have to implement getindex(x, ::AbstractVector{<:Integer}), bool vector indexing, etc. It feels like requiring the subtype of AbstractVector would ensure better consistency. But then again, maybe I'm over-worrying and most will subtype AbstractVector anyway and those who can't are savvy enough to correctly implement indexing.....

@quinnj
Copy link
Member Author

quinnj commented Mar 13, 2022

I guess part of my hesitation there is the "incompleteness" of the full indexing interface; I understand the "basic" indexing interface, but that only allows single x[i]. Like, what are the full list of indexing methods I get for free when I subtype AbstractVector; because that's the list someone will have to implement manually if they don't subtype AbstractVector and I think we at least owe them what that full list that we expect to be implemented and that users can rely on.

# fallback for indexablerows if not overloaded explicitly
function indexablerows(x::T) where {T}
y = rows(x)
if y isa AbstractArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why AbstractArray here not AbstractVector?

@bkamins
Copy link
Member

bkamins commented Mar 13, 2022

I think we at least owe them what that full list that we expect to be implemented and that users can rely on.

That was my point above. This is doable, but it is super hard. The users would need to additionally handle things like: view, broadcasting, similar, copy! (and related functions), keys, pairs, values, ...

In my experience if someone would implement a non-AbstractVector type that would want to mimick AbstractVector it is ~ 1 year of relatively active use of the type to catch all cases.

In summary - while we do not have to expect AbstractVector I propose to expect it for now (this was my original proposal on Discourse) and if we get reasonable requests to lift this requirement it would be a non-breaking change anyway.

BTW - we should implement a proper method for NamedTuple of vectors for this as this is the major use case that is currently problematic (preferably in this PR).

@nalimilan
Copy link
Member

Yes, clearly it's simpler for table implementations to subtype AbstractVector to benefit from all the fallbacks defined by Base. But This is probably what they will do anyway. We don't have to require it.

In summary - while we do not have to expect AbstractVector I propose to expect it for now (this was my original proposal on Discourse) and if we get reasonable requests to lift this requirement it would be a non-breaking change anyway.

@bkamins Actually I think that if we require an AbstractVector result now, lifting that requirement later would be breaking. It's the reverse move which wouldn't be breaking. If we require subtyping AbstractVector now, Tables.jl users will be able to rely on this type (e.g. for dispatch). Of course the breakage would be limited to new types that would not subtype AbstractVector, but still.

Anyway, since it seems that @juliohm would actually be OK with having Meshes.DataRows <: AbstractVector, maybe we don't really need this new API, and can instead tell users who want to index rows to check whether Tables.rows(table) returns an AbstractVector? This is what @bkamins originally proposed on Discourse.

@bkamins
Copy link
Member

bkamins commented Mar 13, 2022

This is what @bkamins originally proposed on Discourse.

This was my original proposal and I am convinced to it if we are sure we are OK with requiring AbstractVector. However, if we would consider lifting this requirement in the future (even if we think we want to require it now) we would need another function.

In summary I will re-state my question from Discourse:

Does anyone object against requiring indexable rows-type to be AbstractVector subtype?

My 2cents when having a distinction could make sense (but I am not even sure if currently such table type exists so maybe it is only a hypothetical scenario).

Assume you have a table type that can serve rows via an iterator very cheaply. It potentially also can serve rows via indexable object but with some more overhead (but less than collecting the rows iterator). Then the user could choose if one wants a faster rows or a bit slower (but still faster than collect) indexablerows.

@ablaom
Copy link
Contributor

ablaom commented Mar 13, 2022

we would need another function.

Could we add a Tables.nrows method which falls back to returning Inf or nothing, but returns the number of rows in the "row-indexable" case?

@bkamins
Copy link
Member

bkamins commented Mar 13, 2022

Could we add a Tables.nrows

There are already nrow and ncol functions in DataAPI.jl for this so I do not think we should add another function that would serve the same purpose. What we could do is improve their docstrings. I have opened JuliaData/DataAPI.jl#45 to discuss this.

Also note that what you ask for is already defined as a part of iteration protocol. If iterator does not have a defined size it should return IsInfinite() or SizeUnknown() when IteratorSize is called.

From my perspective the point of this discussion (and this was reflected in my original proposal on Discourse) is not to define new concepts/types/functions for things that already covered in the data ecosystem as if we do it then it becomes increasingly difficult to:

  • keep consistency between them (developer's perspective);
  • learn them (user's perspective).

What I would propose is to have the following priority:

  1. Base Julia (i.e. whatever Base Julia defines should not be duplicated by anything else);
  2. DataAPI.jl and StatsAPI.jl (for their respective domain of coverage);
  3. Tables.jl (for tabular data);

I think it would be great to add packages from MLJ.jl into this list (to be explicit at what level of priority they fall). @ablaom - would it be only MLUtils.jl or there are also additional interface packages we should add?

The point is that if e.g. we want to change something in Tables.jl the maintainers responsibility is to make sure we do not duplicate something that is already defined in Base Julia or DataAPI.jl or StatsAPI.jl (this is exactly the case of nrow which is defined in DataAPI.jl already) or Base Julia (IteratorSize function for iterators is already there).

When I said "we would need another function" I meant that even having nrow defined or IteratorSize to HasLength or HasShape does not guarantee that the object is indexable - it only guarantees that the object has a defined size. Note that e.g. some iterators, even if their size is known cannot support random access. Example:

x = (rand() for i in 1:10)

is an iterator, it has length (10), it has shape, but it does not (and cannot) support random access.

@nalimilan
Copy link
Member

This was my original proposal and I am convinced to it if we are sure we are OK with requiring AbstractVector. However, if we would consider lifting this requirement in the future (even if we think we want to require it now) we would need another function.

AFAICT we don't need to require the result of Tables.rows to be a subtype AbstractVector: tables can do that if they can return an object that supports indexing. Users can check whether that's the case, and if not call Tables.columntable or Tables.rowtable depending on which would be the most suited for the operations they want to perform.

@ablaom
Copy link
Contributor

ablaom commented Mar 13, 2022

@bkamins Yes, I agree the suggestion is a poor one after all.

would it be only MLUtils.jl or there are also additional interface packages we should add?

I do think it would be good to add MLUtils.jl to the mix. (It is not part of MLJ, however, and I have little connection to it.)

DataLoaders.jl, a popular tool for handling out-of-memory datasets in the deep learning community, is based on the MLUtils.jl data container API (previously in LearnBase.jl). I'm thinking that if we can get MLUtils to play nicely with tables then MLJ to could carry out observation resampling through that interface, rather than through a mixture of the Tables and AbstractArray interface, as currently. This would be more natural, allow us to jettison the MLJ row-indexing methods for tables, and be a way for some MLJ models to support training on data that does not fit into RAM.

@tpapp
Copy link
Contributor

tpapp commented Mar 14, 2022

if someone would implement a non-AbstractVector type that would want to mimick AbstractVector it is ~ 1 year of relatively active use of the type to catch all cases.

A test suite (living in a dedicated package) could help with this. (This is just a general point, somewhat orthogonal to this discussion though).

Does anyone object against requiring indexable rows-type to be AbstractVector subtype?

It all depends on what it would be used for. If the users would expect to use it as vectors, then it is the right choice IMO.

@bkamins
Copy link
Member

bkamins commented Mar 20, 2022

@quinnj + @nalimilan I have chatted with @ablaom about MLJ.jl requirements for Tables.jl interface. Here is a summary of what I think is crucial.

@quinnj + @nalimilan => what do you think would be the best approach to support it (given the discussions we already had + maybe some additional new ideas are needed?). Thank you!

Preliminary

  • The solution should be generic and work on any table, no matter if it fits into memory or not.
  • It might be the case that some tables cannot support all the required API; then there should be a way to perform a conversion to another table type that would work (this potentially might lead to e.g. out of memory error, but I think it is acceptable as then user will see where the problem is); still probably some slow but safe fallbacks could be provided by custom out-of-core table types.

Use case 1

A typical case in ML is doing Cross Validation
Here it should be possible to get a CV fold that should preserve the access type (column/row) of the original table. Ideally this operation should be non-copying, but it is possible that in some cases it must copy, so this would be acceptable.

Use case 2

Accessing rows of table in a random order (e.g. when processing data in randomly selected batches).
This essentially means that MLJ.jl is able to implement the https://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html#data-container API for any Tables.jl table. Essentially this requires that nobs and getobs (for a single and multiple indices) can be efficiently implemented.


"Use case 1" is a special case of "use case 2" but maybe it could support a more set of more efficient methods.
"Use case 2" might be possible only for in-memory tables, but even out-of-core tables might try to support it (or at least allow for some conversion after which it would be supported, see Preliminaries).

@nalimilan
Copy link
Member

Thanks for the detailed summary. So it looks like the interface defined here would satisfy these use cases. What's missing is that the fallback implementation should be changed a bit and the docstring be more explicit about some guarantees:

  • getindex should be implemented not only to access a single row, but also to take a subset of rows, in which case Tables.materializer should be used to return a table of the same type as the input.
  • This means that Tables.indexablerows cannot fall back to returning the result of Tables.rows, as it's currently not guaranteed to return a table of the same type when taking a subset (though we could change this requirement in the docs).
  • Instead of throwing an error when Tables.rows does not return an AbstractVector, the fallback could call Tables.columntable or Tables.rowtable (depending on the type of table) to ensure that it always works, even using lots of memory. I'm not really sure it's a good idea though: it may be better to have it throw an error, but provide a Tables.isrowindexable trait that you could check before calling Tables.rowindexable. Then it's easy to call Tables.columntable manually if you really want to proceed.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2022

not guaranteed to return a table of the same type when taking a subset

I think it is enough that the same "mode" of the table is returned, i.e. column or row oriented.

In particular maybe we should also explicitly document that view should be supported (to avoid copying data if user prefers this, which I think will be the case most of the time).


@ablaom - can you please confirm that what @nalimilan described + my comment would fulfill your needs on high-level (then we should finalize the design in detail and submit it to another check with you).

@ablaom
Copy link
Contributor

ablaom commented Mar 23, 2022

Thanks @bkamins and @nalimilan .

Yes, I think what is described would fit my own needs.

Tables.indexablerows on multiple-index returns a table with the same row/column access (Use Case 1). In Use Case 2, I simply call Tables.rows first, to switch the access type, yes?

I agree that returning a table of the same type doesn't need to be a hard requirement. For a start, this seems inconsistent with @bkamins sensible suggestion that views are preferred, and views generally don't have the same type as the original table.

it may be better to have it throw an error, but provide a Tables.isrowindexable trait that you could check before calling Tables.rowindexable. Then it's easy to call Tables.columntable manually if you really want to proceed.

This is probably a question for the designers. But I'm curious what precisely isrowsindexable(X) is going to guarantee. I mean, it can never be a solid guarantee that I won't run out of memory, because that is not absolutely knowable. But what other kind of error could I expect?

@nalimilan
Copy link
Member

Tables.indexablerows on multiple-index returns a table with the same row/column access (Use Case 1). In Use Case 2, I simply call Tables.rows first, to switch the access type, yes?

Well, no, you'd call Tables.indexablerows in both cases. The point of adding this API is that Tables.rows is (currently) not guaranteed to be indexable.

This is probably a question for the designers. But I'm curious what precisely isrowsindexable(X) is going to guarantee. I mean, it can never be a solid guarantee that I won't run out of memory, because that is not absolutely knowable. But what other kind of error could I expect?

Here's one. :-) https://github.com/JuliaData/Tables.jl/pull/278/files#diff-12abce6071269cefb726c033c3832f2e0326e3a2fe769e67323a1b5ac0772483R111

@ablaom
Copy link
Contributor

ablaom commented Mar 24, 2022

Well, no, you'd call Tables.indexablerows in both cases. The point of adding this API is that Tables.rows is (currently) not guaranteed to be indexable.

Perhaps there is a misunderstanding. I mean, if I'm in use case 2 and I want fast row indexing, then I calll Tables.indexablerows(Tables.rows(X)). In case 1, I just call Tables.indexablerows(X) because I want to preserve row/column mode of the result in that case.

Here's one. :-) https://github.com/JuliaData/Tables.jl/pull/278/files#diff-12abce6071269cefb726c033c3832f2e0326e3a2fe769e67323a1b5ac0772483R111

Ha ha. You are just pointing to an error thrown by the current proposal. I mean, if instead of throwing an error, the fallback just calls collect on the result of Tables.rows (when that result is not AbstractVector) what error (apart from memory overflow) could I expect? If only that kind of error is possible, then why try to catch it? Unless it's possible that no error is thrown but unexpected behaviour could result. Or perhaps, the point is the an out-of-memory error might eventually be thrown, but there could be a long delay one would prefer to avoid? I guess that makes sense. Maybe issue a warning instead??

I'm just thinking aloud here and of course happy for you to settle this detail.

@nalimilan
Copy link
Member

Perhaps there is a misunderstanding. I mean, if I'm in use case 2 and I want fast row indexing, then I calll Tables.indexablerows(Tables.rows(X)). In case 1, I just call Tables.indexablerows(X) because I want to preserve row/column mode of the result in that case.

I see. But Tables.rows isn't supposed to return an object for which row indexing will be more efficient. All it guarantees is that it supports the AbstractRows API. For example, with a DataFrame object, it Tables.rows returns a DataFrameRows which is just wrapper around the DataFrame, thus providing relatively efficient row iteration, but not faster than using eachrow directly on the DataFrame.

If we want to support different performance variants, maybe that should be done via an argument to Tables.indexablerows which would allow specifying whether you want the fastest possible object, even if that implies copying or recompiling some functions to specialize on column types. For example for a DataFrame that would involve returning a type-stable NamedTuple of vectors, without any copy but with specialization (and thus some compilation overhead). Is that the kind of performance you're looking for, or would the performance of DataFrame indexing be enough for you?

Ha ha. You are just pointing to an error thrown by the current proposal. I mean, if instead of throwing an error, the fallback just calls collect on the result of Tables.rows (when that result is not AbstractVector) what error (apart from memory overflow) could I expect? If only that kind of error is possible, then why try to catch it? Unless it's possible that no error is thrown but unexpected behaviour could result. Or perhaps, the point is the an out-of-memory error might eventually be thrown, but there could be a long delay one would prefer to avoid? I guess that makes sense. Maybe issue a warning instead??

My point was that calling collect(Tables.rows(t)) (or Tables.rowtable(t) or Tables.columntable(t), which will probably give a more efficient result) is really easy to do if you know you really want it. Going out of memory can sometimes kill Julia or other processes (or make the system hand as you say). But I realize that Tables.columns(t) will already make a copy when passed a row-oriented table so I guess having Tables.rowindexable would be consistent with that. Let's hear what @quinnj thinks since he wrote that line throwing an error.

@ablaom
Copy link
Contributor

ablaom commented Mar 25, 2022

For example, with a DataFrame object, it Tables.rows returns a DataFrameRows which is just wrapper around the DataFrame, thus providing relatively efficient row iteration, but not faster than using eachrow directly on the DataFrame.

Sorry, I don't understand:

julia> eachrow(df) === Tables.rows(df)
true

Maybe you mean copy.(eachrow(df)) is faster than Tables.rows(df)?

But, yes, an option to get the fastest row-indexable object would be nice. I couldn't say if it is necessary without doing a lot of benchmarking. Is this the sort of thing you'd like to see before we proceed?

@nalimilan
Copy link
Member

I just said "not faster", because eachrow(df) isn't super fast either if you want to access rows many times (compared with a named tuple of columns in a function specialized on its type). I don't think benchmarking is needed, I just wonder what are your general expectations. For example, what are you doing currently in MLJ? Do you simply index rows of a DataFrame?

@bkamins
Copy link
Member

bkamins commented Mar 25, 2022

As a side comment: eachrow(df) (and currently Tables.rows which calls it) is slow on purpose (to avoid compilation of very wide tables), and if you want type-stable version Tables.rowtable should be used.

In general Tables.rowtable and Tables.columntable respectively, if I understand the design intentions correctly correctly, are added exactly because if someone has a generic "table" and wants to turn it into in-memory table that is fast (i.e. type stable) and is row- or column- oriented respectively then you have a consistent way to achieve this.

@ParadaCarleton
Copy link

Could we get something like this merged before the start of GSoC projects? I know some students are interested in building a panel data interface, and I think that gets significantly easier if you can do basic Split/Apply/Combine methods on Tables.

@quinnj
Copy link
Member Author

quinnj commented Apr 26, 2022

Ok, jumping back into things here; sorry for the delay; personal life has been a bit crazy lately.

  • getindex should be implemented not only to access a single row, but also to take a subset of rows, in which case Tables.materializer should be used to return a table of the same type as the input.

Hmmm, I didn't come to this conclusion reading @bkamins summary here. To me, it sounds like it's just desirable to have a Tables.rows-like thing (i.e. Tables.indexablerows) that's as efficient as possible, but not necessarily the same type as the original input. I actually think it would be fairly hard/rare for table types to support getindex w/ multiple indices AND return the same type as the original. i.e. that's excluding view from working on a vector of namedtuples.

This is yet another reason why requiring the result of Tables.indexablerows to subtype AbstractVector, since we get these kind of efficient "views" for free and they're efficient. As defined in this PR, we also get this "for free" for any column-oriented table source, since RowIterator can easily subtype AbstractVector and views on it will be efficient/non-copying.

That said, I don't feel strongly about requiring the subtyping, so I'm fine leaving that door open for now, though strongly recommending it and noting the minimum required methods to be implemented (which would be more than the currently defined Indexable interface in Base, which only requires single-int getindex).

The final point I'm still considering is whether we throw an error by default, or call indexablerows(columntable(x)) in the fallback case if the result of rows(x) isn't a subtype of AbstractVector. I'm leaning towards not throwing an error, since it would be slightly more convenient for users, and we can try to be clear of the ramifications of that in the docs. In particular, the use of Tables.indexablerows should follow a flow of:

  1. call irows = Tables.indexablerows(x) on the input x
  2. do things with irows repeatedly; getindex, etc.

What should be avoided, is repeatedly calling Tables.indexablerows(x) over and over, since it can potentially be a costly conversion. Just call it once, then re-use that indexable rows object for subsequent operations. @ablaom, it seems like that would work well in your case? We could rely on the generally efficient columntable(x) call to convert table inputs who don't natively support Tables.indexablerows and then you would use the result of indexablerows to do repeated partitioning/CV worklfows.

If that all sounds good, I'll clean this PR up: add some tests, lots of doc details, and we can merge in the next few days.

@nalimilan
Copy link
Member

  • getindex should be implemented not only to access a single row, but also to take a subset of rows, in which case Tables.materializer should be used to return a table of the same type as the input.

Hmmm, I didn't come to this conclusion reading @bkamins summary here. To me, it sounds like it's just desirable to have a Tables.rows-like thing (i.e. Tables.indexablerows) that's as efficient as possible, but not necessarily the same type as the original input.

IIRC I derived this part from the sentence "Here it should be possible to get a CV fold that should preserve the access type (column/row) of the original table." But I'm not sure why MLJ needs this. At least it seems that the exact type could be different as long as orientation is preserved.

@bkamins
Copy link
Member

bkamins commented Apr 26, 2022

What I think is needed is if you select multiple rows is that efficient access style (rows or columns) does not change. I think that requiring type not to change would be too strict.

The point is that if e.g. the source table had fast column access then after getindex on multiple rows:

  1. the resulting object should still be a Tables.jl table
  2. the resulting object should have fast column access

(and the same for rows)

@quinnj
Copy link
Member Author

quinnj commented Apr 27, 2022

2. the resulting object should have fast column access

Hmmm, this seems a little paradoxical; i.e. you're calling Tables.indexablerows on an input, which returns an indexable object of rows, so it necessarily can't provide column access since it's just rows. For example, CSV.File is column-oriented, but would rely on the fallback in this PR which returns Tables.RowIterator, which is an indexable object of rows. But utlimately the original column-access is preserved, since you could do a view of a Tables.RowIterator, which is a lazy "view" itself into the original column arrays. So I think this satisfies what you're mentioning here, but just wanted to make sure.

@bkamins
Copy link
Member

bkamins commented Apr 27, 2022

So I think this satisfies what you're mentioning here, but just wanted to make sure.

@ablaom - can you confirm?

To my understanding the requirement is that if the original object allowed you to do a fast extraction of a column of a table then the subsetted object should retain this property (it does not have to be the same type).

@ablaom
Copy link
Contributor

ablaom commented Apr 28, 2022

To my understanding the requirement is that if the original object allowed you to do a fast extraction of a column of a table then the subsetted object should retain this property (it does not have to be the same type).

Yes, this is what I have been looking for, to address use-case 1. However, I am realising a little late that the "fast extraction of columns" property is not the same thing as "columnaccess(X) == true" (and similarly for "rowaccess"). And my confusion extends to my conceptualisation of the ML workflows which are motivating my request. 😢 . Over the next couple of days I will rethink this and hopefully clarify our needs better.

@ablaom
Copy link
Contributor

ablaom commented May 1, 2022

Thanks to all for your continued patience, and for continuing to check in with my requirements.

After reflecting on my confusion, I should like to tweak my requirements for Use Case 1. When I subsample the rows of X to obtain table Xsub then I want:

A. Xsub has the same type as X. If this is not possible, return a julia native table with fast column access if rowaccesss(X)==true; otherwise return a julia native table with fast row access. (I suppose these are columntable and rowtable(X) respectively.)

Here's is more on the rational for requirement A.

When a data scientist chooses a table type, she chooses it for a reason. That is, the table type reflects some desired characteristics such as:

  • well-suited to a model to be trained (our Use Case 1), for example:
    • need fast column access
    • need specific table type, such as TimeArray for timeseries
  • familiar API
  • how the table is displayed

Very often you don't want observation subsampling - a very basic operation in ML - to break those characteristics. That is, preserving table type takes precedence over cost of the operation (which, compared to training models, is not usually an issue). Note that in MLJ, cross-validation is automated. So, at present there is no interface for "correcting" the type of a table whose model-specific requirements are being broken by row subsampling.

In TableTransforms.jl (which is more about columns than rows) the design choice is to always return the output using materializer (to get the original type) and this seems to work well, and is what most users expect, I am sure.

Generally views of a table change the type (and hence properties of the table we might want to preserve). However, I think it is useful to expose them and suggest we add an option to specify view_okay with this behaviour:

B. (view_okay==true) If a view is available for tables of type X, then Xsub should be such a view. If a view is not available, then fallback to case A.

I might be wrong, but it seems to the current PR is not able to address Use Case 1, because it does not sample rows directly but creates a new representation for which row subsampling has certain properties. So I see this PR as about meeting Use Case 2 (and Use Case 1 is not just a special case of Use Case 2). What is wrong with adding a selectrows(X, I, views_okay=false) method that a new table type can choose to implement (I a vector of integers)? This seems natural and conceptually simple to me.

As I say, the current PR seems more about Use Case 2. For that I'm not sure I understand the need for extension of the Tables API. As @bkamins has suggested earlier, I can test if Tables.rows(X) is an AbstractArray or not, and if not: collect, coerce to new table type, or throw an error, according to what I know about the table's "finiteness"" (see below).

Number of rows. In both use cases, one needs to know the number of rows before executing the subsampling. Can we have some way for the user to determine table "finiteness"? There seem to be three cases:

  • One can make no guarantee that the table can be materialised in memory.
  • The table is already in-memory, I can call nrow on it, but doing so may be expensive.
  • The table is already in-memory, I can call nrow, and that operation is cheap.

@quinnj
Copy link
Member Author

quinnj commented May 4, 2022

A. Xsub has the same type as X. If this is not possible, return a julia native table with fast column access if rowaccesss(X)==true; otherwise return a julia native table with fast row access. (I suppose these are columntable and rowtable(X) respectively.)

Couple comments:

  • Xsub has the same type as X: as I personally consider the various implementations I'm aware of in the Tables.jl ecosystem, I actually can't come up with a scenario where you would be able to do this efficiently. In every case, to efficiently do some kind of subset, you would want some kind of "view" object into the underlying table source. This leads me to believe that we shouldn't really reach for this as a hard requirement. Or at least realistically realize that it would be the rare case where a type could be efficiently subset to return the same type as original.
  • "fast column access if rowaccess(X) == true": do you mean "columnaccess(X)==true"? Otherwise, I'm not really following the logic here between switching the natural orientation.

Based on additional comments, I'm wondering if we want a different kind of interface all together here. Let me stream-of-conscious talk out loud here and see if it ends up making sense.

The "problem" w/ the current PR is that in the case of columnaccess(X) of the original table source, we end up with Tables.RowIterator after calling Tables.indexablerows(x), which loses the columnaccess property. Which, in turn, I think motivates the comment that we'd like the subsetted "thing" to be of the same type as the original. As I mentioned above, I just really don't think that's realistic, at least if you want things to be efficient, so let's consider an alternative:

Tables.subset(x, I), where x is any original table source, and I..... I guess could just be a Vector{Int} indicating the indices to subset from the original. For row-oriented sources, it would be something similar to this PR, and maybe we'd just define a default definition like:

function subset(x, I::Vector{Int})
    if columnaccess(x)
        return ColumnSubset(x, I)
    else
        return x[I]
    end
end

So for rows, we'd fallback to getindex(x, I), which would work if the source is <: AbstractArray. I guess we could also do the check if <: AbstractVector and if so, call getindex, otherwise, call rowtable(x)[I].

For column sources, we'd have a helper object defined like:

struct ColumnSubset{X}
    x::X
    inds::Vector{Int}
end

Tables.istable(::Type{<:ColumnSubset}) = true
Tables.columnaccess(::Type{<:ColumnSubset}) = true
Tables.columnnames(x::ColumnSubset) = columnnames(x.x)
Tables.getcolumn(x::ColumnSubset, i) = getcolumn(x.x, i)[x.inds]

So here we see that for any column oriented source,Tables.subset would return a ColumnSubset object, which lazily applies the subset indices when getcolumn is called. Now, in terms of interface integrity, this would rely on the assumption that individual column objects support getindex(::T, ::Vector{Int}). This hasn't quite been a formal requirement, but we do mention that column objects need to be "indexable", but I think we only meant at least supporting getindex(::T, i::Int) and length(x). Now, in practice, I'm not aware of any column sources that don't return columns that are <: AbstractVector, so I think this approach would end up working pretty well in practice.

Let me know what you all think and if I've missed anything obvious (probably have, it's a bit late here for me and as I mentioned, this was a live, coding "writeup" as I thought through this).

@ToucheSir
Copy link

In the spirit of trying to dedup our work across the data ecosystem, subset and its return type(s) are very similar to getobs and ObsView in MLUtils. Perhaps this functionality could be pulled even further up the stack into a library like DataAPI?

@ablaom
Copy link
Contributor

ablaom commented May 4, 2022

Thanks @quinnj

I actually can't come up with a scenario where you would be able to do this efficiently.

As I detail in my previous comment, I care about more preserving the properties of the original table than about efficiency of the subsetting operation (in Use Case 1). My performance bottlenecks are downstream of the subsetting.

If others view my requirements as too niche, then a table with the same access type is preferred. Ideally there would be an option to ensure this access (column or row) is fast (with a copy if necessary).

do you mean "columnaccess(X)==true"?

Yes, my bad 🙄

@bkamins
Copy link
Member

bkamins commented May 5, 2022

If others view my requirements as too niche, then a table with the same access type is preferred.

I feel that:

  1. this should be something that Tables.jl tries to preserve as a minimum.
  2. if some specific type wants to preserve "type" (not only access type) then it should add methods as needed to ensure this.

@aplavin
Copy link

aplavin commented May 5, 2022

Xsub has the same type as X: as I personally consider the various implementations I'm aware of in the Tables.jl ecosystem, I actually can't come up with a scenario where you would be able to do this efficiently. In every case, to efficiently do some kind of subset, you would want some kind of "view" object into the underlying table source. This leads me to believe that we shouldn't really reach for this as a hard requirement. Or at least realistically realize that it would be the rare case where a type could be efficiently subset to return the same type as original.

If "the same type as original" requirement is understood in a somewhat relaxed way, then this is already fullfilled by lots of in-memory Tables implementations, even with the same interface. For example, view(tbl, indices) on a Vector-of-NamedTuples doesn't return a Vector - but a SubArray, with exactly the same interface as the original. Same with other Tables, such as StructArray or TypedTables: they all use view() in the same sense, and the result has almost the same type as the original Table.

Maybe, other Tables implementations can also support view() with the same meaning? Are there any table types where view(tbl, indices) works and means something different from subsetting? The one case where it's clearly not possible to support this interface is NamedTuple-of-Vectors, because the type is defined in Base.

@quinnj
Copy link
Member Author

quinnj commented May 6, 2022

Ok, so correct me if I'm wrong, but it sounds like my latest proposal hits the right points that have been brought up; namely, the plan is:

  • A new Tables.jl utility function Tables.subset that allows returning a subset of a Tables.jl-compatible source
  • Tables.jl sources are free to overload this to provide a custom subset implementation
  • A generic fallback would work for AbstractVector row sources and column-oriented sources

So a source could overload to provide the same type in the subset operation, but the generic fallback would at least preserve access type.

My one qualm/question is what to do about the 2nd argument for Tables.subset(x, I). We could keep it simple and restricted to start and just require Vector{Int}, but I believe AbstractVector also supports a few other forms, like AbstractVector{Bool}, AbstractRange, etc. We could probably even support having a f(row) -> Bool functor as well.

I guess I'm mainly wondering if there are enough variations there that we're not comfortable with making the API complex for implementations that we just keep it simple for now. @bkamins, what you do think on this point?

@bkamins
Copy link
Member

bkamins commented May 6, 2022

@bkamins, what you do think on this point?

You have hit an excellent spot here as I have spent much time thinking about it when designing DataFrames.jl 😄.

This is what I think:

  1. define a Tables.filter (that does not extend Base.filter) that takes a predicate as a first argument assuming it is passed a single row of observations. Essentially it would be the same as Base.filter, but it would be non-exported and separate function to avoid potential method conflicts (and many tables would just use definition like Tables.filter(f, x::AbstractDataFrame) = filter(f, x) - this is what we would do in DataFrames.jl); optionally Tables.filter could allow view::Bool kwarg allowing to decide if user wants a view or a copy
  2. define indexing syntax. The point is that we need to distinguish passing a single Int (returning a row) and a collection like Vector{Int} and all other variants (returning a table). We have two ways to go here:
    a. Base Julia way, by defining Tables.length, Tables.getindex and Tables.view (again new functions just with the same names as corresponding functions in Base Julia)
    b. MLJ.jl way, by defining numobs, getobs and obsview (then probably https://juliaml.github.io/MLUtils.jl/dev/ should import these methods from Tables.jl as a more fundamental package)

Regarding filter I am pretty convinced this should work the way I described. Regarding indexing - we need to discuss if we prefer Base Julia or MLJ.jl naming convention. @ablaom - what do you think would be a better choice here?
My intuition is that using Base Julia way is a bit better as many users of Tables.jl will not do ML but just data manipulation, so it is more natural to use Base Julia conventions. The point of doing separate length, getindex and view functions in Tables.jl is then as for filter - that these methods might already be defined for the table type in other way for Base Julia functions, but using Tables.* prefix we signal that we want to look at these objects as a collection of rows (no matter how these objects view themselves outside of Tables.jl, e.g. DataFrame is two-dimensional so it does not define length and getindex and view take two arguments, while it would define Tables.length(df::AbstractDataFrame) = nrow(df), Tables.getindex(df::AbstractDataFrame, idx) = df[idx, :] and Tables.view(df::AbstractDataFrame, idx) = view(df, idx, :) so this is super simple to add).

@ablaom
Copy link
Contributor

ablaom commented May 8, 2022

b. MLJ.jl way, by defining numobs, getobs and obsview (then probably https://juliaml.github.io/MLUtils.jl/dev/ should import these methods from Tables.jl as a more fundamental package)

Just to clarify, these names come from MLUtils.jl (originally LearnBase.jl) which presently has nothing at all to do with MLJ. (MLJ does have a method called selectrows(X, I) which was originally intended as a private implementation of what Tables.jl is currently lacking. However, it has been a challenge to avoid using it "publicly". )

Regarding indexing - we need to discuss if we prefer Base Julia or MLJ.jl naming convention. @ablaom - what do you think would be a better choice here?

Mimicking the Base terminology sounds like a good idea, for the reasons you state. Lot easier for me to remember too.

@CarloLucibello
Copy link
Contributor

CarloLucibello commented May 17, 2022

@bkamins' proposal of a Tables.length, Tables.getindex, Tables.view seems fine. It could also be Tables.numrow, Tables.getrow, Tables.rowview if we want to be more explicit.

@ParadaCarleton
Copy link

I think numrow or nrow (to align with R) would be better. It might be a problem to have length(Table) behave differently from length(array) (since the latter flattens the array).

@ablaom
Copy link
Contributor

ablaom commented May 18, 2022

length(Table) behave differently from length(array) (since the latter flattens the array).

It's a fair point (NamedArrays come to mind) but I personally think this is okay because the proposal is to implement Tables.length, not overload Base.length.

@nalimilan
Copy link
Member

nalimilan commented May 20, 2022

Maybe it would be better to use explicit names rather than mimicking the Base terminology with Tables.length, Tables.getindex, Tables.view. Indeed, implicitly assuming that Tables.jl objects are collections of rows could be slightly confusing, e.g. one may misread Tables.getindex(Tables.columns(t), 1:3)) as taking a subset of columns. The situation with row- vs. column-oriented tables and the related API is already relatively complex to grasp for users, so being explicit may be better.

Note that we already have Tables.rowcount so we could probably reuse that to get the number of rows; DataFrames.jl uses nrow and it's defined in DataAPI so we could also use that. To take a subset of rows, I'd tend to prefer to avoid the name "select" as it's used for columns in SQL and DataFrames.jl. We need something that allows distinguishing copies and views, like Tables.getrows and Tables.viewrows (Tables.rowview looks inconsistent as it reverses the order compared to getrows).

@bkamins
Copy link
Member

bkamins commented May 20, 2022

After some discussion with @nalimilan I propose the following:

  • define a single function Tables.getrows(t, inds; view::Union{Bool,Nothing}=nothing) that would
    • if inds is a single integer return a row and if inds is a collection return a table
    • if view=nothing (the default) then the table t is free to decide whether to return a copy or a view based on what is more efficient; if view=true then view is returned and if view=false copy is returned; this applies both to returning a row or a table; I think it would be enough to request that default view=nothing must be supported and view=true or view=false can be an opt-in (i.e. implementations might error on them if they are not supported as sometimes this could be the case); the point is that in most use-cases the source table is usually not mutated after running getrows so users probably are interested in "most efficient" output (that is why this option must be supported and is default)
  • use DataAPI.nrow as this is a defined interface for getting number of rows of a table (Tables.rowcount is an internal undocumented function of Tables.jl); btw. since DataAPI.jl also defines ncol as number of columns of a table maybe we could also agree to use it (this is not related to this discussion but for consistency maybe we should settle this)

@bkamins
Copy link
Member

bkamins commented May 23, 2022

@quinnj - would you have time to have a look at this PR and finalize it (we can then follow it up in DataFrames.jl 1.4 release that we plan to have soon to have packages in sync). Thank you!

@CarloLucibello
Copy link
Contributor

Can we just start by defining the function getrows as proposed in #278 (comment) and with no fallback methods, so that downstream packages can start implementing their own? I'm not familiar with the internals of Tables.jl but I can file a PR if needed

@CarloLucibello CarloLucibello mentioned this pull request Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet