Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add getrows #284

Merged
merged 6 commits into from
Aug 30, 2022
Merged

add getrows #284

merged 6 commits into from
Aug 30, 2022

Conversation

CarloLucibello
Copy link
Contributor

@CarloLucibello CarloLucibello commented Jun 28, 2022

src/Tables.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Jun 28, 2022

This PR should probably provide appropriate implementations for rowtable, columntable, dictrowtable, dictcolumntable, table as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implement getrows for their own types).

Also probably it would be good to bump version number so that it can be released after merging.

@codecov
Copy link

codecov bot commented Jun 28, 2022

Codecov Report

Merging #284 (76f1375) into main (60c080c) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #284      +/-   ##
==========================================
+ Coverage   94.88%   94.94%   +0.06%     
==========================================
  Files           7        7              
  Lines         665      673       +8     
==========================================
+ Hits          631      639       +8     
  Misses         34       34              
Impacted Files Coverage Δ
src/Tables.jl 88.00% <100.00%> (ø)
src/namedtuples.jl 100.00% <100.00%> (ø)
src/matrix.jl 100.00% <0.00%> (ø)
src/fallbacks.jl 98.60% <0.00%> (+<0.01%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@CarloLucibello
Copy link
Contributor Author

This PR should probably provide appropriate implementations for rowtable, columntable, dictrowtable, dictcolumntable, table as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implement getrows for their own types).

I'm totally unfamiliar with this package so I was hoping to just get the stub in to get things started. I can have a look around and see if I can manage to write those implementations as well.

@bkamins
Copy link
Member

bkamins commented Jun 28, 2022

I can have a look around and see if I can manage to write those implementations as well.

These should not be that hard. Of course we can do it later.

@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?

src/Tables.jl Outdated Show resolved Hide resolved
src/Tables.jl Outdated
Return one or more rows from table `x` according to the position(s) specified by `inds`:

- If `inds` is a single integer return a row object.
- If `inds` is a collection of integers, return a table object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- If `inds` is a collection of integers, return a table object.
- If `inds` is a collection of integers, return an indexable object of rows

Just a bit more specific; one of the main motivations for this was to ensure users can get an in-memory/indexable collection of rows in a consistent way.

Copy link
Member

@nalimilan nalimilan Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we want to require Tables.getrows to return a indexable collection of rows? That would mean that e.g. for a NamedTuple of vectors getrows couldn't return another NamedTuple of vectors, even if that's the most efficient representation. IOW this is against the implementation that this PR adds for ColumnTable. :-)

src/namedtuples.jl Outdated Show resolved Hide resolved
src/namedtuples.jl Outdated Show resolved Hide resolved
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start; thanks @CarloLucibello. Sorry I've been so slow on getting around to this or responding on the original issue. Just a little cleanup here and if you'll add some tests, I think it's ready to go. Once we merge, I can take a look at doing the implementation for the other table types in Tables.jl.

@CarloLucibello
Copy link
Contributor Author

Done! Can you unlock CI?

@CarloLucibello
Copy link
Contributor Author

good to go?

src/Tables.jl Outdated Show resolved Hide resolved
src/Tables.jl Outdated
Return one or more rows from table `x` according to the position(s) specified by `inds`:

- If `inds` is a single integer return a row object.
- If `inds` is a collection of integers, return a table object.
Copy link
Member

@nalimilan nalimilan Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we want to require Tables.getrows to return a indexable collection of rows? That would mean that e.g. for a NamedTuple of vectors getrows couldn't return another NamedTuple of vectors, even if that's the most efficient representation. IOW this is against the implementation that this PR adds for ColumnTable. :-)

@nalimilan
Copy link
Member

nalimilan commented Jul 1, 2022

@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?

Indeed, why not have getrows return a RowTable for row-oriented tables and a ColumnTable for column-oriented tables? For row-oriented tables, a simple solution would be to call Tables.rows, ensure the result is a vector (and if not call collect on it), and then call getindex or view on it. For column-oriented tables, call Tables.columns, and for each column ensure it is a vector (calling collect on it if necessary) and call getindex or view on it. In both cases, instead of calling collect it would be better in term of memory use to allocate a vector of the final size and fill it on the fly, but it's more tricky to do so maybe not worth it for a first fallback implementation.

EDIT: actually what I described is just calling getrows(Tables.rowtable(tbl), ...)) when Table.isrowtable(tbl) and getrows(Tables.columntable(tbl), ...)) otherwise. So that's quite trivial to implement.

@ablaom
Copy link
Contributor

ablaom commented Aug 2, 2022

@quinnj 🙏🏾

Co-authored-by: Milan Bouchet-Valat <[email protected]>
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM; @bkamins, can I ask you to take one more look here before merging? Sorry to let this sit for so long; I've been tied up with a lot f stuff.

src/Tables.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Aug 28, 2022

I have left one comment about allowed inds + Tables.jl version needs to be bumped.

Co-authored-by: Bogumił Kamiński <[email protected]>
@quinnj quinnj merged commit d7cea3b into JuliaData:main Aug 30, 2022
@nalimilan
Copy link
Member

@quinnj What about throwing an error as discussed above?

@bkamins
Copy link
Member

bkamins commented Aug 30, 2022

I agree with @nalimilan that it is better to throw an error - at least for the reference implementations (so users, by looking at the code can see what is expected).

As I have commented above - I think the simplest thing to check is if the returned object is RowTable or ColumnTable respectively.

@quinnj
Copy link
Member

quinnj commented Sep 5, 2022

Ok, but here's an issue I've found digging into this further (and introducing the error checks mentioned); the current implementation of getrows for ColumnTable doesn't follow the getrows docs, since it returns a ColumnTable instead of return an indexable object of rows..

The point of this is we always get either a single row or collection of rows, right? Even if the input is column-oriented? Or should we update the docs and say if inds is a collection, you get a subset of original input, row or column oriented.

@quinnj
Copy link
Member

quinnj commented Sep 5, 2022

And indeed, @nalimilan pointed this exact issue out here.

One idea is that we could provide a generic ColumnsRows struct that would basically be a wrapper for any result of Tables.columns that provided an "indexable collection of rows" interface to the underlying columns (essentially iterating ColumnsRow that we already use in the generic Tables.rows fallback for column oriented tables).

I might take a stab at this approach and see what it looks like.

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

Or should we update the docs and say if inds is a collection, you get a subset of original input, row or column oriented.

This is what I think we should do. For example for DataFrames.jl I imagine that we will have the following implementation:

Tables.getrows(df::AbstractDataFrame, inds; view=nothing) = view === true ? view(df, inds, :) : df[inds, :]

as it seems most natural (and in general - it will be more efficient for a given table type to decide what to do; I assume that people will expect Tables.getrows to be efficient).


I would set the issue of "indexable collection of rows" and "indexable collection of columns" (for completeness as separate case) since it is orthogonal to the Tables.getrows design. I think we should provide such an API (again, in DataFrames.jl this API is provided by the eachrow and eachcol functions, but we could choose some other names here, as in Base Julia eachrow and eachcol were added later and they only support iteration interface but not indexing interface).

@quinnj
Copy link
Member

quinnj commented Sep 6, 2022

So given that we want this to be more of a general "subset" function, I'd like to change the name from getrows. 1) I think Tables.getrows is too similar to Tables.rows and it seems hard to explain the difference since they seem so similar and 2) for column-oriented tables, you're not really getting "rows" back, but you're just getting a subset of the original table.

I'm going to propose we call this Tables.subset to be more clear.

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

Tables.subset is very good for me as it is consistent in terminology with DataFrames.jl :).
@ablaom @ToucheSir @CarloLucibello - are you OK with this?

Alternatively we could have 2 functions:

  • Table.getrow to get a single row (and it would accept a single index)
  • Table.subset to get a subset of a table and return a table (and it would expect a collection of indices)

@quinnj
Copy link
Member

quinnj commented Sep 6, 2022

Ok, cleaned up implementation here: #292

@CarloLucibello
Copy link
Contributor Author

Tables.subset is a reasonable choice, fine for me

@ablaom
Copy link
Contributor

ablaom commented Sep 6, 2022

Mmm. I'm not super happy with removing "rows" from the the name. If I'm new to tables in julia, "subset" could mean anything. I similarly never liked name like "select" . Select what? I want to know which axis I'm slicing on.

How about rowsubset ?

@ablaom ablaom mentioned this pull request Sep 6, 2022
@quinnj
Copy link
Member

quinnj commented Sep 6, 2022

But that's kind of the point of our discussion/decision above: this function isn't necessarily returning "rows" if the original table input is column-oriented; rather, it returns a "subset" of the original table, preserving whether the original table was column or row-oriented. Thus, Tables.subset (note the Tables. prefix!) seems clearest to what the function is actually doing.

@ablaom
Copy link
Contributor

ablaom commented Sep 6, 2022

Yes, I get that it is not returning "rows" in the sense of not returning a row iterator. But I'm not sure general users equate "rows" with "row iterator", maybe just Tables.jl developers do that 😉 .

I also appreciate these name decisions are very difficult to get right and to get consensus on. Just putting in my two cents.

And I agree that aligning with DataFrames - which most users will know well - makes sense.

@ToucheSir
Copy link

One could always go maximally explicit with something like Tables.subset_along_row_dim, but I imagine folks would not be terribly enthused about that name :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants