Add kwarg to filter columns #412

Open
wants to merge 6 commits into
base: main

JoaoAparicio
Contributor

Currently there is no option to load only a subset of the columns.
This matters, for example, when decompression is the bottleneck.

For example, create a compressed Arrow file:

```julia
using Arrow
p = tempname();
N = 1_000_000
tbl = (
    a=rand(N),
    b=rand(N),
    c=rand(N),
    d=rand(N),
    e=rand(N),
    # nested column: each row is a vector of 0 to 100 random values
    f=[rand(rand(0:100)) for _ in 1:N],
);
Arrow.write(p, tbl; compress=:zstd);
```

Column `f` is by far the largest: it holds an expected 50*N elements versus N
for each of the others. Sometimes we only care about some of the other columns,
but currently we must decompress every column regardless:
```julia
using BenchmarkTools
@btime tbl = Arrow.Table(p);  # 359.205 ms (530 allocations: 794.23 MiB)
```
With this commit we can load only the columns we ask for:
```julia
@btime tbl = Arrow.Table(p; filtercolumns=["a"]);  # 6.146 ms (231 allocations: 14.33 MiB)
```
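
As a rough check of the "expected 50*N elements" figure above, the sketch below draws the row lengths of `f` the same way the example table does and counts values per column group. It uses only the standard library and does not touch Arrow; the variable names (`lens`, `f_values`, `other_values`) are illustrative, and the counts are of raw values before compression, so they only approximate how the decompression work is split.

```julia
using Random, Statistics

Random.seed!(1)              # fixed seed so repeated runs give the same numbers
N = 1_000_000
lens = rand(0:100, N)        # per-row lengths of `f`, drawn as in the example table

f_values = sum(lens)         # total number of Float64 values stored in `f`
other_values = 5 * N         # columns `a` through `e` hold N values each

println("mean length of an f row: ", mean(lens))
println("share of values in f:    ", f_values / (f_values + other_values))
```

The mean row length comes out close to 50, so `f` holds roughly 50/55 (about 91%) of all values in the table, which is in line with skipping it accounting for most of the time saved in the benchmark.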
@JoaoAparicio
Contributor Author

#340
#353

@JoaoAparicio
Contributor Author

Converting this to draft as I'm working on something that will supersede this.

@codecov-commenter

codecov-commenter commented Nov 4, 2023

Codecov Report

Merging #412 (bc9169e) into main (787768f) will decrease coverage by 1.67%.
The diff coverage is 15.58%.

@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   87.45%   85.78%   -1.67%     
==========================================
  Files          26       26              
  Lines        3283     3356      +73     
==========================================
+ Hits         2871     2879       +8     
- Misses        412      477      +65     

| Files | Coverage Δ |
|---|---|
| src/table.jl | 81.97% <15.58%> (-10.52%) ⬇️ |

@JoaoAparicio JoaoAparicio marked this pull request as ready for review November 4, 2023 01:43
@JoaoAparicio
Contributor Author

Does anyone want to re-run CI? Looks like the macOS job got stuck.

@kou
Member

kou commented Nov 27, 2023

Done.
