idea: Add parquet arrow async write/read support #4868

Closed · 3 tasks done
Xuanwo opened this issue Jul 8, 2024 · 10 comments · Fixed by #4980

Xuanwo (Member) commented Jul 8, 2024

We can implement native parquet arrow support to make our users happy:

By implementing this integration, users can avoid the low-performance AsyncRead/AsyncWrite traits or writing their own shims.

Steps: (tracked as a 3-item task list, all completed)
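
For a sense of what this unlocks, here is a minimal read-side sketch. `ParquetRecordBatchStreamBuilder` is parquet's real async arrow API; the reader passed in would be the OpenDAL-backed `AsyncFileReader` implementation this issue proposes, which does not exist yet:

```rust
use futures::StreamExt;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

// Works for any implementation of parquet's AsyncFileReader, e.g. a
// (hypothetical) adapter over an OpenDAL reader as proposed here.
async fn read_all<R>(reader: R) -> parquet::errors::Result<()>
where
    R: AsyncFileReader + Unpin + Send + 'static,
{
    // Build an async arrow RecordBatch stream directly from the reader,
    // with no AsyncRead/AsyncWrite shim in between.
    let mut stream = ParquetRecordBatchStreamBuilder::new(reader)
        .await?
        .build()?;
    while let Some(batch) = stream.next().await {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```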

WenyXu (Member) commented Aug 2, 2024

Interesting, I'll give it a try 🤩

Xuanwo (Member, Author) commented Aug 2, 2024

> Interesting, I'll give it a try 🤩

Perfect!

Xuanwo (Member, Author) commented Aug 2, 2024

Perhaps we could have an integrations/parquet crate.
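
Purely as a sketch of that idea, such a crate's manifest might look something like this; the crate name, versions, and feature list below are assumptions, not a settled layout:

```toml
# integrations/parquet/Cargo.toml (hypothetical sketch)
[package]
name = "parquet-integration"   # placeholder name
version = "0.1.0"
edition = "2021"

[dependencies]
# Versions are illustrative only.
opendal = "0.48"
parquet = { version = "52", features = ["arrow", "async"] }
```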

WenyXu (Member) commented Aug 3, 2024

I'm considering introducing multiple versions of parquet via:

[dependencies]
parquet_51 = { package = "parquet", version = "51.0", optional = true }
parquet_52 = { package = "parquet", version = "52.0", optional = true }

because we are still using parquet 51.0.0 🥲. Do you have any ideas? cc @waynexia
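
If this route were taken, the renamed dependencies could then be re-exported behind feature flags so the rest of the code uses a single `parquet` path; a small sketch (the feature names are assumptions mirroring the renamed dependencies above):

```rust
// lib.rs: select one parquet version at compile time.
#[cfg(feature = "parquet_51")]
pub use parquet_51 as parquet;

#[cfg(feature = "parquet_52")]
pub use parquet_52 as parquet;
```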

Xuanwo (Member, Author) commented Aug 3, 2024

Since arrow's releases are becoming stable, maybe we can just track upstream instead?

WenyXu (Member) commented Aug 3, 2024

> Since arrow's releases are becoming stable, maybe we can just track upstream instead?

Makes sense 🥹

WenyXu (Member) commented Aug 5, 2024

For AsyncFileReader, I'm considering introducing a feature that merges small ranges into a larger chunk (e.g., https://github.com/datafuselabs/databend/blob/a98335d33e7abfd34189e7f32c06ab34d53c64d0/src/query/storages/fuse/src/io/read/block/block_reader_merge_io_async.rs#L46-L68). Do you have any ideas?
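
A minimal sketch of the kind of range merging being described, with a configurable gap threshold; the names and parameters are illustrative, not an existing OpenDAL or parquet API:

```rust
use std::ops::Range;

/// Hypothetical helper: sort byte ranges by start, then merge consecutive
/// ranges whose gap is at most `max_gap` bytes, so many small reads become
/// a few larger ones.
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for range in ranges {
        match merged.last_mut() {
            // Close enough to the previous chunk: extend it.
            Some(last) if range.start <= last.end + max_gap => {
                last.end = last.end.max(range.end)
            }
            // Too far away: start a new chunk.
            _ => merged.push(range),
        }
    }
    merged
}

fn main() {
    // Three small column reads, ~100 bytes apart, collapse into one fetch.
    let merged = coalesce_ranges(vec![0..10, 110..120, 230..240], 128);
    assert_eq!(merged, vec![0..240]);
}
```

The caller would then issue one request per merged chunk and slice the buffer back into the original ranges; `max_gap` trades redundant bytes read for fewer round trips.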

waynexia (Member) commented Aug 5, 2024

Depending on multiple versions of parquet could bring a lot of maintenance burden, especially as the number of supported versions grows. This is usually done by maintaining multiple branches that are released separately to absorb parquet's API changes, which doesn't seem to match our situation at present either. So I'd prefer not to do this, at least at the beginning. We can reconsider when there's a real need, e.g. a specific parquet version becomes an LTS release and is widely used.

Xuanwo (Member, Author) commented Aug 6, 2024

> For AsyncFileReader, I'm considering introducing a feature that merges small ranges into a larger chunk (e.g., https://github.com/datafuselabs/databend/blob/a98335d33e7abfd34189e7f32c06ab34d53c64d0/src/query/storages/fuse/src/io/read/block/block_reader_merge_io_async.rs#L46-L68). Do you have any ideas?

Seems we can use Reader::fetch internally.
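
A rough sketch of how Reader::fetch could back parquet's AsyncFileReader, assuming opendal's Reader::read/Reader::fetch take u64 ranges and return Buffers, and using parquet 52's footer helpers; treat the exact signatures as assumptions rather than a final design:

```rust
use std::{ops::Range, sync::Arc};

use bytes::Bytes;
use futures::{future::BoxFuture, FutureExt};
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::{ParquetError, Result};
use parquet::file::footer::{decode_footer, decode_metadata};
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical adapter from an opendal::Reader to parquet's AsyncFileReader.
pub struct OpendalFileReader {
    reader: opendal::Reader,
    content_length: u64, // total file size, known up front (e.g. from stat)
}

impl AsyncFileReader for OpendalFileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        async move {
            let buf = self
                .reader
                .read(range.start as u64..range.end as u64)
                .await
                .map_err(|e| ParquetError::External(Box::new(e)))?;
            Ok(buf.to_bytes())
        }
        .boxed()
    }

    fn get_byte_ranges(&mut self, ranges: Vec<Range<usize>>) -> BoxFuture<'_, Result<Vec<Bytes>>> {
        async move {
            let ranges: Vec<Range<u64>> =
                ranges.iter().map(|r| r.start as u64..r.end as u64).collect();
            // The assumed Reader::fetch merges nearby ranges into fewer
            // requests, which is exactly the coalescing discussed above.
            let bufs = self
                .reader
                .fetch(ranges)
                .await
                .map_err(|e| ParquetError::External(Box::new(e)))?;
            Ok(bufs.into_iter().map(|b| b.to_bytes()).collect())
        }
        .boxed()
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        async move {
            // Read the fixed 8-byte footer (metadata length + "PAR1" magic)...
            let end = self.content_length;
            let footer = self
                .reader
                .read(end - 8..end)
                .await
                .map_err(|e| ParquetError::External(Box::new(e)))?
                .to_bytes();
            let footer: [u8; 8] = footer.as_ref().try_into().unwrap();
            let metadata_len = decode_footer(&footer)? as u64;
            // ...then fetch and decode the metadata block preceding it.
            let metadata = self
                .reader
                .read(end - 8 - metadata_len..end - 8)
                .await
                .map_err(|e| ParquetError::External(Box::new(e)))?
                .to_bytes();
            Ok(Arc::new(decode_metadata(&metadata)?))
        }
        .boxed()
    }
}
```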

WenyXu (Member) commented Aug 6, 2024

> Seems we can use Reader::fetch internally.

Cool!
