Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Adopt new parquet-wasm File reader API. #2987

Merged
merged 3 commits into from
May 6, 2024

Conversation

ibgreen
Copy link
Collaborator

@ibgreen ibgreen commented May 1, 2024

No description provided.

@ibgreen ibgreen force-pushed the ib/parse-parquet-wasm-file branch 2 times, most recently from 56e0480 to 3970312 Compare May 1, 2024 10:23
@ibgreen ibgreen marked this pull request as ready for review May 4, 2024 09:58
@ibgreen ibgreen requested a review from kylebarron May 4, 2024 09:59

let parquetFile: parquetWasm.ParquetFile;
if (file.handle instanceof Blob) {
// TODO - let's assume fromFile() works on Blobs and not just on File...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll publish a patch with this fix: kylebarron/parquet-wasm#530

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


const ipcStream = wasmTable.intoIPCStream();
const arrowTable = arrow.tableFromIPC(ipcStream);
const stream: ReadableStream<arrow.Table> = await parquetFile.stream(options);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a stream of wasm RecordBatch objects. You need to manage the movement to JS for each batch yourself. You can use the intoIPCStream method plus arrow.tableFromIPC

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a stream of wasm RecordBatch objects. You need to manage the movement to JS for each batch yourself.

Thanks, looks very promising indeed, will come back to this in separate PR as integrating it will need a bit more plumbing in loaders.gl

Nit: Any chance you could add a method that returns an async iterator so that I don't have to wrap things? Generally the Stream callback API is more fiddly to work with and AFAIK, stream incompatibilities between Node and browser are still a thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that looks annoying.

Right now I'm using the wasm-streams library to return a ReadableStream. It doesn't look like there's any built-in support for returning an async iterator.

I have a note for myself (kylebarron/parquet-wasm#307) to point users to a polyfill. In testing https://observablehq.com/d/f5723cea6661fb71 a long time ago, I used this minimal polyfill, but I don't really understand what's happening

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I suspected you were relying on some wasm support library. The polyfill you copies is similar to but cleaner than the one I added: https://github.com/visgl/loaders.gl/blob/master/modules/parquet/src/lib/utils/make-stream-iterator.ts.

but I don't really understand what's happening

Yes that is the problem - converting between streams and iterators is way too obscure for most users.

  • At least Node streams define [Symbol.asyncIterator] and can be used as iterators, but the browser WhatWG streams can not.

@ibgreen ibgreen force-pushed the ib/parse-parquet-wasm-file branch from 3970312 to 9e89811 Compare May 6, 2024 10:37
@ibgreen ibgreen merged commit 4d79f93 into master May 6, 2024
1 check passed
@ibgreen ibgreen deleted the ib/parse-parquet-wasm-file branch May 6, 2024 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants