Use pqarrow in zio/parquetio #4547

Merged
nwt merged 1 commit into main from pqarrow
Apr 27, 2023
Conversation

Member

@nwt nwt commented Apr 25, 2023

Reading and writing are much faster with pqarrow than with github.com/fraugster/parquet-go. Its only apparent drawback is that it offers no easy way to support Zed's duration and float16 types, so writing a value that contains either one fails with a cryptic error:

$ echo '{a:1.(float16)}' | zq -f parquet -
parquetio: unsupported type: not implemented yet

Closes #764, closes #4278, and closes #4527.

@nwt nwt requested a review from a team April 25, 2023 19:47
Contributor

philrz commented Apr 25, 2023

@nwt: Regarding the lack of float16 and duration support, in your travels did you happen to find any open issues for the library pointing at this limitation? If not, would it be appropriate to open our own? There definitely seem to be net positives from moving to the new lib, but since we're still losing a little, it might be nice if there were issues I could watch via Notifications so we know if/when things change. Given that it's two libraries that both purport to offer "Parquet" support, I'd hope they'd one day converge, though maybe I shouldn't hold my breath. 😄

Member Author

nwt commented Apr 26, 2023

@philrz: Our current support for these types is entirely on our side. There's no support for them in github.com/fraugster/parquet-go (I don't count the INTERVAL converted type because of its month component) and no issue for either in github.com/apache/arrow, and I doubt there ever will be given the pace of progress on apache/parquet-format#43, apache/parquet-format#165, and apache/parquet-format#184.

Contributor

philrz commented Apr 26, 2023

This comment is basically a note-to-self to summarize an offline conversation @nwt and I had, as I may need to refer back to this in the future. What I now understand is that even though our prior Parquet support was able to output what started as Zed values of the float16 or duration types, what ended up in the Parquet files were actually float32 and interval. In essence we've never had true "round-trip" support with Parquet (i.e., if you wrote out values of those types with our old Parquet writer and then read that Parquet back in with Zed, you'd not get back values with Zed float16 or duration types) so if it could be said anything was lost here in terms of functionality it was arguably only ever a placebo. As for the "cryptic error" that appears now, it's understood that ideally it would point to the specific type/value that caused it to choke, but the work involved to create the ideal error message appears non-trivial so we're keen to defer that effort until users actually bump into it.

@nwt nwt merged commit deea4a4 into main Apr 27, 2023
@nwt nwt deleted the pqarrow branch April 27, 2023 22:30