support saving meta and data separately #400

Open
asfimport opened this issue Oct 27, 2022 · 0 comments

Comments

@asfimport

I often need to create tens of millions of small dataframes and save them as Parquet files. All of these dataframes share the same column and index information, and they normally have about the same number of rows (around 300).

Because each dataframe is quite small, the Parquet metadata is relatively large in comparison. Repeating the same metadata tens of millions of times wastes a lot of disk space.

Concatenating them into one big Parquet file would save disk space, but it is not friendly to parallel processing of each small dataframe.

If I could save a single copy of the metadata in one file and have the remaining Parquet files contain only the data, the disk space would be saved and the layout would still be good for parallel processing.

It seems to me that this is possible by design, but I couldn't find any API that supports it.
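
For comparison, pyarrow documents a related pattern: write the data files normally and collect their row-group metadata into shared `_metadata` / `_common_metadata` sidecar files. This only partially addresses the request, since every data file still carries its own footer; the sketch below assumes a recent pyarrow and uses illustrative paths and data.

```python
# Sketch of the pyarrow "_metadata" / "_common_metadata" sidecar pattern.
# Paths and table contents are illustrative. Each data file still carries
# its own footer, so this reduces duplication only on the read side; it does
# not remove the repeated per-file metadata this issue asks about.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write the data files and collect the row-group metadata of each one.
metadata_collector = []
pq.write_to_dataset(
    table,
    root_path="dataset_root",
    metadata_collector=metadata_collector,
)

# _common_metadata: schema only, no row-group information.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")

# _metadata: schema plus row-group metadata gathered from all written files.
pq.write_metadata(
    table.schema,
    "dataset_root/_metadata",
    metadata_collector=metadata_collector,
)
```

Engines that understand the `_metadata` sidecar (Dask, for example) can plan reads of all files from that one file instead of opening each footer, which preserves the per-file parallelism described above.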

Reporter: lei yu

Note: This issue was originally created as PARQUET-2207. Please see the migration documentation for further details.
