
Append to existing parquet file? #13

Open
dazzag24 opened this issue Nov 19, 2019 · 4 comments

Comments

@dazzag24
Contributor

Hi,

Is it possible to run csv2parquet in a way that appends to an existing parquet file? I have a large number of CSV files with a fixed schema that I'd like to convert into a single parquet file.

Thanks

@cldellow
Owner

The nature of the Parquet format is such that you can't really update an existing file. (Wellllll, maybe it's technically possible to do surgery and add new row groups and then update the footer. It'd require using relatively low-level APIs in the parquet library, though.)

We could provide an experience that mimics this, but it'd be recreating the entire file each time.

Would any of these other options work?

  1. Do nothing; the user has to create a single CSV, e.g. by:

     (cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv) > tmp.csv
     csv2parquet tmp.csv
     rm tmp.csv

  2. Permit process substitution, so we support (1) but without the intermediate file:

     csv2parquet <(cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv)

     Currently, this fails with No such file or directory: /dev/fd/63.parquet.

  3. Permit taking an arbitrary number of CSV file arguments, but use the 1st as the source of column names/types:

     csv2parquet file1.csv file2.csv file3.csv

My gut feeling is that (3) would be a nice addition, and a relatively small change to the existing code. If the concern is that you want to append to avoid doing the computation work that compresses/optimizes the parquet file, none of these is suitable, though.
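A rough sketch of what (3) could look like internally (a hypothetical helper, not the actual csv2parquet code): take the header from the first file, and treat the first row of every subsequent file as a duplicate header to drop.

```python
import csv

def rows_from_files(paths):
    """Yield one header row (taken from the first file), then every data
    row from all files, dropping each later file's leading header row."""
    header = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            first = next(reader)
            if header is None:
                header = first
                yield header  # emit the header exactly once
            # for later files, `first` is assumed to be a duplicate header
            yield from reader
```

Note the assumption: every file carries a header row, so a headerless file would silently lose its first data row.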

@dazzag24
Contributor Author

Hi,
Ah yes I recall reading somewhere that parquet is not something that naturally lends itself to simple appending, so I understand the difficulties here.

For now I had come up with some bash for loops to combine my nested dir structure and cat all the *.csv.gz files into one large .csv.gz. I was then gunzip'ing this and calling csv2parquet. So pretty much a variant of (1).

Unfortunately at this point I discovered that occasionally the rows in the CSV have extra commas, hence yesterday's pull request to help me discover which lines are broken!

So if you have lots of CSV files then (3) is good, although how would this work with xargs for example?

However, there is an argument that (2), i.e. reading from stdin, would allow you to pipe the output of zcat directly into csv2parquet without needing to decompress to disk first.

Thanks

@cldellow
Owner

cldellow commented Nov 21, 2019

So if you have lots of CSV files then (3) is good, although how would this work with xargs for example?

Wouldn't it just work? :) e.g.:

find . -type f | grep '\.csv$' | xargs csv2parquet

(apologies, I always forget the correct usage of -name, so am using grep to filter)

The only case where it'd fail is if your list of files was so big that your shell environment space was exhausted. Is that what you're getting at?

read from stdin would allow you to pipe the output of zcat directly into csv2parquet

If this is common (and it probably is), I'd support teaching csv2parquet how to sniff the file and do the gzip decompression on the fly. Process substitution has some subtle warts - it's not supported across all shells, and failures in the underlying process don't bubble up, e.g.:

$ echo <(false)
/dev/fd/63
$ echo $?
0
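Sniffing could be as simple as checking for the two gzip magic bytes at the start of the file. A sketch, under the assumption that inputs are either plain text or gzip (the helper name is hypothetical):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzip(path):
    """Open `path` for text reading, transparently decompressing
    when the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```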

@1beb

1beb commented Aug 23, 2021

I feel like this is probably done better without loading the files into Python. You can do it from the shell:

Suppose you have file1.csv, ..., file3.csv and you'd like to concatenate them so they are stacked in order 1, 2, 3.

# remove header from file2 and file3 (BSD sed syntax; on GNU sed use `sed -i 1d`)
sed -i '' 1d file2.csv
sed -i '' 1d file3.csv
# append file2 to file1
cat file2.csv >> file1.csv
# append file3 to file1
cat file3.csv >> file1.csv
