
Append to existing parquet file? #13

Open
dazzag24 opened this issue Nov 19, 2019 · 4 comments

Comments

@dazzag24
Contributor

Hi,

Is it possible to run csv2parquet in a way that appends to an existing parquet file? I have a large number of CSV files with a fixed schema that I'd like to convert into a single parquet file.

Thanks

@cldellow
Owner

The nature of the Parquet format is such that you can't really update an existing file. (Wellllll, maybe it's technically possible to do surgery and add new row groups and then update the footer. It'd require using relatively low-level APIs in the parquet library, though.)

We could provide an experience that mimics this, but it'd be recreating the entire file each time.

Would any of these other options work?

  1. Do nothing; the user has to create a single CSV, e.g. by:

     (cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv) > tmp.csv
     csv2parquet tmp.csv
     rm tmp.csv

  2. Permit process substitution, so we support (1) but without the intermediate file:

     csv2parquet <(cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv)

     Currently, this fails with No such file or directory: /dev/fd/63.parquet.

  3. Permit taking an arbitrary number of CSV file arguments, but use the 1st as the source of column names/types:

     csv2parquet file1.csv file2.csv file3.csv

My gut feeling is that (3) would be a nice addition, and a relatively small change to the existing code. If the concern is that you want to append to avoid doing the computation work that compresses/optimizes the parquet file, none of these is suitable, though.
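A rough sketch of what (3) could look like internally (a hypothetical helper, not the actual csv2parquet code): take the header from the first file, and treat the first row of every subsequent file as a duplicate header to drop.

```python
import csv

def rows_from_files(paths):
    """Yield one header row (taken from the first file), then every data
    row from all files, dropping each later file's leading header row."""
    header = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            first = next(reader)
            if header is None:
                header = first
                yield header  # emit the header exactly once
            # for later files, `first` is assumed to be a duplicate header
            yield from reader
```

Note the assumption: every file carries a header row, so a headerless file would silently lose its first data row.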

@dazzag24
Contributor Author

Hi,
Ah yes I recall reading somewhere that parquet is not something that naturally lends itself to simple appending, so I understand the difficulties here.

For now I had come up with some bash for loops to combine my nested dir structure and cat all the *.csv.gz files into one large .csv.gz. I was then gunzip'ing this and calling csv2parquet. So pretty much a variant of (1).

Unfortunately at this point I discovered that occasionally the rows in the CSV have extra commas, hence yesterday's pull request to help me discover which lines are broken!

So if you have lots of CSV files then (3) is good, although how would this work with xargs for example?

However, there is an argument that (2), i.e. reading from stdin, would allow you to pipe the output of zcat directly into csv2parquet without needing to decompress to disk first.

Thanks

@cldellow
Owner

cldellow commented Nov 21, 2019

So if you have lots of CSV files then (3) is good, although how would this work with xargs for example?

Wouldn't it just work? :) e.g.:

find . -type f | grep '\.csv$' | xargs csv2parquet

(apologies, I always forget the correct usage of -name, so am using grep to filter)

The only case where it'd fail is if your list of files was so big that your shell environment space was exhausted. Is that what you're getting at?

read from stdin would allow you to pipe the output of zcat directly into csv2parquet

If this is common (and it probably is), I'd support teaching csv2parquet how to sniff the file and do the gzip decompression on the fly. Process substitution has some subtle warts - it's not supported across all shells, and failures in the underlying process don't bubble up, e.g.:

$ echo <(false)
/dev/fd/63
$ echo $?
0
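Sniffing could be as simple as checking for the two gzip magic bytes at the start of the file. A sketch, under the assumption that inputs are either plain text or gzip (the helper name is hypothetical):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzip(path):
    """Open `path` for text reading, transparently decompressing
    when the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```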

@1beb

1beb commented Aug 23, 2021

I feel like this is probably done better without loading the files into Python. You can do it from the shell:

Suppose you have file1.csv, ..., file3.csv and you'd like to concatenate them so they are stacked in order 1, 2, 3.

# remove header from file2 and file3 (BSD sed syntax; on GNU sed use `sed -i 1d`)
sed -i '' 1d file2.csv
sed -i '' 1d file3.csv
# append file2 to file1
cat file2.csv >> file1.csv
# append file3 to file1
cat file3.csv >> file1.csv
