
Parquet wildcard writing #489

Open

Description

@alex-zaitsev

INSERT INTO s3('s3://<my_bucket>/myfiles*.parquet')

This would automatically split the data across multiple files, reusing the existing min_insert_block_size_rows / min_insert_block_size_bytes settings to decide where a new file starts. Should close ClickHouse#41537.
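A rough sketch of how this could look from the user's side. This is hypothetical proposed syntax, not something ClickHouse supports today; the table name and setting value are placeholders, and only the wildcard path and the min_insert_block_size_* settings come from the proposal above.

-- the single * in the path would be replaced by a file index for each flushed block
INSERT INTO FUNCTION s3('s3://<my_bucket>/myfiles*.parquet')
SELECT * FROM events
SETTINGS min_insert_block_size_rows = 10000000;

-- expected output, one file per written block:
--   myfiles0.parquet
--   myfiles1.parquet
--   ...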

For reference, other systems implement this as follows:

BigQuery

The path must contain exactly one wildcard * anywhere in the leaf directory of the path string, for example ../aa/*, ../aa/b*c, ../aa/*bc, and ../aa/bc*. BigQuery replaces * with 0000..N depending on the number of files exported. BigQuery determines the file count and sizes. If BigQuery decides to export two files, then * in the first file's filename is replaced by 000000000000, and * in the second file's filename is replaced by 000000000001.
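In BigQuery this is exposed through the EXPORT DATA statement; a minimal sketch, assuming a Parquet export (the bucket, dataset, and table names are placeholders):

EXPORT DATA
  OPTIONS (
    uri = 'gs://my_bucket/export/myfiles-*.parquet',  -- exactly one * wildcard
    format = 'PARQUET',
    overwrite = true)
AS
SELECT * FROM mydataset.mytable;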

Redshift

to 's3://amzn-s3-demo-bucket/unload/venue_pipe_'

By default, UNLOAD writes one or more files per slice. Assuming a two-node cluster with two slices per node, the previous example creates these files in amzn-s3-demo-bucket as follows:
venue_pipe_0000_part_00
venue_pipe_0001_part_00
venue_pipe_0002_part_00
venue_pipe_0003_part_00
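The statement that produces this layout is roughly the following, following the pattern in the Redshift documentation (the IAM role ARN is a placeholder):

UNLOAD ('select * from venue')
TO 's3://amzn-s3-demo-bucket/unload/venue_pipe_'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Note that Redshift appends the slice and part suffixes to the prefix itself rather than substituting an explicit wildcard in the path.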
