tonivade/pq

pq

Parquet query tool

Objective

Create a tool similar to jq but for parquet files.

Installation

The preferred installation method is to download one of the precompiled binaries from the releases page on GitHub. You will find several different files there.

Native images have been generated using graalvm-ce 21.0.2.

Current version is 0.5.0.

Usage

You can list all the available subcommands with the help subcommand:

$ ./pq help
Usage: pq [-v] [COMMAND]
parquet query tool
  -v, --verbose   enable debug logs
Commands:
  count     print total number of rows in parquet file
  schema    print schema of parquet file
  read      print content of parquet file in json format
  metadata  print metadata of parquet file
  write     create a parquet file from a jsonl stream and a schema
  help      Display help information about the specified command.
Copyright(c) 2023 by @tonivade

count

Print the number of rows in a file:

$ ./pq help count
Usage: pq count [-v] [--filter=PREDICATE] FILE
print total number of rows in parquet file
      FILE                 parquet file
      --filter=PREDICATE   predicate to apply to the rows
  -v, --verbose            enable debug logs

Example:

$ ./pq count example.parquet
1000

Filter rows

You can count the rows that match a filter this way:

$ ./pq count --filter 'gender == "Male"' example.parquet
451
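
When pq is called from a script, the double quotes inside the predicate have to survive quoting. The following is a minimal Python sketch, assuming the `./pq` binary and the `example.parquet` file used above; the helper names `count_args` and `count_rows` are hypothetical and not part of pq. Passing the arguments as a list avoids an intermediate shell, so the predicate reaches pq unchanged.

```python
import subprocess

def count_args(parquet_file, predicate=None, pq="./pq"):
    """Build the argument list for `pq count` (hypothetical helper)."""
    args = [pq, "count"]
    if predicate is not None:
        # Same form as the shell example: --filter PREDICATE
        args += ["--filter", predicate]
    args.append(parquet_file)
    return args

def count_rows(parquet_file, predicate=None, pq="./pq"):
    """Run `pq count` and return the printed row count as an int."""
    result = subprocess.run(
        count_args(parquet_file, predicate, pq),
        capture_output=True, text=True, check=True,
    )
    # Assumption: the count is the only thing written to standard output,
    # as in the example above.
    return int(result.stdout.strip())
```

With the file from the example, `count_rows("example.parquet", 'gender == "Male"')` should return the same 451 that the shell command prints.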

schema

Print parquet schema.

$ ./pq help schema
Usage: pq schema [-v] [--select=COLUMN[,COLUMN...]]... FILE
print schema of parquet file
      FILE        parquet file
      --select=COLUMN[,COLUMN...]
                  list of columns to select
  -v, --verbose   enable debug logs

Example:

$ ./pq schema example.parquet
message spark_schema {
  optional int32 id;
  optional binary first_name (STRING);
  optional binary last_name (STRING);
  optional binary email (STRING);
  optional binary gender (STRING);
  optional binary ip_address (STRING);
  optional binary cc (STRING);
  optional binary country (STRING);
  optional binary birthdate (STRING);
  optional double salary;
  optional binary title (STRING);
  optional binary comments (STRING);
}

Select columns

You can select the columns you want to include in the schema using the --select option.

$ ./pq schema --select id,first_name,email example.parquet
message spark_schema {
  optional int32 id;
  optional binary first_name (STRING);
  optional binary email (STRING);
}
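
The schema output is plain text, so a script may need to turn it into a column map. The following is a rough Python sketch that only covers flat schemas like the one above; the function name `parse_flat_schema` is hypothetical, and nested groups or parameterized types are not handled.

```python
import re

# Matches flat field lines such as: optional binary first_name (STRING);
FIELD = re.compile(
    r"^\s*(?P<repetition>required|optional|repeated)\s+"
    r"(?P<type>\w+)\s+(?P<name>\w+)"
    r"(?:\s+\((?P<logical>[^)]*)\))?\s*;\s*$"
)

def parse_flat_schema(text):
    """Return {column: (repetition, physical type, logical type or None)}."""
    columns = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:  # the "message ... {" and "}" lines do not match
            columns[match["name"]] = (
                match["repetition"], match["type"], match["logical"]
            )
    return columns

# Output of the --select example above.
schema = """\
message spark_schema {
  optional int32 id;
  optional binary first_name (STRING);
  optional binary email (STRING);
}
"""
print(parse_flat_schema(schema))
```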

read

Print file content.

$ ./pq help read
Usage: pq read [-v] [--index] [--filter=PREDICATE] [--format=JSON|CSV]
               [--get=ROW] [--head=ROWS] [--skip=ROWS] [--tail=ROWS]
               [--select=COLUMN[,COLUMN...]]... FILE
print content of parquet file in json format
      FILE                 parquet file
      --filter=PREDICATE   predicate to apply to the rows
      --format=JSON|CSV    output format, json or csv
      --get=ROW            print just the row with given index
      --head=ROWS          get the first N number of rows
      --index              print row index
      --select=COLUMN[,COLUMN...]
                           list of columns to select
      --skip=ROWS          skip a number N of rows
      --tail=ROWS          get the last N number of rows
  -v, --verbose            enable debug logs

Example:

$ ./pq read example.parquet
{"id":1,"first_name":"Amanda","last_name":"Jordan","email":"[email protected]","gender":"Female","ip_address":null,"cc":"6759521864920116","country":"Indonesia","birthdate":"3/8/1971","salary":49756.53,"title":"Internal Auditor","comments":"1E+02"}
...

Select columns

You can select the columns you want to include in the output using the --select option:

$ ./pq read --select id,first_name,email example.parquet
{"id":1,"first_name":"Amanda","email":"[email protected]"}
...

Filter rows

You can print only the rows that match a filter this way:

$ ./pq read --filter 'gender == "Male"' example.parquet
{"id":2,"first_name":"Albert","last_name":"Freeman","email":"[email protected]","gender":"Male","ip_address":"218.111.175.34","cc":"","country":"Canada","birthdate":"1/16/1968","salary":150280.17,"title":"Accountant IV","comments":""}
...
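
Because the JSON output shown above has one object per line, it can be consumed line by line from another program. The following is a small Python sketch, assuming that one-object-per-line layout; `parse_rows` is a hypothetical helper, and the sample rows are abbreviated from the examples above with some columns omitted.

```python
import json

def parse_rows(lines):
    """Yield one dict per non-empty line of JSON output."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Abbreviated rows taken from the examples above (some columns omitted).
sample = [
    '{"id":1,"first_name":"Amanda","gender":"Female","ip_address":null,"salary":49756.53}',
    '{"id":2,"first_name":"Albert","gender":"Male","ip_address":"218.111.175.34","salary":150280.17}',
]
rows = list(parse_rows(sample))

# JSON null becomes None; the comparison mirrors the filter 'gender == "Male"'.
males = [row["first_name"] for row in rows if row["gender"] == "Male"]
print(males)
```

In a script, the lines would typically come from the standard output of `./pq read`, for instance through `subprocess.run(..., capture_output=True, text=True).stdout.splitlines()`.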

metadata

Print file metadata.

$ ./pq help metadata
Usage: pq metadata [-v] [--show-blocks] FILE
print metadata of parquet file
      FILE            parquet file
      --show-blocks   show block metadata info
  -v, --verbose       enable debug logs

Example:

$ ./pq metadata example.parquet
"createdBy":parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
"count":1000

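The metadata output above is a list of `"key":value` lines. The following is a minimal Python sketch for reading it into a dictionary, assuming every line keeps that shape; `parse_metadata` is a hypothetical helper, and the layout of the extra lines printed by --show-blocks is not covered here.

```python
def parse_metadata(text):
    """Parse lines of the form "key":value into a dict of strings."""
    metadata = {}
    for line in text.splitlines():
        if not line.startswith('"'):
            continue  # skip anything that is not a "key":value line
        # Drop the opening quote, then split at the first closing quote and colon.
        key, sep, value = line[1:].partition('":')
        if sep:
            metadata[key] = value
    return metadata

# Output copied from the example above.
output = '''"createdBy":parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
"count":1000
'''
meta = parse_metadata(output)
print(meta["createdBy"], int(meta["count"]))
```
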
write

Creates a parquet file from a jsonl/csv file and a schema.

$ ./pq help write
Usage: pq write [-v] [--format=JSON|CSV] [--schema=FILE] FILE
create a parquet file from a jsonl stream and a schema
      FILE                destination parquet file
      --format=JSON|CSV   input format, json or csv
      --schema=FILE       file with schema definition
  -v, --verbose           enable debug logs
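
The help text does not show how the input reaches the command, so the following Python sketch rests on two unconfirmed assumptions: that the jsonl stream is read from standard input, and that the schema file uses the same text format the schema subcommand prints. The helpers `to_jsonl` and `write_args` and the file names `out.parquet` and `schema.txt` are hypothetical.

```python
import json

def to_jsonl(rows):
    """Serialize dicts as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(row, separators=(",", ":")) + "\n" for row in rows)

def write_args(destination, schema_file, pq="./pq"):
    """Build the argument list for `pq write` (hypothetical helper)."""
    return [pq, "write", f"--schema={schema_file}", destination]

rows = [
    {"id": 1, "first_name": "Amanda"},
    {"id": 2, "first_name": "Albert"},
]
payload = to_jsonl(rows)
print(payload, end="")

# Assumption: pq reads the stream from standard input, for example:
# subprocess.run(write_args("out.parquet", "schema.txt"),
#                input=payload, text=True, check=True)
```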

License

This project is released under the MIT License.
