Visit the Tools Reference main page
Visit the TSV Utilities main page
Synopsis: tsv-summarize [options] file [file...]
tsv-summarize
generates summary statistics on fields of a TSV file. A variety of statistics are supported. Calculations can run against the entire data stream or grouped by key. Consider the file data.tsv:
Make Color Time
ford blue 131
chevy green 124
ford red 128
bmw black 118
bmw black 126
ford blue 122
The min and average 'time' values for the 'make' field is generated by the command:
$ tsv-summarize --header --group-by Make --min Time --mean Time data.tsv
This produces:
Make Time_min Time_mean
ford 122 127
chevy 124 124
bmw 118 122
Using --group-by Make,Color
will group by both 'Make' and 'Color'. Omitting the --group-by
entirely summarizes fields for the full file.
The examples above specify fields by name. Fields can also be specified by field number, this works for files with and without header lines. For example:
$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv
See Field syntax for more information about specifying fields.
tsv-summarize
tries to generate useful headers, but custom headers can be specified. Example:
$ tsv-summarize --header --group-by 1 --min 3:Fastest --mean 3:Average data.tsv
Make Fastest Average
ford 122 127
chevy 124 124
bmw 118 122
Most operators take custom headers in a manner shown above, following the syntax:
--<operator-name> FIELD[:header]
Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Examples:
--median 1,5-8
--median Count,Kilograms
--median '*_seconds'
The quantile operator requires one or more probabilities after the fields:
--quantile Count:0.25 # Quantile 1 of the 'Count' field
--quantile 2-4:0.25,0.5,0.75 # Q1, Median, Q3 of fields 2, 3, 4
Summarization operators available are:
count range mad values
retain sum var unique-values
first mean stddev unique-count
last median mode missing-count
min quantile mode-count not-missing-count
max
Calculated numeric values are printed to 12 significant digits by default. This can be changed using the --p|float-precision
option. If six or less it sets the number of significant digits after the decimal point. If greater than six it sets the total number of significant digits.
Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled.
Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.
Missing values are not treated specially by default, this can be changed using the --x|exclude-missing
or --r|replace-missing
option. The former turns off processing for missing values, the latter uses a replacement value.
Options:
--h|help
- Print help.--help-verbose
- Print detailed help.--help-fields
- Print help on specifying fields.--V|version
- Print version information and exit.--g|group-by <field-list>
- Fields to use as key.--H|header
- Treat the first line of each file as a header.--w|write-header
- Write an output header even if there is no input header.--d|delimiter CHR
- Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)--v|values-delimiter CHR
- Values delimiter. Default: vertical bar (|). (Single byte UTF-8 characters only.)--p|float-precision NUM
- 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12--x|exclude-missing
- Exclude missing (empty) fields from calculations.--r|replace-missing STR
- Replace missing (empty) fields with STR in calculations.
Operators:
--count
- Count occurrences of each unique key (--g|group-by
), or the total number of records if no key field is specified.--count-header STR
- Count occurrences of each unique key, like--count
, but use STR as the header.--retain <field-list>
- Retain one copy of the field. The field header is unchanged.--first <field-list>[:STR]
- First value seen.--last <field-list>[:STR]
- Last value seen.--min <field-list>[:STR]
- Min value. (Numeric fields only.)--max <field-list>[:STR]
- Max value. (Numeric fields only.)--range <field-list>[:STR]
- Difference between min and max values. (Numeric fields only.)--sum <field-list>[:STR]
- Sum of the values. (Numeric fields only.)--mean <field-list>[:STR]
- Mean (average). (Numeric fields only.)--median <field-list>[:STR]
- Median value. (Numeric fields only. Reads all values into memory.)--quantile <field-list>:p[,p...][:STR]
- Quantiles. One or more fields, then one or more 0.0-1.0 probabilities. (Numeric fields only. Reads all values into memory.)--mad <field-list>[:STR]
- Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.)--var <field-list>[:STR]
- Variance. (Sample variance, numeric fields only).--stdev <field-list>[:STR]
- Standard deviation. (Sample st.dev, numeric fields only).--mode <field-list>[:STR]
- Mode. The most frequent value. (Reads all unique values into memory.)--mode-count <field-list>[:STR]
- Count of the most frequent value. (Reads all unique values into memory.)--unique-count <field-list>[:STR]
- Number of unique values. (Reads all unique values into memory).--missing-count <field-list>[:STR]
- Number of missing (empty) fields. Not affected by the--x|exclude-missing
or--r|replace-missing
options.--not-missing-count <field-list>[:STR]
- Number of filled (non-empty) fields. Not affected by--r|replace-missing
.--values <field-list>[:STR]
- All the values, separated by--v|values-delimiter
. (Reads all values into memory.)--unique-values <field-list>[:STR]
- All the unique values, separated by--v|values-delimiter
. (Reads all unique values into memory.)
Tip: Bash completion is very helpful when using commands like tsv-summarize
that have many options. See Enable bash-completion for details.