Tiny utilities

Tiny scripts that work on a single column of data. Some of these transform a single column of their input while passing everything through, some produce summary tables, and some produce a single summary value. All (so far) are in awk, and all have the common options header=0 which specifies the number of header lines on input to skip, skip_comment=1 which specifies whether to skip comment lines on input that begin with #, and col=1, which specifies which column of the input stream should be examined. Since they are awk scripts, they also have the standard variables for specifying the input field separator FS="\t" and the output field separator OFS="\t". Default output column separator is "\t".

Set any of these variables by using key=value on the command line. For example to find the median of the third column of numbers, when the first 10 lines of input are header:

median col=3 header=10 your.dat

Stick these in a pipeline that ends with spark for quick visual summaries. If indels.vcf.gz is a compressed VCF file containing indel calls, then this will print a sparkline of indel sizes in the range of ±10bp:

$ zcat indels.vcf.gz \
| stripfilt \
| awk '{print length($5)-length($4)}' \
| inrange abs=10 \
| hist \
| cut -f2 \
| spark
▁▁▁▁▁▁▁▁▂█▁▇▂▁▁▁▁▁▁▁▁

We get the second column of hist output because that's the counts. This clearly shows the overabundance of single-base indels, and a slight overrepresentation of single-base deletions over insertions.

Transformers: output same as input with single column transformed

boolify : transform a column into 0 or 1 based on its current value

cumsum : replace a column with its cumulative sum

log : transform a column into its natural logarithm

log10 : transform a column into its base-10 logarithm

mult : multiply a column by a given factor

round : round values to a given number of digits= after the decimal point

Filters: output same as input with a subset of lines selected

inrange : pass through lines for which the value of a column falls within a given range of values

inrange col=3 abs=10 your.dat | ... # column 3 is between -10 and 10 inclusive
inrange min=0 max=1000 your.dat | ...  # column 1 is between 0 and 1000 inclusive
inrange min=10000 your.dat | ... # column 1 is at least 10000 inclusive

stripfilt : strip header and comment lines beginning with #, or only pass headers and comment lines; can include empty/whitespace lines

stripfilt your.dat | ... # remove default 1-line header and comments
stripfilt inverse=1 skip_comment=0 your.dat | ... # pass through only the header
stripfilt inverse=1 header=0 your.dat | ... # pass through only comments
stripfilt skip_blank=1 your.dat | ... # also remove empty and whitespace-only lines

Condensers: output condensed from and some function of the input

diffs : produce successive pairwise numeric differences: 2nd - 1st, 3rd - 2nd, etc. Length of output is length of data in input column - 1.

ncol : print the number of columns in each line

Tablifiers: count summaries of input

hist : create a count histogram from a numeric column, grouping values into integer bins of [ i, i + 1). Bins within the input range not having values in the input are printed with a count of 0. To protect against potential errors in input or huge output, there must be more than sparse=0.01 fraction of the input range occupied otherwise a message is printed instead of the full histogram; use override=1 to override this behavior. Use drop_zero=1 to drop zero-valued bins from the output; this option implies override=1.

table : count the occurrences of unique values in a column and print a table of the values and their counts

Summarizers: calculate summary values

mean : ... of a column

median : ... of a column

min : ... of a column

max : ... of a column

range : min and max of a column, separated by a tab

sum : ... of a column

More examples

$ cat tests/tinyutils.dat
7
9
3
12.2
0
12
9
4

$ boolify tests/tinyutils.dat
1
1
1
1
0
1
1
1

$ cumsum tests/tinyutils.dat
7
16
19
31.2
31.2
43.2
52.2
56.2

$ diffs tests/tinyutils.dat
2
-6
9.2
-12.2
12
-3
-5

$ hist tests/tinyutils.dat
0	1
1	0
2	0
3	1
4	1
5	0
6	0
7	1
8	0
9	2
10	0
11	0
12	2

$ inrange min=1 max=8 tests/tinyutils.dat
7
3
4

$ inrange abs=4 tests/tinyutils.dat
3
0
4

$ log tests/tinyutils.dat
1.94591
2.19722
1.09861
2.50144
-inf
2.48491
2.19722
1.38629

$ log10 tests/tinyutils.dat
0.845098
0.954243
0.477121
1.08636
-inf
1.07918
0.954243
0.60206

$ max tests/tinyutils.dat
12.2

$ mean tests/tinyutils.dat
7.025

$ median tests/tinyutils.dat
8

$ min tests/tinyutils.dat
0

$ mult mult=2 tests/tinyutils.dat
14
18
6
24.4
0
24
18
8

$ range tests/tinyutils.dat
0	12.2

$ round digits=0 tests/tinyutils.dat
7
9
3
12
0
12
9
4

$ stripfilt tests/tinyutils.dat  # default is a single-line header
9
3
12.2
0
12
9
4

$ stripfilt inverse=1 tests/tinyutils.dat
7

$ sum tests/tinyutils.dat
56.2

$ table tests/tinyutils.dat
3	1
4	1
7	1
9	2
12	1
12.2	1
0	1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tiny utilities

Transformers: output same as input with single column transformed

Filters: output same as input with a subset of lines selected

Condensers: output condensed from and some function of the input

Tablifiers: count summaries of input

Summarizers: calculate summary values

More examples

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
boolify		boolify
cumsum		cumsum
diffs		diffs
hist		hist
inrange		inrange
log		log
log10		log10
max		max
mean		mean
median		median
min		min
mult		mult
ncol		ncol
range		range
round		round
stripfilt		stripfilt
sum		sum
table		table

License

devsullivan/tinyutils

Folders and files

Latest commit

History

Repository files navigation

Tiny utilities

Transformers: output same as input with single column transformed

Filters: output same as input with a subset of lines selected

Condensers: output condensed from and some function of the input

Tablifiers: count summaries of input

Summarizers: calculate summary values

More examples

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages