Some of the scripts here aid data inspection on the command line, avoiding
the need to fire up R or Python.
Also useful are pick (for selecting/combining/transforming/filtering tabular data)
and transpose from github.com/micans/reaper for fast and low-memory transposition
of (large) tabular data.
Another highly useful tool for command-line data wrangling is GNU datamash.
Apparix used to live here but has its own repository now.
pick used to live here but has its own repository.
It is a concise command-line query/programming tool to manipulate streamed data columns and rows.
It can be thought of as (unix) cut on steroids, augmented with aspects of R and awk.
hissyfit can be used at the end of Unix pipes (or can read from a file) to draw
terminal histograms and bar charts for quickly gauging numerical data and count
data. See the hissyfit documentation and examples.
- merge-files-col.sh
  This merges columns of files using transpose from the reaper distribution. It is quite a bit faster and much more memory efficient than a straightforward Python or Perl implementation.
- peach.c
  PArEnthesis CHecker. It's not smart about anything and will complain about things like /* my little list 1) foo 2) bar */ and "->)(<-". Still, I've found it helpful over the years. It checks {}, () and []. This is the first C program I wrote, so please be gentle.
- wordmer.pl
  Generate all words of length k over some alphabet.
- tallyho.sh
  Tally the first N sequences of a fastq file, for example to look at index reads. This uses tally from the reaper distribution.
- bubba
  Bsub/wrapper LSF submission script to take some pain away. It prints the constructed bsub command and submits it. Several options, including dry run (-T).
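
As a rough illustration of what tallyho.sh computes, the sketch below tallies the first N reads of an uncompressed fastq file using only standard tools. It is not the script itself: it swaps in sort and uniq for tally from the reaper distribution, assumes four-line fastq records, and the function name tally_first_reads is made up for this example.

```bash
# Hypothetical helper, not part of this repository.
# Usage: tally_first_reads N FILE.fastq
tally_first_reads() {
  local n=$1 fastq=$2
  # Four lines per record (assumed); the sequence is the second line of each.
  head -n "$((4 * n))" "$fastq" |
    awk 'NR % 4 == 2' |
    sort | uniq -c |
    sort -k1,1nr          # most frequent sequence first
}
```

Something like `tally_first_reads 100000 index.fastq | head` would then show the most common index reads; for large files tally is likely the better choice.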
Space/time bash functions are in .bash-workutils (see further below). Most of
these will lead to a lot of disk access; use with care. A bunch of other
miscellaneous functions have been added. Two worth picking out are:
- funcfile NAME
  Find in which file function NAME is defined.
- ls_func FILE
  List all functions defined in FILE.
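
For a feel of how such lookups can work, here is a minimal sketch; it is not the code in .bash-workutils. It assumes bash, relies on the extdebug shell option (which makes declare -F also report the line number and source file), and uses a simple sed heuristic for listing definitions. The names where_is_function and functions_in_file are hypothetical.

```bash
# Hypothetical helpers, not the funcfile/ls_func implementations.

# Print the file in which shell function NAME was defined.
where_is_function() {
  local name=$1 info
  # Enable extdebug inside the command substitution only, so the
  # caller's shell options stay untouched.
  info=$(shopt -s extdebug; declare -F "$name") || return 1
  info=${info#* }               # drop NAME, leaving "LINENO FILE"
  printf '%s\n' "${info#* }"    # drop LINENO, leaving FILE
}

# List functions defined in FILE. Heuristic: only matches definitions
# written as "name()" or "function name()" at the start of a line.
functions_in_file() {
  sed -nE 's/^[[:space:]]*(function[[:space:]]+)?([A-Za-z_][A-Za-z0-9_]*)[[:space:]]*\(\).*/\2/p' "$1"
}
```

The first helper only knows about functions already loaded in the current shell, whereas the second reads the file as text and never sources it.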
The list of functions in .bash-workutils:
achoo bj bjl colcount colnames ffn funcfile gimme_sum grab groupify
howoldami lines ls_bigold ls_count_files ls_file_spread ls_func
ls_lastfile ls_ls ls_misc ls_mouldy ls_size_any ls_size_suffix myman
nchar procli public silent tailafter ungroupify
Space/time functions in more detail:
--- ls_bigold
List directories up to a certain depth, ordered by disk usage,
with the number of days since last modified.
Argument: directory depth.
Example:
ls_bigold 2
NOTE: in a project/team root directory this may take some time and
tax the file system. Perhaps best to save the output in a file.
CAVEAT subdirectories of a directory may have changed. Use as guide!
USEFUL order the output by the third column to group directories together,
e.g. ls_bigold 2 > out.bigold; sort -k 3 out.bigold
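
A report of this kind can be approximated with du and stat. The sketch below is an assumption about the general approach, not the ls_bigold code: it needs GNU du and GNU stat, the name big_and_old is hypothetical, and the column order (size in KiB, age in days, directory) is chosen here so that the directory is the third column.

```bash
# Hypothetical helper, not the ls_bigold implementation.
# Usage: big_and_old [DEPTH]   (run from the directory to inspect)
big_and_old() {
  local depth=${1:-1} now size dir mtime
  now=$(date +%s)
  # du prints "SIZE<TAB>DIR"; largest directories first.
  du -k -d "$depth" . | sort -k1,1nr |
    while IFS=$'\t' read -r size dir; do
      mtime=$(stat -c %Y "$dir")     # GNU stat: mtime in epoch seconds
      # Age is that of the directory itself, not of anything below it.
      printf '%s\t%s\t%s\n' "$size" "$(( (now - mtime) / 86400 ))" "$dir"
    done
}
```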
--- ls_mouldy
Find directories left untouched for longer than first argument (in days)
up to a depth of second argument.
Example:
ls_mouldy 183 3
CAVEAT subdirectories of a directory may have changed. Use as guide!
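
The core of such a search fits in a single find call. The sketch below is illustrative only (stale_dirs is a hypothetical name, and -maxdepth assumes GNU or BSD find). It also shows where the caveat comes from: a directory's modification time changes only when entries are added, removed or renamed directly inside it, not when files deeper down change.

```bash
# Hypothetical helper, not the ls_mouldy implementation.
# Usage: stale_dirs DAYS DEPTH   (run from the directory to inspect)
stale_dirs() {
  local days=$1 depth=$2
  # -mtime +N: last modified more than N days ago
  # (find counts whole 24-hour periods).
  find . -mindepth 1 -maxdepth "$depth" -type d -mtime +"$days"
}
```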
--- ls_size_any
List all regular files recursively and sort by human-readable size.
First optional argument: lower bound e.g. 10M or 16k, or 0k
Second optional argument: upper bound e.g. 4k (useful for small files)
Example:
ls_size_any 10M # find files larger than 10M
ls_size_any 0k 4k # find small files
--- ls_size_suffix
Find files ending with suffix recursively, sort by human-readable size.
First argument: suffix, e.g. .cram or .fastq.gz
Second (optional) argument: a lower bound for size, e.g. 10M or 64k.
Example:
ls_size_suffix .fastq.gz
ls_size_suffix .cram 500M
ls_size_suffix .cram 1G
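
Both size listings can be approximated by combining find, du and sort. This is a sketch under stated assumptions rather than the actual functions: files_by_size is a hypothetical name, sort -h needs GNU sort (or a recent BSD sort), and du -h reports disk usage, which can differ from a file's apparent size.

```bash
# Hypothetical helper, not the ls_size_* implementations.
# Usage: files_by_size [SUFFIX] [LOWER]   e.g. files_by_size .cram 500M
files_by_size() {
  local suffix=${1:-} lower=${2:-}
  local -a args=(. -type f -name "*$suffix")
  # find rounds sizes up to whole units, so +10M means "more than 10 MiB".
  if [ -n "$lower" ]; then
    args+=(-size "+$lower")
  fi
  # Smallest first, largest last.
  find "${args[@]}" -exec du -h {} + | sort -h
}
```

An upper bound could be added with a second -size test, but note that because of the rounding `-size -4k` only matches files of up to 3 KiB.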
--- ls_file_spread
For each directory count the number of files in it, recursively.
The output is sorted by count, with a total tally added.
Useful to check if applications are well-behaved and do not
crush the file system with large numbers of files in a single directory.
Modified from code by Glenn Jackmann on stackoverflow.
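
A per-directory file count can be sketched as a short pipeline. This is not the ls_file_spread code: the name file_spread is hypothetical, only regular files directly inside each directory are counted, directories without files are not listed, and file names containing newlines would be miscounted.

```bash
# Hypothetical helper, not the ls_file_spread implementation.
# Usage: file_spread   (run from the directory to inspect)
file_spread() {
  find . -type f |
    sed 's|/[^/]*$||' |          # strip the file name, keep its directory
    sort | uniq -c |
    sort -k1,1n |                # busiest directories last
    awk '{ total += $1; print } END { print total + 0, "total" }'
}
```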
More small functions that I use are in .bash-myutils. These are slightly
more random than the ones in .bash-workutils.
bash_max bash_min countcram cpuhours debug_bash decolon ffncp ffnl fqfa
helpme nflogcmd set_farm_mem themax themax2 themin themin2 theminmax