Warning
The order of the chromosomes in the bedgraph file have to be consistent with the order specified for the assembly in the negspy repository.
input_file=~/Downloads/phastCons100way.txt.gz;
output_file=~/Downloads/phastConst100way_ordered.txt;
chromnames=$(awk '{print $1}' ~/projects/negspy/negspy/data/hg19/chromInfo.txt);
for chr in $chromnames;
do echo ${chr};
zcat $input_file | grep "\t${chr}\t" >> $output_file;
done;
Assume we have an input file that has id chr start end value1 value2
pairs:
location chrom start end copynumber segmented 1:2900001-3000000 1 2900001 3000000 -0.614 -0.495 1:3000001-3100000 1 3000001 3100000 -0.407 -0.495 1:3100001-3200000 1 3100001 3200000 -0.428 -0.495 1:3200001-3300000 1 3200001 3300000 -0.437 -0.495
We can aggregate this file by recursively summing adjacent values. We have to
indicate which column corresponds to the chromosome (--chromosome-col 2
),
the start position (--from-pos-col 3
), the end position (--to-pos-col 4
)
and the value column (--value-col 5
). We specify that the first line
of the data file contains a header using the (--has-header
) option.
clodius aggregate bedgraph \
test/sample_data/cnvs_hw.tsv \
--output-file ~/tmp/cnvs_hw.hitile \
--chromosome-col 2 \
--from-pos-col 3 \
--to-pos-col 4 \
--value-col 5 \
--assembly grch37 \
--nan-value NA \
--transform exp2 \
--has-header
The dataset used in this example contains copy number data that has been log2 transformed. That is, the copy number given for each bin is the log2 of the computed value. This is a problem for HiGlass's default aggregation method of summing adjacent values since \log_2 a + \log_2 b \neq \log_2 ab.
Using the --transform exp2
option tells clodius to raise two to the
power of the provided value before doing the transformation and storing. As
an added benefit, NaN values become apparent in the resulting because they
have values of 0.
NaN (not a number) values in the input file can be specified using the
--nan-value
option. For example, --nan-value NA
indicates that
whenever NA is encountered as a value it should be treated as NaN. In the
current implementation, NaN values are simply treated as 0. In the future, they
should be assigned a special value so that they are ignored by HiGlass.
When NaN values are aggregated by summing, they are treated as 0 when added to another number. When two NaN values are added to each other, however, the result is Nan.
Sometimes, we just want to count the number of NaN values in the file. The
--count-nan
option effectively treats NaN values as 1 and all other values
as 0. This makes it possible to display a track showing how many NaN values are
present in each interval. It also makes it possible to create compound tracks
which use that information to normalize track values.