Skip to content

Latest commit

 

History

History
89 lines (68 loc) · 3.54 KB

bedgraph.rst

File metadata and controls

89 lines (68 loc) · 3.54 KB

BedGraph files

Warning

The order of the chromosomes in the bedgraph file have to be consistent with the order specified for the assembly in the negspy repository.

Ordering the chromosomes in the input file

input_file=~/Downloads/phastCons100way.txt.gz;
output_file=~/Downloads/phastConst100way_ordered.txt;
chromnames=$(awk '{print $1}' ~/projects/negspy/negspy/data/hg19/chromInfo.txt);
for chr in $chromnames;
    do echo ${chr};
    zcat $input_file | grep "\t${chr}\t" >> $output_file;
done;

Aggregation by addition

Assume we have an input file that has id chr start end value1 value2 pairs:

location        chrom   start   end     copynumber      segmented
1:2900001-3000000       1       2900001 3000000 -0.614  -0.495
1:3000001-3100000       1       3000001 3100000 -0.407  -0.495
1:3100001-3200000       1       3100001 3200000 -0.428  -0.495
1:3200001-3300000       1       3200001 3300000 -0.437  -0.495

We can aggregate this file by recursively summing adjacent values. We have to indicate which column corresponds to the chromosome (--chromosome-col 2), the start position (--from-pos-col 3), the end position (--to-pos-col 4) and the value column (--value-col 5). We specify that the first line of the data file contains a header using the (--has-header) option.

clodius aggregate bedgraph          \
    test/sample_data/cnvs_hw.tsv    \
    --output-file ~/tmp/cnvs_hw.hitile \
    --chromosome-col 2              \
    --from-pos-col 3                \
    --to-pos-col 4                  \
    --value-col 5                   \
    --assembly grch37               \
    --nan-value NA                  \
    --transform exp2                \
    --has-header

Data Transform

The dataset used in this example contains copy number data that has been log2 transformed. That is, the copy number given for each bin is the log2 of the computed value. This is a problem for HiGlass's default aggregation method of summing adjacent values since \log_2 a + \log_2 b \neq \log_2 ab.

Using the --transform exp2 option tells clodius to raise two to the power of the provided value before doing the transformation and storing. As an added benefit, NaN values become apparent in the resulting because they have values of 0.

NaN Value Identification

NaN (not a number) values in the input file can be specified using the --nan-value option. For example, --nan-value NA indicates that whenever NA is encountered as a value it should be treated as NaN. In the current implementation, NaN values are simply treated as 0. In the future, they should be assigned a special value so that they are ignored by HiGlass.

When NaN values are aggregated by summing, they are treated as 0 when added to another number. When two NaN values are added to each other, however, the result is Nan.

NaN Value Counting

Sometimes, we just want to count the number of NaN values in the file. The --count-nan option effectively treats NaN values as 1 and all other values as 0. This makes it possible to display a track showing how many NaN values are present in each interval. It also makes it possible to create compound tracks which use that information to normalize track values.