Step 2: Generation of chromosome arrays

Luca Santuari edited this page Feb 5, 2020 · 1 revision

The script chr_array expects the following inputs:

  • a bigWig file containing the mappability track for 151 bp reads, generated with the GEM library using the SGE script (see the files GRCh37.151mer.bw and GRCh38.151mer.bw for the two versions of the reference genome, respectively). The GEM library is currently not available on conda; the conda package named Gem is a different program. Open question (AK): why not use it in combination with ucsc-bedgraphtobigwig (bioconda)? Reply (LS): the GEM library needs to be added to conda first.
  • a split_reads.json.gz file generated with split_reads
  • a clipped_reads.json.gz file generated with clipped_reads
  • a [chromosome_name]_coverage.npy.gz file for each chromosome, generated with coverage
  • a [chromosome_name]_clipped_read_distance.json.gz file for each chromosome, generated with clipped_read_distance
  • a [chromosome_name]_snv.npy.gz file for each chromosome, generated with snv
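
As a quick pre-flight check, the file names listed above can be generated and tested for existence before running chr_array. This is only a minimal sketch: it assumes all files sit in one directory and that chromosome names are passed exactly as they appear in the file names. The helper names `expected_inputs` and `missing_inputs` are illustrative and not part of the pipeline, and the mappability bigWig is left out because its location is not specified here.

```python
from pathlib import Path

# Genome-wide inputs and per-chromosome suffixes, as listed above.
GENOME_WIDE = ["split_reads.json.gz", "clipped_reads.json.gz"]
PER_CHROMOSOME = [
    "coverage.npy.gz",
    "clipped_read_distance.json.gz",
    "snv.npy.gz",
]


def expected_inputs(input_dir, chromosomes):
    """Return the input paths for the given chromosomes (hypothetical helper)."""
    base = Path(input_dir)
    paths = [base / name for name in GENOME_WIDE]
    for chrom in chromosomes:
        # Per-chromosome files follow the [chromosome_name]_<suffix> pattern.
        paths.extend(base / f"{chrom}_{suffix}" for suffix in PER_CHROMOSOME)
    return paths


def missing_inputs(input_dir, chromosomes):
    """List the expected files that do not exist on disk."""
    return [p for p in expected_inputs(input_dir, chromosomes) if not p.exists()]
```

For example, `missing_inputs("data", ["1", "2", "X"])` would return the paths that still have to be produced by the earlier steps.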

Output: the script generates one bcolz carray for each chromosome.

Each carray has shape (chr_length, num_channels), where chr_length is the length of the chromosome and num_channels is the number of channels (46). The channels are, in order:

  1. coverage: read coverage for reads with MAPQ>=10
  2. discordant_reads_F: read coverage for forward discordant reads
  3. discordant_reads_R: read coverage for reverse discordant reads
  4. mean_read_quality: mean read mapping quality as reported by bwa
  5. median_base_quality: median base quality
  6. SNV_frequency: frequency of bases ([A,T,C,G]) that are different from the reference base. Set to 0 if the reference base is N.
  7. left_clipped_reads_F: number of left-clipped forward reads. Set at clipped position.
  8. left_clipped_reads_R: number of left-clipped reverse reads. Set at clipped position.
  9. right_clipped_reads_F: number of right-clipped forward reads. Set at clipped position.
  10. right_clipped_reads_R: number of right-clipped reverse reads. Set at clipped position.
  11. disc_right_clipped_reads_F: number of discordant right-clipped forward reads. Set at clipped position.
  12. disc_right_clipped_reads_R: number of discordant right-clipped reverse reads. Set at clipped position.
  13. disc_left_clipped_reads_F: number of discordant left-clipped forward reads. Set at clipped position.
  14. disc_left_clipped_reads_R: number of discordant left-clipped reverse reads. Set at clipped position.
  15. CIGAR_D_left_reads_F: number of forward reads containing 'D' for deletion in the CIGAR string. Minimum deletion length = 50 bp. Set at left-most position.
  16. CIGAR_D_left_reads_R: number of reverse reads containing 'D' for deletion in the CIGAR string. Minimum deletion length = 50 bp. Set at left-most position.
  17. CIGAR_D_right_reads_F: number of forward reads containing 'D' for deletion in the CIGAR string. Minimum deletion length = 50 bp. Set at right-most position.
  18. CIGAR_D_right_reads_R: number of reverse reads containing 'D' for deletion in the CIGAR string. Minimum deletion length = 50 bp. Set at right-most position.
  19. CIGAR_I_right_reads_F: number of forward reads containing 'I' for insertion in the CIGAR string. Minimum insertion length = 50 bp. Set at left-most position.
  20. CIGAR_I_right_reads_R: number of reverse reads containing 'I' for insertion in the CIGAR string. Minimum insertion length = 50 bp. Set at left-most position.
  21. left_split_reads_F: number of left-split forward reads. Set at split position.
  22. left_split_reads_R: number of left-split reverse reads. Set at split position.
  23. right_split_reads_F: number of right-split forward reads. Set at split position.
  24. right_split_reads_R: number of right-split reverse reads. Set at split position.
  25. INV_before: number of clipped reads with FF or RR orientation, where the mate is mapped before the read.
  26. INV_after: number of clipped reads with FF or RR orientation, where the mate is mapped after the read.
  27. DUP_before: number of clipped reads with FR orientation, where the mate is mapped before the read.
  28. DUP_after: number of clipped reads with RF orientation, where the mate is mapped after the read.
  29. TRA_opposite: number of clipped reads with FR or RF orientation, where read and mate are mapped on different chromosomes.
  30. TRA_same: number of clipped reads with FF or RR orientation, where read and mate are mapped on different chromosomes.
  31. Forward_Left_ClippedRead_distance_median: median paired end distance for left-clipped forward reads.
  32. Forward_Right_ClippedRead_distance_median: median paired end distance for right-clipped forward reads.
  33. Forward_All_Reads_distance_median: median paired end distance for all forward reads.
  34. Reverse_Left_ClippedRead_distance_median: median paired end distance for left-clipped reverse reads.
  35. Reverse_Right_ClippedRead_distance_median: median paired end distance for right-clipped reverse reads.
  36. Reverse_All_Reads_distance_median: median paired end distance for all reverse reads.
  37. L_SplitRead_distance_median_F: median split read distance for forward split reads. Set at left-most position.
  38. L_SplitRead_distance_median_R: median split read distance for reverse split reads. Set at left-most position.
  39. R_SplitRead_distance_median_F: median split read distance for forward split reads. Set at right-most position.
  40. R_SplitRead_distance_median_R: median split read distance for reverse split reads. Set at right-most position.
  41. Mappability: GEM Mappability
  42. One_hot_encoding_A: 1 for A, 0 otherwise
  43. One_hot_encoding_T: 1 for T, 0 otherwise
  44. One_hot_encoding_C: 1 for C, 0 otherwise
  45. One_hot_encoding_G: 1 for G, 0 otherwise
  46. One_hot_encoding_N: 1 for N, 0 otherwise
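
When working with the arrays it can help to look channels up by name rather than by position. The sketch below simply transcribes the list above into a Python list and builds a name-to-column mapping. It assumes the columns of the carray follow the listed order and are indexed from zero (so channel 1 in the list is column 0); `CHANNEL_NAMES` and `CHANNEL_INDEX` are illustrative names, not identifiers taken from the script.

```python
# Channel names in the order given above (list position 1 -> column 0).
CHANNEL_NAMES = [
    "coverage",
    "discordant_reads_F", "discordant_reads_R",
    "mean_read_quality", "median_base_quality", "SNV_frequency",
    "left_clipped_reads_F", "left_clipped_reads_R",
    "right_clipped_reads_F", "right_clipped_reads_R",
    "disc_right_clipped_reads_F", "disc_right_clipped_reads_R",
    "disc_left_clipped_reads_F", "disc_left_clipped_reads_R",
    "CIGAR_D_left_reads_F", "CIGAR_D_left_reads_R",
    "CIGAR_D_right_reads_F", "CIGAR_D_right_reads_R",
    "CIGAR_I_right_reads_F", "CIGAR_I_right_reads_R",
    "left_split_reads_F", "left_split_reads_R",
    "right_split_reads_F", "right_split_reads_R",
    "INV_before", "INV_after",
    "DUP_before", "DUP_after",
    "TRA_opposite", "TRA_same",
    "Forward_Left_ClippedRead_distance_median",
    "Forward_Right_ClippedRead_distance_median",
    "Forward_All_Reads_distance_median",
    "Reverse_Left_ClippedRead_distance_median",
    "Reverse_Right_ClippedRead_distance_median",
    "Reverse_All_Reads_distance_median",
    "L_SplitRead_distance_median_F", "L_SplitRead_distance_median_R",
    "R_SplitRead_distance_median_F", "R_SplitRead_distance_median_R",
    "Mappability",
    "One_hot_encoding_A", "One_hot_encoding_T", "One_hot_encoding_C",
    "One_hot_encoding_G", "One_hot_encoding_N",
]

# Zero-based column index for each channel name (assumed column order).
CHANNEL_INDEX = {name: i for i, name in enumerate(CHANNEL_NAMES)}
```

With an array `arr` of shape (chr_length, num_channels), `arr[:, CHANNEL_INDEX["Mappability"]]` would then select the mappability column, provided the zero-based ordering assumption holds.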

Discordant reads are reads whose is_proper_pair flag is set to False by bwa. For Illumina short-insert paired-end reads, a proper pair has FR orientation and an insert size below a certain value; see 'Estimating Insert Size Distribution' in the bwa manual.
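
The proper-pair status is stored in bit 0x2 of the SAM flag, so the definition above can be sketched directly on the integer flag. This is an illustration rather than the code used by the pipeline: the function name `is_discordant` is hypothetical, and treating unpaired reads as non-discordant is an assumption made here.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_PAIRED = 0x1       # read is part of a pair
FLAG_PROPER_PAIR = 0x2  # aligner marked the pair as properly aligned


def is_discordant(flag):
    """True for a paired read whose proper-pair bit is not set."""
    # Assumption: unpaired reads are not counted as discordant.
    return bool(flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR)
```

For instance, a read with flag 99 (paired, proper pair, mate reverse, first in pair) is not discordant, while a read with flag 65 (paired, first in pair, proper-pair bit unset) is.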