(COpy Number variation Genotyping in Ancient genomes and low-coverage sequencing data)
CONGA is a genotyping algorithm for Copy Number Variations (large deletions and duplications) in ancient genomes. It is tailored for calling homozygous and heterozygous CNV genotypes at low depths of coverage using read-depth and read-pair information from a BAM file with Illumina short single-end reads.
Please feel free to send me an e-mail or, better yet, open an issue for your questions.
CONGA is developed and tested on Ubuntu Linux. It requires the following libraries:
- htslib (included as submodule; http://htslib.org/)
- libbz2, liblzma, libcurl are required by htslib
- sonic (included as submodule; https://github.com/calkan/sonic)
Installing development libraries (requires sudo access): "sudo apt-get install zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev"
git clone https://github.com/asylvz/CONGA --recursive
cd CONGA && make libs && make
./conga -i myinput.bam --ref human_g1k_v37.fasta --sonic human_g1k_v37.sonic \
--dels known_dels.bed --dups known_dups.bed --out myoutput
- If you use an x86_64 Linux machine, you can directly use our binary file (under "Releases") after running "sudo chmod 755 conga_v1.0_X86_64"
If you do not have root access to install liblzma and/or libbz2, you can compile htslib without CRAM support. However, the libz library is still required; please ask your system administrator if it is not available on your system.
make nocram
./conga-nocram -i myinput.bam --ref human_g1k_v37.fasta --sonic human_g1k_v37.sonic \
--dels known_dels.bed --dups known_dups.bed --out myoutput
Alternatively, you can run CONGA using Docker:
cd docker
docker build . -t conga:latest
Your image named "conga" should now be ready. You can run CONGA with this image as follows:
docker run --user=$UID -v /home/projects/conga:/input -v /home/projects/conga:/output conga -i /input/myinput.bam --sonic /input/human_g1k_v37.sonic --ref /input/human_g1k_v37.fasta --dels /input/known_dels.bed --dups /input/known_dups.bed --out /output/mydockertest
Alternatively, you can pull from Docker Hub:
docker pull asylvz/conga
CONGA requires a SONIC file as input (--sonic). This file contains annotations for the reference genome that you use. Prebuilt SONIC files are available at: https://github.com/BilkentCompGen/sonic-prebuilt
- human_g1k_v37.sonic: SONIC file for Human Reference Genome GRCh37 (1000 Genomes Project version)
- ucsc_hg19.sonic: SONIC file for the human reference genome, UCSC version build hg19.
- ucsc_hg38.sonic: SONIC file for the human reference genome build 38.
If you are working with a different reference genome, you need to create the SONIC file yourself. This is a straightforward process; please refer to the SONIC development repository: https://github.com/calkan/sonic/
You can use the "svcalls.sh" script under /scripts to generate CNV calls from the 1000 Genomes Project Phase 3 SV call set. The known CNVs passed to --dels and --dups are listed in a BED file such as:
1 668630 850204
1 963826 974172
1 1171539 1179729
1 1249799 1265722
1 2374226 2379823
...
- The columns are "Chromosome Name" (TAB) "Start Position of a CNV" (TAB) "End Position of a CNV"
- Use separate files for duplications and deletions if both are to be genotyped.
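The three-column layout described above can be sanity-checked before running CONGA. The following is a minimal sketch, not part of CONGA: the function name check_cnv_bed is hypothetical, and it assumes TAB-separated columns with integer coordinates where the start is smaller than the end.

```shell
# Hypothetical helper (not shipped with CONGA): validate a known-CNV BED file.
# Assumes three TAB-separated columns: chromosome, start, end (start < end).
check_cnv_bed() {
    awk -F'\t' '
        # Flag lines with the wrong column count, non-integer coordinates,
        # or a start position that is not smaller than the end position.
        NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        # Exit status 0 if every line passed, 1 otherwise.
        END { exit bad + 0 }
    ' "$1"
}
```

For example, "check_cnv_bed known_dels.bed && check_cnv_bed known_dups.bed" succeeds only if both files pass the check.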
Using a mappability file (--mappability) increases the accuracy of CONGA's predictions. We used the 100-mer mappability file from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/ and converted the bigWig file into a BED file using "bigWigToBedGraph". The resulting file looks like this:
1 63913643 63913648 0.2
1 63913648 63913649 0.25
1 63913649 63913653 0.5
1 63913653 63913659 0.333333
...
- The columns are "Chromosome Name" (TAB) "Start Position of an interval" (TAB) "End Position of an interval" (TAB) "Mappability value"
- The mappability value must lie in the range [0,1], where lower values indicate less mappable intervals (e.g., repeat-rich regions).
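The column layout and value range described above can be checked with a short awk filter. This is a sketch under stated assumptions, not part of CONGA: the function name check_mappability_bed and the file names in the comments are hypothetical, and the file is assumed to be TAB-separated.

```shell
# Hypothetical helper (not shipped with CONGA): validate a mappability BED file.
# Assumes four TAB-separated columns: chromosome, start, end, mappability in [0,1].
# Such a file can be produced from a bigWig with the UCSC tool, for example:
#   bigWigToBedGraph input.bigWig mappability.bed   (file names are placeholders)
check_mappability_bed() {
    awk -F'\t' '
        # Flag lines with the wrong column count, bad coordinates,
        # a non-numeric mappability value, or a value above 1.
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 ||
        $4 !~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ || $4 + 0 > 1 {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        # Exit status 0 if every line passed, 1 otherwise.
        END { exit bad + 0 }
    ' "$1"
}
```

The number pattern accepts only unsigned values, so negative mappability values are rejected along with values greater than 1.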
--input [BAM file] : Input file in sorted and indexed BAM format. (required)
--out [output prefix] : Prefix for the output file names. (required)
--ref [reference genome] : Reference genome in FASTA format. (required)
--sonic [sonic file] : SONIC file that contains assembly annotations. (required)
--dels [bed file] : Known deletion SVs in BED format
--dups [bed file] : Known duplication SVs in BED format
--mappability [bed file] : Mappability file in BED format
--first-chr [chromosome index] : The index of the first chromosome for genotyping in your BAM.
--last-chr [chromosome index] : The index of the last chromosome for genotyping in your BAM.
--min-read-length [integer] : Minimum length of a read to be processed for RP (default: 60 bp)
--min-sv-size [integer] : Minimum length of a CNV (default: 1000 bp)
--min-mapq [integer] : Minimum mapping quality threshold for reads (default: -1)
--c-score [float] : Minimum c-score to filter variants (More conservative with lower values, default: 0.5).
--rp [integer] : Enable split-read and set minimum read-pair support for a duplication (Suggested for >5x only).
Information:
--version : Print version and exit.
--help : Print this help screen and exit.
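Since --input expects a sorted and indexed BAM file, a quick pre-flight check for the index can save a failed run. The sketch below is not part of CONGA: the function name bam_has_index is hypothetical, and it only looks for the common index file names next to the BAM without verifying the sort order.

```shell
# Hypothetical pre-flight check (not part of CONGA): is there an index next to the BAM?
# Looks for the usual index names: file.bam.bai, file.bai, or file.bam.csi.
# This only tests that an index file exists; it does not verify sort order.
bam_has_index() {
    [ -f "$1.bai" ] || [ -f "${1%.bam}.bai" ] || [ -f "$1.csi" ]
}

# If the check fails, the BAM can be sorted and indexed with samtools:
#   samtools sort -o myinput.sorted.bam myinput.bam
#   samtools index myinput.sorted.bam
```

For example, "bam_has_index myinput.bam || echo 'index missing'" reports a BAM file that still needs indexing.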
Arda Söylev, Sevim Seda Çokoglu, Dilek Koptekin, Can Alkan, and Mehmet Somel. "CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data." PLOS Computational Biology 18, no. 12 (2022): e1010788. https://doi.org/10.1371/journal.pcbi.1010788