wwliao/hprc_release2_variant_calling

Variant Calling for the HPRC Release 2

This repository contains Nextflow workflows for assembly-based and HiFi-based variant calling for 231 individuals in the Human Pangenome Reference Consortium (HPRC) Release 2. It includes workflows for aligning assemblies and HiFi reads to reference genomes, as well as workflows for calling small and structural variants from these alignments using multiple tools. Additionally, index files are provided for easy access to assemblies, HiFi reads, alignments, and variant callsets.

Overview

We used Winnowmap (v2.03) to align assemblies and HiFi reads to two reference genomes, GRCh38_no_alt and CHM13v2. The exception is PAV, for which we used Minimap2 (v2.26) instead of Winnowmap. For HiFi reads carrying CpG methylation information (MM and ML tags), we retained those tags in the aligned BAM files.

Variants were then called from these alignments with the callers listed in the status table in this section. For structural variant (SV) callers, we relaxed the calling criteria to maximize recall by setting the following parameters (where applicable):

  • Minimum MAPQ: 5
  • Minimum read support: 3
  • Minimum SV length: 30 bp
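As an illustration only, the three thresholds can be read together as one filter over a candidate call. The sketch below is not taken from any caller or from the workflows in this repository: the `RELAXED` dictionary, the `passes_relaxed_criteria` function, and its arguments are hypothetical names, and each real tool applies these settings through its own options and internal logic.

```python
# Hypothetical names; only the three threshold values come from the list above.
RELAXED = {"min_mapq": 5, "min_support": 3, "min_sv_len": 30}


def passes_relaxed_criteria(supporting_read_mapqs, sv_len, criteria=RELAXED):
    """Return True if a candidate SV meets all three relaxed thresholds.

    supporting_read_mapqs: MAPQ of each read that supports the candidate.
    sv_len: SV length in bp (may be negative for deletions).
    """
    # Only reads at or above the MAPQ cutoff count toward read support.
    usable_reads = sum(
        1 for mapq in supporting_read_mapqs if mapq >= criteria["min_mapq"]
    )
    return (
        usable_reads >= criteria["min_support"]
        and abs(sv_len) >= criteria["min_sv_len"]
    )
```

Under these assumptions, a 30 bp insertion supported by three reads with MAPQ 5 or higher is kept, while a 29 bp event or one with only two usable reads is dropped.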

The table below summarizes the current status for each variant caller, shown as the number of samples completed out of 231 for each reference genome:

| Caller | Method Type | GRCh38_no_alt | CHM13v2 |
|---|---|---|---|
| CuteSV-asm (v2.1.1) | Assembly-based | 231/231 | 231/231 |
| Dipcall (v0.3) | Assembly-based | 231/231 | 231/231 |
| PAV (v2.4.6) | Assembly-based | 231/231 | 231/231 |
| SVIM-asm (v1.0.3) | Assembly-based | 231/231 | 231/231 |
| SVision-pro-asm (v2.4) | Assembly-based | 231/231 | 231/231 |
| DeepVariant (v1.6.1) | HiFi-based | 231/231 | 231/231 |
| CuteSV (v2.1.1) | HiFi-based | 231/231 | 231/231 |
| DeBreak (v1.3) | HiFi-based | 231/231 | 231/231 |
| Delly (v1.3.2) | HiFi-based | 231/231 | 231/231 |
| PBSV (v2.10.0) | HiFi-based | 231/231 | 231/231 |
| Sawfish (v0.12.8) | HiFi-based | 231/231 | 231/231 |
| Sniffles (v2.5.3) | HiFi-based | 231/231 | 231/231 |
| SVDSS (v2.0.0) | HiFi-based | 231/231 | 231/231 |
| SVIM (v2.0.0) | HiFi-based | 231/231 | 231/231 |
| SVision-pro (v2.4) | HiFi-based | 231/231 | 231/231 |

Notes

  • We modified diploid_calling.py in CuteSV-asm (v2.1.1) to support user-defined haplotype names. The modified version is available here.

  • The GT field in SVision-pro-asm (v2.4) VCF output is incorrect (see issue). If needed, use the RNAMES field in INFO to infer the genotype.

  • DeepVariant (v1.8.0) had a bug when used with CHM13v2 (see issue). To ensure consistency, we switched to v1.6.1 for both reference genomes as a workaround.

  • CuteSV (v2.1.1) occasionally reports variants with a position of zero (see issue). We applied a post-processing step to remove these records before sorting to avoid issues with BCFtools.

  • SVDSS (v2.0.0) currently calls only INS and DEL. The GT field in the VCF output is unreliable and should not be used.

  • We modified SVIM (v2.0.0) to fix errors:

    1. Replaced scipy's linkage with fastcluster's linkage for hierarchical clustering.
    2. Updated legendHandles to legend_handles for Matplotlib compatibility.

    The modified version can be found here.
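The position-zero clean-up mentioned for CuteSV in the notes above could look roughly like the following. This is a minimal sketch operating on uncompressed VCF text lines, not the repository's actual post-processing step; the function name `drop_zero_position_records` is hypothetical.

```python
def drop_zero_position_records(vcf_lines):
    """Yield VCF lines, skipping data records whose POS column is 0."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header and meta lines pass through unchanged
            continue
        fields = line.split("\t", 2)  # CHROM, POS, remainder of the record
        if len(fields) > 1 and fields[1] == "0":
            continue  # drop the record with a position of zero
        yield line
```

Running a filter like this before sorting matches the order described in the note: the offending records are removed first, and the remaining records are then sorted.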

Index Files

| Index Type | Description | File Name |
|---|---|---|
| Assemblies | List of all assemblies included in HPRC Release 2 | assemblies_pre_release_v0.6.1.index.csv |
| HiFi Reads | List of all PacBio HiFi reads used in variant calling | hifi_reads.index.csv |
| Assembly-to-Reference Alignments | List of assembly alignments to reference genomes | assembly_alignments.index.csv |
| HiFi-to-Reference Alignments | List of HiFi read alignments to reference genomes | hifi_alignments.index.csv |
| Variant Callsets | List of all variant callsets generated for each sample | variant_callsets.index.csv |
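Because the index files are CSV, rows for a given sample or caller can be selected with a few lines of Python. The sketch below is an assumption-laden example: the column names of the real index files are not listed here, so the helper takes the column name as an argument and fails loudly if it is absent. The function name `select_rows` and the `sample_id` and `path` columns in the usage example are hypothetical.

```python
import csv
import io


def select_rows(csv_text, column, value):
    """Return the rows of a CSV index whose `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if column not in header:
        raise KeyError(f"column {column!r} not in header: {header}")
    return [row for row in reader if row[column] == value]


# Usage with made-up data; check the header of the real file first.
example = "sample_id,path\nsampleA,s3://example-bucket/a.vcf.gz\n"
rows = select_rows(example, "sample_id", "sampleA")
```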

How to Download Files

To download files from the AWS S3 bucket, install the AWS Command Line Interface (AWS CLI) if you haven't already, then run:

aws s3 --no-sign-request cp <s3_path> .
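When many files need to be fetched, the same command can be assembled and run from a script. The following is a sketch under stated assumptions: the AWS CLI is installed and on `PATH`, and the function names `build_download_command` and `download` are hypothetical helpers, not part of this repository.

```python
import subprocess


def build_download_command(s3_path, dest="."):
    """Build the argument list for the unsigned `aws s3 cp` call shown above."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    return ["aws", "s3", "--no-sign-request", "cp", s3_path, dest]


def download(s3_path, dest="."):
    """Run the download; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(build_download_command(s3_path, dest), check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.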

Reference Genomes

We provide two reference genome packages: GRCh38_no_alt.tar.gz and CHM13v2.tar.gz. Each package contains all necessary files for the workflows in this repository.

File Structure

After extracting GRCh38_no_alt.tar.gz or CHM13v2.tar.gz, you will find the following files:

  • <reference>.fa: The FASTA file containing the reference genome.
  • <reference>.fa.fai: Index file for the FASTA reference genome.
  • <reference>.fmd: FMD index file for the FASTA reference genome.
  • <reference>.PAR.bed: BED file specifying pseudo-autosomal regions (PARs).
  • <reference>.expected_cn.XX.bed: BED file for expected copy numbers in female samples.
  • <reference>.expected_cn.XY.bed: BED file for expected copy numbers in male samples.
  • <reference>.TRF.bed: BED file annotating tandem repeats.
  • repetitive_k15.txt: Text file listing repetitive k-mers (k=15) pre-computed using meryl.
  • repetitive_k19.txt: Text file listing repetitive k-mers (k=19) pre-computed using meryl.

Replace <reference> with GRCh38_no_alt or CHM13v2, depending on the genome package you are using.
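After extraction, it can be useful to confirm that every listed file is present before launching a workflow. The sketch below builds the expected file names from the list above and reports any that are missing. It assumes the files sit directly in the directory passed in, which may not match the archive layout; `expected_files` and `missing_files` are hypothetical helper names.

```python
from pathlib import Path

# Suffixes appended to the reference name, taken from the file list above.
REFERENCE_SUFFIXES = [
    ".fa",
    ".fa.fai",
    ".fmd",
    ".PAR.bed",
    ".expected_cn.XX.bed",
    ".expected_cn.XY.bed",
    ".TRF.bed",
]
# Files whose names do not depend on the reference.
SHARED_FILES = ["repetitive_k15.txt", "repetitive_k19.txt"]


def expected_files(reference):
    """Return the file names expected for GRCh38_no_alt or CHM13v2."""
    return [reference + suffix for suffix in REFERENCE_SUFFIXES] + SHARED_FILES


def missing_files(directory, reference):
    """Return expected file names that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in expected_files(reference) if not (base / name).is_file()]
```

For example, `missing_files("GRCh38_no_alt", "GRCh38_no_alt")` returns an empty list when the package is complete under the layout assumption above.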

GRCh38_no_alt.fa

The GRCh38_no_alt.fa file is derived from GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. It was decompressed and renamed to GRCh38_no_alt.fa. This version:

  • Excludes ALT contigs.
  • Has the PARs on chrY hard-masked.
  • Uses the rCRS mitochondrial sequence.
  • Includes the Epstein-Barr Virus (EBV) sequence.

For more details, see Heng Li’s blog post, "Which human reference genome to use?"

CHM13v2.fa

The CHM13v2.fa file is derived from chm13v2.0_maskedY_rCRS.fa.gz. It was decompressed, modified to include the EBV sequence, and renamed to CHM13v2.fa. This version:

  • Has the PARs on chrY hard-masked.
  • Uses the rCRS mitochondrial sequence.
  • Includes the EBV sequence.
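Some of the properties listed for both references can be spot-checked from the `.fa.fai` index, whose first two tab-separated columns are the contig name and its length. The sketch below parses that text into a dictionary. The function name `read_fai` is hypothetical, and the contig name `chrM` used in the usage note is an assumption about how the mitochondrial sequence is named in these files.

```python
def read_fai(fai_text):
    """Map contig name to length from the text of a FASTA .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths
```

The rCRS mitochondrial sequence is 16,569 bp long, so a mitochondrial contig of that length (for example `read_fai(text).get("chrM") == 16569`, if the contig is named `chrM`) is consistent with rCRS being used. Checking that the PARs on chrY are hard-masked requires reading the sequence itself and is not covered by this sketch.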
