The current working directory is here:
/SAN/vyplab/UCLex/
Scripts in pipeline:
annovar_vcf_combine_VP.py
case_control.R
combinegVCF.sh
crunch_controls.pl
crunch_data.R
custom_filtering.pl
deprecated
make_matrix_calls.pl
msample_calling.sh
process_multiVCF.R
split_data_by_chromosomes.R
Files created by pipeline:
Created by GenotypeGVCFs:
mainset_September2015_chr9.vcf.gz
mainset_September2015_chr9.vcf.gz.tbi
Created by Annovar:
mainset_September2015_chr9_db
mainset_September2015_chr9_db.log
GATK filter and VSQR:
mainset_September2015_chr9_filtered.vcf.idx
mainset_September2015_chr9_indels_filtered.vcf.gz
mainset_September2015_chr9_indels_filtered.vcf.gz.tbi
mainset_September2015_chr9_indels.vcf.gz.tbi
mainset_September2015_chr9_recal_plots_snps.R
mainset_September2015_chr9_recal_plots_snps.R.pdf
mainset_September2015_chr9_SNPs_combrec.recal
mainset_September2015_chr9_SNPs_combrec.recal.idx
mainset_September2015_chr9_SNPs_combtranch
mainset_September2015_chr9_SNPs_filtered.vcf.gz.tbi
mainset_September2015_chr9_SNPs.vcf.gz.tbi
combinegVCF.sh
The following are all steps in msample_calling.sh
:
Input:
if [[ "$format" == "v1" ]]; then gVCF=${path}/chr${chr}/${id}.gvcf.gz; fi
if [[ "$format" == "v2" ]]; then gVCF=${path}/${id}-chr${chr}.gvcf.gz; fi
if [[ "$format" == "v3" ]]; then gVCF=${path}/chr${chr}.gvcf.gz; fi
Output:
${output}_chr${chr}.vcf.gz
Code:
$java -Xmx${memoSmall}g -jar $GATK -R $fasta \\
-T GenotypeGVCFs \\
-L $chr -L $target --interval_set_rule INTERSECTION --interval_padding 100 \\
--annotation InbreedingCoeff --annotation QualByDepth --annotation HaplotypeScore --annotation MappingQualityRankSumTest --annotation ReadPosRankSumTest --annotation FisherStrand \\
--dbsnp ${bundle}/dbsnp_137.b37.vcf \\" >> cluster/submission/subscript_chr${chr}.sh
while read path id format; do
if [[ "$format" == "v1" ]]; then gVCF=${path}/chr${chr}/${id}.gvcf.gz; fi
if [[ "$format" == "v2" ]]; then gVCF=${path}/${id}-chr${chr}.gvcf.gz; fi
if [[ "$format" == "v3" ]]; then gVCF=${path}/chr${chr}.gvcf.gz; fi
echo "Including $gVCF"
if [ ! -s $gVCF ]; then stop "Cannot find $gVCF"; fi
if [ ! -s $gVCF.tbi ]; then stop "Cannot find $gVCF.tbi"; fi
echo " --variant $gVCF \\" >> $script
done < <(tail -n +2 $gVCFlist)
echo " -o ${output}_chr${chr}.vcf.gz" >> $script
Input:
${output}_chr${chr}.vcf.gz
Output:
${output}_chr${chr}_filtered.vcf
Output:
-o ${output}_chr${chr}_indels.vcf.gz
Code:
$java -Djava.io.tmpdir=${tmpDir} -Xmx${memoSmall}g -jar ${GATK} \
-T SelectVariants \
-R $fasta \
-V ${output}_chr${chr}.vcf.gz \
-selectType INDEL \
-selectType MIXED \
-o ${output}_chr${chr}_indels.vcf.gz
Code:
$java -Djava.io.tmpdir=${tmpDir} -Xmx${memoSmall}g -jar ${GATK} \
-T VariantFiltration \
-R $fasta \
-V ${output}_chr${chr}_indels.vcf.gz \
--filterExpression \"QD < 2.0 || FS > 50.0 || ReadPosRankSum < -20.0\" \
--filterName \"FAIL\" \
-o ${output}_chr${chr}_indels_filtered.vcf.gz
Output:
${output}_chr${chr}_SNPs.vcf.gz
Code:
$java -Djava.io.tmpdir=${tmpDir} -Xmx${memoSmall}g -jar ${GATK} \
-T SelectVariants \
-R $fasta \
-V ${output}_chr${chr}.vcf.gz \
-selectType SNP \
-o ${output}_chr${chr}_SNPs.vcf.gz
Input:
${output}_chr${chr}_SNPs.vcf.gz
Output:
-recalFile ${output}_chr${chr}_SNPs_combrec.recal \
-tranchesFile ${output}_chr${chr}_SNPs_combtranch \
-rscriptFile ${output}_chr${chr}_recal_plots_snps.R
Code:
$java -Djava.io.tmpdir=${tmpDir} -Xmx${memoSmall}g -jar ${GATK} -R $fasta
-T VariantRecalibrator
-L $chr --input ${output}_chr${chr}_SNPs.vcf.gz --maxGaussians ${maxGauLoc} --mode SNP \
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 ${bundle}/hapmap_3.3.b37.vcf \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 ${bundle}/1000G_omni2.5.b37.vcf \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 ${bundle}/dbsnp_137.b37.vcf \
-an QD -an FS -an ReadPosRankSum -an InbreedingCoeff \
-tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
--minNumBadVariants ${numBad} \
-recalFile ${output}_chr${chr}_SNPs_combrec.recal \
-tranchesFile ${output}_chr${chr}_SNPs_combtranch \
-rscriptFile ${output}_chr${chr}_recal_plots_snps.R
Make the R plots
${Rscript} ${output}_chr${chr}_recal_plots_snps.R
Input:
${output}_chr${chr}_SNPs.vcf.gz
Output:
${output}_chr${chr}_SNPs_filtered.vcf.gz
Code:
$java -Xmx${memoSmall}g -jar ${GATK} -R $fasta
-T ApplyRecalibration
-o ${output}_chr${chr}_SNPs_filtered.vcf.gz \
--ts_filter_level 99.5 \
--recal_file ${output}_chr${chr}_SNPs_combrec.recal --tranches_file ${output}_chr${chr}_SNPs_combtranch --mode SNP \
--input ${output}_chr${chr}_SNPs.vcf.gz
Input:
--variant:SNPs ${output}_chr${chr}_SNPs_filtered.vcf.gz \
--variant:indels ${output}_chr${chr}_indels_filtered.vcf.gz \
Output:
${output}_chr${chr}_filtered.vcf
Code:
$java -Djava.io.tmpdir=${tmpDir} -Xmx${memoSmall}g -jar ${GATK} \
-T CombineVariants --assumeIdenticalSamples \
-R $fasta \
--variant:SNPs ${output}_chr${chr}_SNPs_filtered.vcf.gz \
--variant:indels ${output}_chr${chr}_indels_filtered.vcf.gz \
-genotypeMergeOptions PRIORITIZE \
-priority SNPs,indels \
-o ${output}_chr${chr}_filtered.vcf
Remove intermediate files:
rm ${output}_chr${chr}_indels.vcf.gz ${output}_chr${chr}_SNPs.vcf.gz ${output}_chr${chr}_SNPs_filtered.vcf.gz
Perl script to print out as missing, dodgy looking variants.
perl ${baseFolder}/GATK_v2/custom_filtering.pl ${output}_chr${chr}_filtered.vcf ${output}_chr${chr}_recal_filtered2.vcf ${GQ}
Input:
${output}_chr${chr}_filtered.vcf
Output:
${output}_chr${chr}_recal_filtered2.vcf
If good is false will print as missing:
if ($geno eq "0\/0") {
````````if ($GQ <= $locth) {$good = 0;}
```` } else {
````````my @PLs = split(',', $PL);
````````my $PL0 = $PLs[ 0 ];
````````if ($PL0 <= $locth) {$good = 0;}
```` }
```` #if ($geno eq "1\/1") {$locth = 9;} ## for ALT/ALT calls, a 15 QUAL will do in any case
```` #if ($geno eq "2\/2") {$locth = 9;} ## for ALT/ALT calls, a 15 QUAL will do in any case
```` #if ($geno eq "3\/3") {$locth = 9;} ## for ALT/ALT calls, a 15 QUAL will do in any case
```` if ($ezSNP) { ##if nice clean di-allelic SNP we add an extra filter
````````if ($geno eq "0\/1") {
```````` my @counts = split(',', $AD);
```````` my $total = $counts[0] + $counts[1];
```````` if ($counts[1] > $counts[0] + 1) {$locth = 9;} ##if the alternate count is higher then we can be relax, because the alternative is probably
```````` if ($total >= 10) {
````````````my $ratio = $counts[1]/$total;
````````````my $limit = 0.18;
````````````if ($ratio <= $limit) {$good = 0;}
```````` }
````````}
}
${output}_chr${chr}_recal_filtered2.vcf
${output}_chr${chr}_exome_table.csv
Code:
cut -f1-8 ${output}_chr${chr}_filtered.vcf > ${output}_chr${chr}_for_annovar.vcf
/cluster/project8/vyp/vincent/Software_heavy/annovar_Feb2013/convert2annovar.pl --allallele -format vcf4 --includeinfo ${output}_chr${chr}_for_annovar.vcf > ${output}_chr${chr}_db
/cluster/project8/vyp/vincent/Software_heavy/annovar_Feb2013/summarize_annovar_VP.pl -ver1000g 1000g2012apr -verdbsnp 137 -veresp 6500si -alltranscript -buildver hg19 --genetype ensgene --remove ${output}_chr${chr}_db /cluster/project8/vyp/vincent/Software_heavy/annovar_Feb2013/humandb_hg19/
Custom filtering.
Input:
${output}_chr${chr}_filtered.vcf
Output:
${output}_chr${chr}_recal_filtered2.vcf
Code:
perl ${baseFolder}/GATK_v2/custom_filtering.pl ${output}_chr${chr}_filtered.vcf ${output}_chr${chr}_recal_filtered2.vcf ${GQ}
Combine ANNOVAR and VCF.
Input:
${output}_chr${chr}_recal_filtered2.vcf
${output}_chr${chr}_db.exome_summary.csv
Output:
${output}_chr${chr}_exome_table.csv
Code:
python ${baseFolder}/GATK_v2/annovar_vcf_combine_VP.py ${output}_chr${chr}_recal_filtered2.vcf ${output}_chr${chr}_db.exome_summary.csv ${output}_chr${chr}_exome_table.csv
Extract genotypes: Input:
${output}_chr${chr}_exome_table.csv
Output:
my $callFile = $output."_calls.tab";
my $depthFile = $output."_depth.tab";
my $colnameFile = $output."_colname.tab";
my $rownameFile = $output."_rowname.tab";
my $annotationFile = $output."_annotations.csv";
Code:
perl ${baseFolder}/msample_calling/make_matrix_calls.pl ${output}_chr${chr}_exome_table.csv ${output} $chr
Output:
${output}_snpStats/chr${chr}_snpStats.RData
Code:
R CMD BATCH --no-save --no-restore --chromosome=${chr} --root=${output} ${baseFolder}/msample_calling/convert_to_R.R cluster/R/convert_to_R_chr${chr}.out
What's a crunch?
perl ${baseFolder}/GATK_v2/crunch_controls.pl ${output}_chr${chr}_exome_table.csv $keyWords $casekeyWords ${output}_chr${chr}_exome_crunched.csv data/sampleList_exome.tab none no ##include all samples
Input:
keyWords=data/controlKeywords.tab
casekeyWords=data/caseKeywords.tab
${output}_chr${chr}_exome_table.csv
Output:
${output}_chr${chr}_exome_crunched.csv
Include all samples:
$inputTable = $ARGV[0];
$controlKeywords = $ARGV[1];
$caseKeywords = $ARGV[2];
$outputTable = $ARGV[3];
$sampleList = $ARGV[4];
$excludedControlList = $ARGV[5];
$chooseExternalControlsRandom = $ARGV[6];
crunch_controls.pl ${output}_chr${chr}_exome_table.csv data/controlKeywords.tab data/caseKeywords.tab ${output}_chr${chr}_exome_crunched.csv data/sampleList_exome.tab none no
Output:
save(list = c('combined.matrix.depth', 'combined.snpStats', 'combined.annotations'), file = 'data/poly_in_cases_snpStats.RData')
Code:
crunch_data.R
process_multiVCF.R
sampleExclusion=/cluster/project8/vyp/exome_sequencing_multisamples/mainset/support/exclusion_lists/control_exclusion.tab
head -300 ${output}_chr1 | awk '{ if (\$1 ~ /^#/) print}' > ${output}.vcf
if [ ! -e ${output}_chr${chr} ]; then
echo "Missing file ${output}_chr${chr}"
else
echo "awk '{ if (\$1 !~ /^#/) print}' ${output}_chr${chr} >> ${output}.vcf" >> $mainScript
fi
done
fi