From 5b9bb22bdf5250a9bdf9e00223d88cdce6775cbf Mon Sep 17 00:00:00 2001 From: Gaddis Date: Wed, 11 Mar 2020 16:31:26 -0400 Subject: [PATCH] [#3] Updated QC workflow README --- genotype_array_qc/README.md | 332 ++++++++++++++++++++++++++++-------- 1 file changed, 260 insertions(+), 72 deletions(-) diff --git a/genotype_array_qc/README.md b/genotype_array_qc/README.md index 2902853..4e0ee79 100644 --- a/genotype_array_qc/README.md +++ b/genotype_array_qc/README.md @@ -4,7 +4,7 @@ This document details the standard analysis workflow for performing QC data from genotyping arrays. An automated pipeline, developed using WDL, Cromwell, and Docker, is available for this workflow. -This workflow takes plus-strand GRCh37 genotypes in PLINK bed/bim/fam format and produces the following outputs: +This workflow takes plus-strand GRCh37 genotypes in PLINK bed/bim/fam format. The fam file should include the sex for each individual. The workflow produces the following outputs: 1. QCed genotypes in PLINK bed/bim/fam format. 2. Summary of variants and subjects removed/flagged during each step of the QC pipeline. @@ -15,15 +15,32 @@ The input and output formats are fully described in the appendix of this documen The steps in this workflow are as follows:
-1. Split by chromosome +1. Add sex to fam file Sample command: ``` shell +# Create sex mapping file from phenotype file +perl -lne ' + BEGIN { + $header = 1; + $fidCol = -1; + $iidCol = -1; + $sexCol = -1; + } + $delimiter = lc("'[PHENO_DELIMITER]'"); + $delimiter = ($delimiter eq "comma") ? "," : (($delimiter eq "tab") ? "\t" : (($delimiter eq "space") ? " " : "")); + @F = split($delimiter); + if ($header) { + foreach $col (@F) { + + } + } +' plink \ --bfile [INPUT_BED_BIM_FAM_PREFIX] \ - --chr [CHR] \ + --update-sex [SEX_FILE] \ --make-bed \ - --out [OUTPUT_BED_BIM_FAM_PREFIX] + --out /shared/data/studies/vidus/observed/processing/ea/vidus.ea.chr23.snp_miss.with_cidr_sexes ``` Input Files: @@ -56,106 +73,258 @@ Parameters:
+
-2. Convert variants to IMPUTE2 ID format +2. Sex check Sample command: ``` shell -convert_to_1000g_ids.pl \ - --file_in [INPUT_BIM_FILE] \ - --file_out [OUTPUT_BIM_FILE] \ - --legend [INPUT_1000G_LEGEND_FILE] \ - --file_in_id_col [ID_COL_NUM] \ - --file_in_chr_col [CHR_COL_NUM] \ - --file_in_pos_col [POS_COL_NUM] \ - --file_in_a1_col [A1_COL_NUM] \ - --file_in_a2_col [A2_COL_NUM] \ - --chr [CHR] +# Run sex check +plink \ + --bfile [INPUT_BED_BIM_FAM_PREFIX] \ + --check-sex \ + --out [OUTPUT_PREFIX] + +# Rename output file +perl -lane 'print join("\t",@F);' [OUTPUT_PREFIX].sexcheck > [OUTPUT_PREFIX].sexcheck.all.tsv + +# Extract subjects not passing sex check +head -n 1 [OUTPUT_PREFIX].sexcheck.all.tsv > [OUTPUT_PREFIX].sexcheck.problems.tsv +grep PROBLEM [OUTPUT_PREFIX].sexcheck.all.tsv >> [OUTPUT_PREFIX].sexcheck.problems.tsv + +# Create remove list +tail -n +2 [OUTPUT_PREFIX].sexcheck.problems.tsv | + perl -lane 'print join("\t", $F[0], $F[1]);' > [OUTPUT_PREFIX].sexcheck.remove.tsv ``` Input Files: | FILE | DESCRIPTION | | --- | --- | -| `[INPUT_BIM_FILE]` | PLINK format bim file | -| `[INPUT_1000G_LEGEND_FILE]` | IMPUTE2 1000G legend file | +| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes | Output Files: | FILE | DESCRIPTION | | --- | --- | -| `[OUTPUT_BIM_FILE]` | PLINK format bim file with IDs in IMPUTE2 format | +| `[OUTPUT_PREFIX].sexcheck.all.tsv` | PLINK sex check output for all subjects | +| `[OUTPUT_PREFIX].sexcheck.problems.tsv` | PLINK sex check output for subjects not passing sex check | +| `[OUTPUT_PREFIX].sexcheck.remove.tsv` | List of subjects not passing sex check that can be fed into PLINK to remove the subjects | Parameters: | PARAMETER | DESCRIPTION | | --- | --- | -| `--file_in [INPUT_BIM_FILE]` | Path of input bim file | -| `--file_out [OUTPUT_BIM_FILE]` | Path of output bim file | -| `--legend [INPUT_1000G_LEGEND_FILE]` | Path of IMPUTE2 1000G legend file | -| `--file_in_id_col [ID_COL_NUM]` | ID column number (zero-based) | -| `--file_in_chr_col [CHR_COL_NUM]` | Chromosome column number (zero-based) | -| `--file_in_pos_col [POS_COL_NUM]` | Position column number (zero-based) | -| `--file_in_a1_col [A1_COL_NUM]` | Allele 1 column number (zero-based) | -| `--file_in_a2_col [A2_COL_NUM]` | Allele 2 column number (zero-based) | -| `--chr [CHR]` | Chromosome (1-22, X_NONPAR, PAR1, PAR2) | +| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format | +| `--sex-check` | Flag indicating that PLINK shoud perform a sex check | +| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format | +
-3. Remove duplicate IDs (based on call rate) +3. Split by chromosome Sample command: ``` shell +plink \ + --bfile [INPUT_BED_BIM_FAM_PREFIX] \ + --chr [CHR] \ + --make-bed \ + --out [OUTPUT_BED_BIM_FAM_PREFIX] ``` Input Files: | FILE | DESCRIPTION | | --- | --- | +| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes | Output Files: | FILE | DESCRIPTION | | --- | --- | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].log` | PLINK log file | Parameters: | PARAMETER | DESCRIPTION | | --- | --- | +| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format | +| `--chr [CHR]` | Chromosome to extract (1-26, X, Y, XY, MT) | +| `--make-bed` | Flag indicating to generate genotypes in PLINK bed/bim/fam format | +| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format | +
-4. Merge chromosomes +4. Convert variants to IMPUTE2 ID format Sample command: ``` shell -for prefix in $([PREFIX_LIST]); do - if [ [FORMAT] == "bed_bim_fam" ]; then - echo $prefix.bed $prefix.bim $prefix.fam - elif [ [FORMAT] == "ped_map" ]; then - echo $prefix.ped $prefix.map - fi -done > $fileMergeList +convert_to_1000g_ids.pl \ + --file_in [INPUT_BIM_FILE] \ + --file_out [OUTPUT_BIM_FILE] \ + --legend [INPUT_1000G_LEGEND_FILE] \ + --file_in_id_col [ID_COL_NUM] \ + --file_in_chr_col [CHR_COL_NUM] \ + --file_in_pos_col [POS_COL_NUM] \ + --file_in_a1_col [A1_COL_NUM] \ + --file_in_a2_col [A2_COL_NUM] \ + --chr [CHR] +``` -plink \ - --merge-list $fileMergeList \ +Input Files: + +| FILE | DESCRIPTION | +| --- | --- | +| `[INPUT_BIM_FILE]` | PLINK format bim file | +| `[INPUT_1000G_LEGEND_FILE]` | IMPUTE2 1000G legend file | + + +Output Files: + +| FILE | DESCRIPTION | +| --- | --- | +| `[OUTPUT_BIM_FILE]` | PLINK format bim file with IDs in IMPUTE2 format | + + +Parameters: + +| PARAMETER | DESCRIPTION | +| --- | --- | +| `--file_in [INPUT_BIM_FILE]` | Path of input bim file | +| `--file_out [OUTPUT_BIM_FILE]` | Path of output bim file | +| `--legend [INPUT_1000G_LEGEND_FILE]` | Path of IMPUTE2 1000G legend file | +| `--file_in_id_col [ID_COL_NUM]` | ID column number (zero-based) | +| `--file_in_chr_col [CHR_COL_NUM]` | Chromosome column number (zero-based) | +| `--file_in_pos_col [POS_COL_NUM]` | Position column number (zero-based) | +| `--file_in_a1_col [A1_COL_NUM]` | Allele 1 column number (zero-based) | +| `--file_in_a2_col [A2_COL_NUM]` | Allele 2 column number (zero-based) | +| `--chr [CHR]` | Chromosome (1-22, X_NONPAR, PAR1, PAR2) | +
+ + +
+5. Remove duplicate variant IDs (based on call rate) and merge chromosomes + +Sample command: +``` shell +# Create merge list based on [BED_FILES], [BIM_FILES], and [FAM_FILES] + +# Get sorted list of all variants +for bim in $(perl -lane 'print $F[1];' [MERGE_LIST]); do + perl -lane 'print $F[1];' $bim +done | sort > tmp.variants.sorted + +# Get sorted list of unique variants +sort -u tmp.variants.sorted > tmp.variants.sorted.unique + +# Get list of duplicate variants +comm -23 tmp.variants.sorted \ + tmp.variants.sorted.unique | + sort -u > \ + tmp.variants.duplicates + +# Append ___X (where X is a unique number for the variant) to the end of the variant IDs for duplicates +for bim in $(perl -lane 'print $F[1];' [MERGE_LIST]); do + perl -i.bak -lane ' + BEGIN { + %duplicates = (); + open(DUPLICATES, "'tmp'.variants.duplicates"); + while () { + chomp; + $duplicates{$_} = 1; + } + close DUPLICATES; + } + if (exists($duplicates{$F[1]})) { + $F[1] = $F[1]."___".($duplicates{$F[1]}++); + } + print join("\t", @F);' $bim +done + +# Merge chromosomes +/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \ + --merge-list [MERGE_LIST] \ + --make-bed \ + --out tmp.with_duplicates + +# Get list of duplicate SNPs +grep ___ tmp.with_duplicates.bim | + perl -lane 'print $F[1];' > tmp.with_duplicates.duplicates + +# Get call rates for duplicate SNPs +/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \ + --bfile tmp.with_duplicates \ + --extract tmp.with_duplicates.duplicates \ + --missing \ + --out tmp.with_duplicates.missing + +# Create remove list containing duplicate with higher missing rate +tail -n +2 tmp.with_duplicates.missing.lmiss | + perl -lane ' + BEGIN { + %missingness = (); + %extract = (); + @variants = (); + } + push(@variants, $F[1]); + $F[1] =~ /^(\S+)___\d+/; + $baseName = $1; + if (exists($missingness{$baseName})) { + if ($F[4] < $missingness{$baseName}) { + $missingness{$baseName} = $F[4]; + $extract{$baseName} = $F[1]; + } + } else { + $missingness{$baseName} = $F[4]; + $extract{$baseName} = $F[1]; + } + END { + foreach $variant (@variants) { + $variant =~ /^(\S+)___\d+/; + $baseName = $1; + if ($extract{$baseName} ne $variant) { + print $variant; + } + } + }' > tmp.with_duplicates.remove + +# Remove duplicates with higher missing rate +/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \ + --bfile tmp.with_duplicates \ + --exclude tmp.with_duplicates.remove \ --make-bed \ --out [OUTPUT_BED_BIM_FAM_PREFIX] -rm $fileMergeList +# Remove numbers from duplicate variant IDs +perl -i.bak -lne 's/___\d+//; print;' [OUTPUT_BED_BIM_FAM_PREFIX].bim + +# Clean up files +rm tmp* ``` Input Files: | FILE | DESCRIPTION | | --- | --- | - +| `[MERGE_LIST]` | PLINK format merge-list (see https://www.cog-genomics.org/plink/1.9/data#merge_list) | +| `[BED_FILES]` | Array of bed files to merge in same order as `[BIM_FILES]` and `[FAM_FILES]` | +| `[BIM_FILES]` | Array of bim files to merge in same order as `[BED_FILES]` and `[FAM_FILES]` | +| `[FAM_FILES]` | Array of fam files to merge in same order as `[BED_FILES]` and `[BIM_FILES]` | Output Files: @@ -171,19 +340,21 @@ Parameters: | PARAMETER | DESCRIPTION | | --- | --- | -| `--prefix_list [PREFIX_LIST]` | List of prefixes of files to be merged | -| `--format [FORMAT]` | Format of files to be merged (bed_bim_fam, ped_map) | +| `--bed_files [BED_FILES]` | Delimited list of bed files to merge in same order as bim and fam files | +| `--bim_files [BIM_FILES]` | Delimited list of bim files to merge in same order as bed and fam files | +| `--fam_files [FAM_FILES]` | Delimited list of fam files to merge in same order as bed and bim files | +| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format |
-5. Flag individuals missing chrX or other chromosome +6. Flag individuals missing chrX or other chromosome
-6. Remove phenotype info in FAM file +7. Remove phenotype info in FAM file Sample command: ``` @@ -214,7 +385,7 @@ Parameters:
-7. Format phenotype data to standard format +8. Format phenotype data to standard format Sample command: ``` shell @@ -240,7 +411,7 @@ Parameters:
-8. Structure workflow (separate supporting workflow) +9. Structure workflow (separate supporting workflow) Sample command: ``` shell @@ -266,7 +437,7 @@ Parameters:
-9. Partition data by ancestry +10. Partition data by ancestry Sample command: ``` shell @@ -309,7 +480,7 @@ Parameters:
-10. Call rate filter +11. Call rate filter Sample command: ``` shell @@ -351,7 +522,7 @@ Parameters:
-11. HWE filter +12. HWE filter Sample command: ``` shell @@ -427,7 +598,7 @@ Parameters:
-12. Set het haploids to missing +13. Set het haploids to missing Sample command: ``` shell @@ -469,7 +640,7 @@ Parameters:
-13. Subject call rate filter (based on autosomes) +14. Subject call rate filter (based on autosomes) Sample command: ``` shell @@ -523,7 +694,7 @@ Parameters:
-14. Relatedness workflow (separate supporting workflow) +15. Relatedness workflow (separate supporting workflow) Sample command: ``` shell @@ -549,97 +720,114 @@ Parameters:
-15. Remove samples based on relatedness (optional) +16. Excessive homozygosity filtering Sample command: ``` shell -plink \ - --bfile [INPUT_BED_BIM_FAM_PREFIX] \ - --remove [REMOVE_LIST] \ - --make-bed \ - --out [OUTPUT_BED_BIM_FAM_PREFIX] ``` Input Files: | FILE | DESCRIPTION | | --- | --- | -| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes | -| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes | -| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes | -| `[REMOVE_LIST]` | List of subjects to remove | Output Files: | FILE | DESCRIPTION | | --- | --- | -| `[OUTPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for output genotypes | -| `[OUTPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for output genotypes | -| `[OUTPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for output genotypes | -| `[OUTPUT_BED_BIM_FAM_PREFIX].log` | PLINK log file | Parameters: | PARAMETER | DESCRIPTION | | --- | --- | -| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format | -| `--remove [REMOVE_LIST]` | List of subjects to remove | -| `--make-bed` | Flag indicating to generate genotypes in PLINK bed/bim/fam format | -| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format |
-16. Sex check and sample removal +17. Remove samples based on relatedness (optional) Sample command: ``` shell +plink \ + --bfile [INPUT_BED_BIM_FAM_PREFIX] \ + --remove [REMOVE_LIST] \ + --make-bed \ + --out [OUTPUT_BED_BIM_FAM_PREFIX] ``` Input Files: | FILE | DESCRIPTION | | --- | --- | +| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes | +| `[REMOVE_LIST]` | List of subjects to remove | Output Files: | FILE | DESCRIPTION | | --- | --- | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].log` | PLINK log file | Parameters: | PARAMETER | DESCRIPTION | | --- | --- | +| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format | +| `--remove [REMOVE_LIST]` | List of subjects to remove | +| `--make-bed` | Flag indicating to generate genotypes in PLINK bed/bim/fam format | +| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format |
-17. Excessive homozygosity filtering +18. Remove samples based on discrepant sex (optional) Sample command: ``` shell +plink \ + --bfile [INPUT_BED_BIM_FAM_PREFIX] \ + --remove [REMOVE_LIST] \ + --make-bed \ + --out [OUTPUT_BED_BIM_FAM_PREFIX] ``` Input Files: | FILE | DESCRIPTION | | --- | --- | +| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes | +| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes | +| `[REMOVE_LIST]` | List of subjects to remove | Output Files: | FILE | DESCRIPTION | | --- | --- | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for output genotypes | +| `[OUTPUT_BED_BIM_FAM_PREFIX].log` | PLINK log file | Parameters: | PARAMETER | DESCRIPTION | | --- | --- | +| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format | +| `--remove [REMOVE_LIST]` | List of subjects to remove | +| `--make-bed` | Flag indicating to generate genotypes in PLINK bed/bim/fam format | +| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format |