From c694b62a294cafa259cad029ab7631acbc99d61d Mon Sep 17 00:00:00 2001
From: Gaddis
Date: Tue, 3 Mar 2020 15:35:01 -0500
Subject: [PATCH] [#3] Initial commit of genotype array QC specs

---
 genotype_array_qc/README.md | 119 ++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 genotype_array_qc/README.md

diff --git a/genotype_array_qc/README.md b/genotype_array_qc/README.md
new file mode 100644
index 0000000..f22215c
--- /dev/null
+++ b/genotype_array_qc/README.md
@@ -0,0 +1,119 @@
+# Genotype Array QC
+
+## Introduction
+
+This document details the standard analysis workflow for performing QC on data from genotyping arrays. An automated pipeline, developed using WDL, Cromwell, and Docker, is available for this workflow.
+
+This workflow takes plus-strand GRCh37 genotypes in PLINK bed/bim/fam format and produces the following outputs:
+
+1. QCed genotypes in PLINK bed/bim/fam format.
+2. Summary of variants and subjects removed/flagged during each step of the QC pipeline.
+
+The input and output formats are fully described in the appendix of this document.
+
+The steps in this workflow are as follows:
+
+1. Split by chromosome
+2. Convert variants to IMPUTE2 ID format
+3. Remove duplicate IDs (based on call rate)
+4. Merge chromosomes
+5. Flag individuals missing chrX or other chromosome
+6. Remove phenotype info in FAM file
+7. Format phenotype data to standard format
+8. Structure workflow (separate supporting workflow)
+9. Partition data by ancestry
+10. Call rate filter
+11. HWE filter
+12. Subject call rate filter (based on autosomes)
+13. Relatedness workflow (separate supporting workflow)
+14. Remove samples based on relatedness
+15. Sex check and sample removal
+16. Excessive homozygosity filtering
+17. Set het haploids to missing
+
+Each of these steps is described in detail below.
+
+### 1. Split by chromosome
+
+Sample command:
+``` shell
+plink \
+    --bfile [INPUT_BED_BIM_FAM_PREFIX] \
+    --chr [CHR] \
+    --make-bed \
+    --out [OUTPUT_BED_BIM_FAM_PREFIX]
+```
+
+Input Files:
+
+| FILE | DESCRIPTION |
+| --- | --- |
+| `[INPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for input genotypes |
+| `[INPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for input genotypes |
+| `[INPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for input genotypes |
+
+
+Output Files:
+
+| FILE | DESCRIPTION |
+| --- | --- |
+| `[OUTPUT_BED_BIM_FAM_PREFIX].bed` | PLINK format bed file for output genotypes |
+| `[OUTPUT_BED_BIM_FAM_PREFIX].bim` | PLINK format bim file for output genotypes |
+| `[OUTPUT_BED_BIM_FAM_PREFIX].fam` | PLINK format fam file for output genotypes |
+| `[OUTPUT_BED_BIM_FAM_PREFIX].log` | PLINK log file |
+
+
+Parameters:
+
+| PARAMETER | DESCRIPTION |
+| --- | --- |
+| `--bfile [INPUT_BED_BIM_FAM_PREFIX]` | Prefix for input genotypes in PLINK bed/bim/fam format |
+| `--chr [CHR]` | Chromosome to extract (1-26, X, Y, XY, MT) |
+| `--make-bed` | Flag indicating to generate genotypes in PLINK bed/bim/fam format |
+| `--out [OUTPUT_BED_BIM_FAM_PREFIX]` | Prefix for output genotypes in PLINK bed/bim/fam format |
+
+
+### 2. Convert variants to IMPUTE2 ID format
+
+Sample command:
+``` shell
+convert_to_1000g_ids.pl \
+    --file_in [INPUT_BIM_FILE] \
+    --file_out [OUTPUT_BIM_FILE] \
+    --legend [INPUT_1000G_LEGEND_FILE] \
+    --file_in_id_col [ID_COL_NUM] \
+    --file_in_chr_col [CHR_COL_NUM] \
+    --file_in_pos_col [POS_COL_NUM] \
+    --file_in_a1_col [A1_COL_NUM] \
+    --file_in_a2_col [A2_COL_NUM] \
+    --chr [CHR]
+```
+
+Input Files:
+
+| FILE | DESCRIPTION |
+| --- | --- |
+| `[INPUT_BIM_FILE]` | PLINK format bim file |
+| `[INPUT_1000G_LEGEND_FILE]` | IMPUTE2 1000G legend file |
+
+
+Output Files:
+
+| FILE | DESCRIPTION |
+| --- | --- |
+| `[OUTPUT_BIM_FILE]` | PLINK format bim file with IDs in IMPUTE2 format |
+
+
+Parameters:
+
+| PARAMETER | DESCRIPTION |
+| --- | --- |
+| `--file_in [INPUT_BIM_FILE]` | Path of input bim file |
+| `--file_out [OUTPUT_BIM_FILE]` | Path of output bim file |
+| `--legend [INPUT_1000G_LEGEND_FILE]` | Path of IMPUTE2 1000G legend file |
+| `--file_in_id_col [ID_COL_NUM]` | ID column number (zero-based) |
+| `--file_in_chr_col [CHR_COL_NUM]` | Chromosome column number (zero-based) |
+| `--file_in_pos_col [POS_COL_NUM]` | Position column number (zero-based) |
+| `--file_in_a1_col [A1_COL_NUM]` | Allele 1 column number (zero-based) |
+| `--file_in_a2_col [A2_COL_NUM]` | Allele 2 column number (zero-based) |
+| `--chr [CHR]` | Chromosome (1-22, X_NONPAR, PAR1, PAR2) |
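
Editor's sketch (not part of the patch): the step 1 sample command extracts a single chromosome, so it presumably has to be repeated once per chromosome before the per-chromosome ID conversion in step 2. The dry-run helper below only prints the PLINK command for each chromosome it is given, using the same flags as the step 1 sample command. The function name `print_split_commands`, the example prefixes, and the `_chr[CHR]` output naming are assumptions made for illustration; none of them come from the pipeline itself.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one step 1 PLINK command per chromosome.
# Hypothetical helper; the name and the "_chr<CHR>" output suffix are illustrative only.
print_split_commands() {
    in_prefix="$1"    # stands in for [INPUT_BED_BIM_FAM_PREFIX]
    out_prefix="$2"   # base for [OUTPUT_BED_BIM_FAM_PREFIX]
    shift 2
    for chr in "$@"; do
        # Same flags as the step 1 sample command, with one output prefix per chromosome.
        printf 'plink --bfile %s --chr %s --make-bed --out %s_chr%s\n' \
            "$in_prefix" "$chr" "$out_prefix" "$chr"
    done
}

# Example: print the commands for chromosomes 21, 22, and X.
print_split_commands my_input my_output 21 22 X
```

Printing the commands first makes it easy to check the prefixes before anything is written to disk. Piping the output to `sh` would then run them, assuming `plink` is on the `PATH`.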