v0.8.5

frankvogt · Jan 23, 2022 · f6a6e32
1 parent 4f95080
commit f6a6e32

Showing 9 changed files with 684 additions and 319 deletions.
diff --git a/MANUAL.md b/MANUAL.md
@@ -127,6 +127,32 @@ or select all phenotypes in the phenotype file at once utilizing the `-ap/--allp
 vcf2gwas -v [filename] -pf [filename] -ap -lmm
 ```
 
+### Transforming phenotype values
+
+vcf2gwas can transform the phenotype values in the phenotype file(s). The selected metric (default: 'wisconsin') is applied across rows.  
+To transform the phenotypes, use the `-t/--transform` option; to change the metric to one of the other supported metrics, add it as an argument to the option:
+
+```
+vcf2gwas -v [filename] -pf [filename1] -p [int] -lmm -t hellinger
+```
+
+Now vcf2gwas transforms the phenotypes according to the hellinger metric.  
+The following metrics are available:
+* *total*: Divides each observation by row sum
+* *max*: Divides each observation by row max
+* *normalize*: Chord transformation, also known as Euclidean normalization, making the length of each row 1
+* *range*: Rescales the data to the range 0 to 1
+* *standardize*: Standardizes each observation (i.e. z-score)
+* *hellinger*: Square-root of the total transformation
+* *log*: Returns ln(x+1)
+* *logp1*: Returns ln(x) + 1, if x > 0. Otherwise returns 0
+* *pa*: Converts data to binary absence (0) / presence (1) data
+* *wisconsin*: First divides an observation by the max of the column, then the sum of the row. That is, it applies ‘max’ down columns then ‘total’ across rows
+
+These functions are taken from the [ecopy](https://ecopy.readthedocs.io/en/latest/index.html) package.
+
+**Note**: If desired, one can also use the [dimensionality reduction](#using-dimensionality-reduction-of-phenotypes-for-analysis) options in conjunction with the transformation. vcf2gwas will first transform the phenotypes and then reduce the dimensionality of the transformed phenotypes according to the chosen method and use these results as phenotypes.
+
 ### Adding covariates
 
 GEMMA supports adding covariates to the linear model and the linear mixed model analysis.  
@@ -195,7 +221,7 @@ If the file is in the `.csv` format, the file needs at least three columns conta
 *vcf2gwas* recognizes chromosomes in the following formats (here the first chromosome): `Chr1`, `chr1`, `1`.  
 If the chromosomes in the `VCF` file are of a different format, it is necessary that the chromosome information in the gene file is formatted in the same way, otherwise *vcf2gwas* won't recognize the information correctly.  
 
-*vcf2gwas* will summarize the n best SNPs (specified with `-t/--topsnp`) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with `-gt/--genethresh`).  
+*vcf2gwas* will summarize the n best SNPs (specified with `-ts/--topsnp`) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with `-gt/--genethresh`).  
 
 **Note**: Since for each SNP only the gene with the closest start/end upstream and downstream is shown, this feature only serves to give a hint of possibly associated genes. Closer inspection by the user is strongly recommended.
 
@@ -291,13 +317,13 @@ vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -um manhattan
 
 The manhattan metric is now used to calculate the UMAP embeddings.  
 The following is a list of the available metrics:  
-* euclidean
-* manhattan
-* braycurtis
-* cosine
-* hamming
-* jaccard
-* hellinger
+* *euclidean*
+* *manhattan*
+* *braycurtis*
+* *cosine*
+* *hamming*
+* *jaccard*
+* *hellinger*
 
 #### Using PCs or UMAP embeddings as covariates
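
The metric descriptions added to MANUAL.md in this diff can be made concrete with a small sketch. The following is not the ecopy implementation and not vcf2gwas code; it is a pure-Python illustration of a few of the listed metrics, following the wording in the manual. All function names are hypothetical, and it assumes non-negative values with no missing data and no all-zero rows or columns.

```python
import math

# Hypothetical helper names; rows are lists of non-negative floats.

def total(rows):
    """'total': divide each observation by its row sum."""
    return [[x / sum(row) for x in row] for row in rows]

def row_max(rows):
    """'max': divide each observation by its row maximum."""
    return [[x / max(row) for x in row] for row in rows]

def hellinger(rows):
    """'hellinger': square root of the 'total' transformation."""
    return [[math.sqrt(x) for x in row] for row in total(rows)]

def presence_absence(rows):
    """'pa': 1 where a value is present (assumed here to mean > 0), else 0."""
    return [[1 if x > 0 else 0 for x in row] for row in rows]

def wisconsin(rows):
    """'wisconsin': divide by the column maximum, then by the row sum."""
    col_max = [max(col) for col in zip(*rows)]
    scaled = [[x / m for x, m in zip(row, col_max)] for row in rows]
    return total(scaled)

# Two individuals (rows) with two phenotypes (columns) each.
phenotypes = [[1.0, 3.0], [2.0, 2.0]]
transformed = wisconsin(phenotypes)
```

Because 'wisconsin' uses column maxima, its result for one row depends on the other rows in the file, whereas 'total', 'max', and 'hellinger' depend only on the row itself.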
 

diff --git a/README.md b/README.md
@@ -270,7 +270,7 @@ if not specified, all available logical cores minus 1 will be used
 minimum allele frequency of sites to be used (default: 0.01)  
 input value needs to be a value between 0.0 and 1.0
 
-* `-t` / `--topsnp` <int>  
+* `-ts` / `--topsnp` <int>  
 number of top SNPs of each phenotype to be summarized (default: 15)  
 after analysis the specified amount of top SNPs from each phenotype will be considered
 
@@ -289,6 +289,12 @@ choose the metric for UMAP to use to compute the distances in high dimensional s
 Default: euclidean  
 Available metrics: euclidean, manhattan, braycurtis, cosine, hamming, jaccard, hellinger
 
+* `-t` / `--transform` <str>  
+transform the input phenotype file  
+applies the selected metric across rows  
+Default: wisconsin  
+Available metrics: total, max, normalize, range, standardize, hellinger, log, logp1, pa, wisconsin
+
 * `-asc` / `--ascovariate`
 Use dimensionality reduction of phenotype file via UMAP or PCA as covariates  
 Only works in conjunction with `-U` / `--UMAP` or `-P` / `--PCA`
@@ -340,32 +346,34 @@ The exemplary directory and file structure of the output folder after running a
 output/
 └── 'model'
     ├── 'phenotype'
-    │   ├── QQ
-    │   │   └── QQ plot figure (.png)
-    │   ├── GEMMA output file (.txt)
-    │   ├── GEMMA log file (.txt)
-    │   ├── best_p-values
-    │   │   ├── top 1% variants (.csv)
-    │   │   ├── top 0.1% variants (.csv)
-    │   │   └── top 0.01 variants (.csv)
-    │   └── manhattan
-    │       └── manhattan plot figure (.png)
+    │   ├── QQ
+    │   │   └── QQ plot figure (.png)
+    │   ├── summary file (.txt)
+    │   ├── GEMMA output file (.txt)
+    │   ├── GEMMA log file (.txt)
+    │   ├── best_p-values
+    │   │   ├── top 1% variants (.csv)
+    │   │   ├── top 0.1% variants (.csv)
+    │   │   └── top 0.01 variants (.csv)
+    │   ├── manhattan
+    │   │   └── manhattan plot figure (.png)
+    │   └── significant SNP summary file (.csv)
     ├── files
-    │   └── files_'file'
-    │       ├── PLINK BED files (.bed, .bim, .fam, .nosex)
-    │       ├── PLINK log file (.log)
-    │       ├── GEMMA relatedness matrix (.txt)
-    │       └── GEMMA log file (.log.txt)
+    │   └── files_'file'
+    │       ├── PLINK BED files (.bed, .bim, .fam, .nosex)
+    │       ├── PLINK log file (.log)
+    │       ├── GEMMA relatedness matrix (.txt)
+    │       └── GEMMA log file (.log.txt)
     ├── logs
-    │   └── analysis log file (.txt)
+    │   └── analysis log file (.txt)
     ├── QC
-    │   ├── phenotype QC plot (.png)
+    │   ├── phenotype QC plot (.png)
     │   └── genotype QC plots (.png)
-    ├── vcf2gwas log file (.txt)
-    └── summary
-        ├── summarized top SNPs (.csv)
-        └── top_SNPs
-            └── phenotype top SNPs (.csv)
+    ├── summary
+    │   ├── summarized top SNPs (.csv)
+    │   └── top_SNPs
+    │       └── phenotype top SNPs (.csv)
+    └── vcf2gwas log file (.txt)
 ```
 
 The names of the directories in quotes as well as the file names will vary based on the selected options and the file and phenotype names.
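
The option change in README.md above (`--topsnp` moves from `-t` to `-ts`, and `-t` now selects `--transform`) could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the parser vcf2gwas actually uses: `build_parser` and `TRANSFORM_METRICS` are hypothetical names, and the choice of `nargs="?"` with `const="wisconsin"` is one guess at how "default: wisconsin" is realized when `-t` is given without an argument.

```python
import argparse

# Metric names copied from the README entry for -t / --transform.
TRANSFORM_METRICS = [
    "total", "max", "normalize", "range", "standardize",
    "hellinger", "log", "logp1", "pa", "wisconsin",
]

def build_parser():
    """Hypothetical parser covering only the two options touched by this diff."""
    parser = argparse.ArgumentParser(prog="vcf2gwas-sketch")
    # Renamed short flag: -ts instead of -t.
    parser.add_argument("-ts", "--topsnp", type=int, default=15,
                        help="number of top SNPs of each phenotype to be summarized")
    # -t alone selects 'wisconsin' (assumption); omitting -t disables transformation.
    parser.add_argument("-t", "--transform", nargs="?", const="wisconsin",
                        default=None, choices=TRANSFORM_METRICS,
                        help="transform the input phenotype file")
    return parser

args = build_parser().parse_args(["-t", "hellinger", "-ts", "5"])
```

`argparse` matches an exact option string before trying prefixes, so `-ts` and `-t` can coexist. Scripts written for v0.8.4 that pass `-t 15` to set the number of top SNPs would need updating to `-ts 15`.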

diff --git a/setup.py b/setup.py
@@ -17,7 +17,7 @@
 
 setup(
     name='vcf2gwas',
-    version='0.8.4',
+    version='0.8.5',
     description="Python API for comprehensive GWAS analysis using GEMMA",
     license="GNUv3",
     author="Frank Vogt",

diff --git a/vcf2gwas/README.pdf b/vcf2gwas/README.pdf
diff --git a/vcf2gwas/__main__.py b/vcf2gwas/__main__.py
@@ -31,7 +31,8 @@
 
 def main(argvals=argvals):
 
-    print("\nvcf2gwas v0.8.4 \n")
+    version = set_version_number()
+    print(f"\nvcf2gwas v{version} \n")
     print("Initialising..\n")
     P = Parser(argvals)
     args = sys.argv[1:]
@@ -51,11 +52,21 @@ def main(argvals=argvals):
     covar = P.set_covar()
     if lm == None and lmm == None:
         if covar != None:
-            sys.exit(print("Error: A covariate file can only be added when using the linear model ('-lm') or the linear mixed model ('-lmm')"))
+            msg = "A covariate file can only be added when using the linear model ('-lm') or the linear mixed model ('-lmm')"
+            raise SyntaxError(msg)
 
-    subprocess.run(args)
+    process = subprocess.run(args)
 
-    shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)
+    if process.returncode != 0:
+        shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)
+
+    return process.returncode
 
 if __name__ == '__main__':
-    sys.exit(main())
+    try:
+        sys.exit(main())
+    except KeyboardInterrupt as e:
+        print("\nvcf2gwas interrupted")
+        shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)
+        print("Cleaned up temporary files\n")
+        sys.exit(1)
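
The control flow introduced in `__main__.py` above (propagate the child's return code, remove the temporary directory on failure or on Ctrl+C) can be condensed into one helper. The sketch below is illustrative only: `run_and_clean` is a hypothetical name, and, like the diff, it leaves the directory in place after a successful run, on the assumption that cleanup in that case happens elsewhere.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

def run_and_clean(cmd, temp_dir):
    """Run cmd; remove temp_dir if the command fails or the user interrupts.

    Returns the child's return code, or 1 on KeyboardInterrupt.
    """
    try:
        process = subprocess.run(cmd)
    except KeyboardInterrupt:
        shutil.rmtree(temp_dir, ignore_errors=True)
        return 1
    if process.returncode != 0:
        # Failed run: discard intermediate files.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return process.returncode

# Example: a child process that exits with status 3.
failing_dir = Path(tempfile.mkdtemp())
failing_code = run_and_clean(
    [sys.executable, "-c", "import sys; sys.exit(3)"], failing_dir
)
```

Returning the code instead of calling `sys.exit` inside the helper keeps it testable; the caller can still write `sys.exit(run_and_clean(...))`.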