v0.8.5
frankvogt committed Jan 23, 2022
1 parent 4f95080 commit f6a6e32
Showing 9 changed files with 684 additions and 319 deletions.
42 changes: 34 additions & 8 deletions MANUAL.md
@@ -127,6 +127,32 @@ or select all phenotypes in the phenotype file at once utilizing the `-ap/--allp
vcf2gwas -v [filename] -pf [filename] -ap -lmm
```

### Transforming phenotype values

vcf2gwas can transform the phenotype values in the phenotype file(s). The selected metric (default: 'wisconsin') is applied across rows.
To transform the phenotypes, use the `-t/--transform` option. To select a metric other than the default, pass its name as an argument to the option:

```
vcf2gwas -v [filename] -pf [filename1] -p [int] -lmm -t hellinger
```

vcf2gwas now transforms the phenotypes according to the hellinger metric.
The following metrics are available:
* *total*: Divides each observation by row sum
* *max*: Divides each observation by row max
* *normalize*: Chord transformation, also euclidean normalization, making the length of each row 1
* *range*: Rescales the data to the range 0 to 1
* *standardize*: Standardizes each observation (i.e. z-score)
* *hellinger*: Square-root of the total transformation
* *log*: Returns ln(x+1)
* *logp1*: Returns ln(x) + 1, if x > 0. Otherwise returns 0
* *pa*: Converts data to binary absence (0) presence (1) data
* *wisconsin*: First divides an observation by the max of the column, then the sum of the row. That is, it applies ‘max’ down columns then ‘total’ across rows

These functions are taken from the [ecopy](https://ecopy.readthedocs.io/en/latest/index.html) package.
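
As a rough illustration of what some of these row-wise metrics compute, the following pure-Python sketch implements *total*, *hellinger* and *wisconsin* as described in the list above. It is neither the ecopy code nor necessarily how vcf2gwas applies it: the function names and the example values are assumptions made for this sketch, and every row and column is assumed to contain positive values.

```python
import math


def total(rows):
    """Divide each observation by the sum of its row."""
    result = []
    for row in rows:
        row_sum = sum(row)  # assumed to be positive in this sketch
        result.append([x / row_sum for x in row])
    return result


def hellinger(rows):
    """Square root of the 'total' transformation."""
    return [[math.sqrt(x) for x in row] for row in total(rows)]


def column_max(rows):
    """Divide each observation by the maximum of its column."""
    maxima = [max(col) for col in zip(*rows)]
    return [[x / m for x, m in zip(row, maxima)] for row in rows]


def wisconsin(rows):
    """Apply 'max' down columns, then 'total' across rows."""
    return total(column_max(rows))


# Hypothetical phenotype table: two individuals (rows), two phenotypes (columns).
phenotypes = [[1.0, 3.0], [2.0, 2.0]]

if __name__ == "__main__":
    print(total(phenotypes))
    print(hellinger(phenotypes))
    print(wisconsin(phenotypes))
```

A quick sanity check for such an implementation: every row of the *total* and *wisconsin* results sums to 1, and the squared values of every *hellinger* row sum to 1.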

**Note**: If desired, one can also use the [dimensionality reduction](#using-dimensionality-reduction-of-phenotypes-for-analysis) options in conjunction with the transformation. vcf2gwas will first transform the phenotypes and then reduce the dimensionality of the transformed phenotypes according to the chosen method and use these results as phenotypes.

### Adding covariates

GEMMA supports adding covariates to the linear model and the linear mixed model analysis.
@@ -195,7 +221,7 @@ If the file is in the `.csv` format, the file needs at least three columns conta
*vcf2gwas* recognizes chromosomes in the following formats (here the first chromosome): `Chr1`, `chr1`, `1`.
If the chromosomes in the `VCF` file are of a different format, it is necessary that the chromosome information in the gene file is formatted in the same way, otherwise *vcf2gwas* won't recognize the information correctly.

*vcf2gwas* will summarize the n best SNPs (specified with `-t/--topsnp`) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with `-gt/--genethresh`).
*vcf2gwas* will summarize the n best SNPs (specified with `-ts/--topsnp`) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with `-gt/--genethresh`).

**Note**: Since for each SNP only the gene with the closest start/end upstream and downstream is shown, this feature only serves to give a hint of possibly associated genes. Closer inspection by the user is strongly recommended.
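
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions, not the vcf2gwas implementation: the function name `nearest_genes`, the `(name, start, end)` gene tuples and the `max_dist` parameter are hypothetical, all genes are assumed to lie on the same chromosome as the SNP, "upstream" is simplified to mean the lower-coordinate side regardless of strand, and genes overlapping the SNP are ignored.

```python
def nearest_genes(snp_pos, genes, max_dist=None):
    """Return the closest gene on each side of a SNP as (name, distance) or None.

    genes: iterable of (name, start, end) tuples on the SNP's chromosome.
    max_dist: optional threshold; only distances strictly lower are kept.
    """
    upstream = None    # closest gene ending at or before the SNP
    downstream = None  # closest gene starting at or after the SNP
    for name, start, end in genes:
        if end <= snp_pos:
            dist = snp_pos - end
            if upstream is None or dist < upstream[1]:
                upstream = (name, dist)
        elif start >= snp_pos:
            dist = start - snp_pos
            if downstream is None or dist < downstream[1]:
                downstream = (name, dist)
    if max_dist is not None:
        # Mimic a distance threshold: drop hits that are too far away.
        if upstream is not None and upstream[1] >= max_dist:
            upstream = None
        if downstream is not None and downstream[1] >= max_dist:
            downstream = None
    return upstream, downstream


# Hypothetical gene annotations and SNP position.
genes = [("g1", 100, 200), ("g2", 500, 600), ("g3", 900, 1000)]

if __name__ == "__main__":
    print(nearest_genes(450, genes))
    print(nearest_genes(450, genes, max_dist=100))
```

Because only one gene per side is reported, a second gene that lies only slightly farther away is not shown, which is why the note above recommends closer inspection.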

@@ -291,13 +317,13 @@ vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -um manhattan

The manhattan metric is now used to calculate the UMAP embeddings.
The following is a list of the available metrics:
* euclidean
* manhattan
* braycurtis
* cosine
* hamming
* jaccard
* hellinger
* *euclidean*
* *manhattan*
* *braycurtis*
* *cosine*
* *hamming*
* *jaccard*
* *hellinger*

#### Using PCs or UMAP embeddings as covariates

54 changes: 31 additions & 23 deletions README.md
@@ -270,7 +270,7 @@ if not specified, all available logical cores minus 1 will be used
minimum allele frequency of sites to be used (default: 0.01)
input value needs to be a value between 0.0 and 1.0

* `-t` / `--topsnp` <int>
* `-ts` / `--topsnp` <int>
number of top SNPs of each phenotype to be summarized (default: 15)
after analysis the specified amount of top SNPs from each phenotype will be considered

@@ -289,6 +289,12 @@ choose the metric for UMAP to use to compute the distances in high dimensional s
Default: euclidean
Available metrics: euclidean, manhattan, braycurtis, cosine, hamming, jaccard, hellinger

* `-t` / `--transform` <str>
transform the input phenotype file
applies the selected metric across rows
Default: wisconsin
Available metrics: total, max, normalize, range, standardize, hellinger, log, logp1, pa, wisconsin

* `-asc` / `--ascovariate`
Use dimensionality reduction of phenotype file via UMAP or PCA as covariates
Only works in conjunction with `-U` / `--UMAP` or `-P` / `--PCA`
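
The renamed `-ts` / `--topsnp` option and the new `-t` / `--transform` option could be declared roughly as in the sketch below. It uses Python's standard `argparse` module and is only an assumption about how such options can be modelled, not the actual vcf2gwas parser: the program name, the `build_parser` helper and the choice of `nargs="?"` with `const="wisconsin"` (so that a bare `-t` falls back to the documented default) are hypothetical.

```python
import argparse

# Metric names as listed in the option description above.
TRANSFORM_METRICS = [
    "total", "max", "normalize", "range", "standardize",
    "hellinger", "log", "logp1", "pa", "wisconsin",
]


def build_parser():
    """Build a minimal parser with only the two options touched here."""
    parser = argparse.ArgumentParser(prog="vcf2gwas-sketch")
    parser.add_argument(
        "-ts", "--topsnp", type=int, default=15,
        help="number of top SNPs of each phenotype to be summarized",
    )
    parser.add_argument(
        "-t", "--transform", nargs="?", const="wisconsin", default=None,
        choices=TRANSFORM_METRICS,
        help="transform the input phenotype file (default metric: wisconsin)",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["-t", "hellinger", "-ts", "5"]))
```

With this layout, omitting `-t` leaves the phenotypes untransformed, `-t` alone selects wisconsin, and `-t hellinger` selects the named metric. Scripts that previously passed `-t <int>` for the top-SNP count would need to switch to `-ts`.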
@@ -340,32 +346,34 @@ The exemplary directory and file structure of the output folder after running a
output/
└── 'model'
├── 'phenotype'
│   ├── QQ
│   │   └── QQ plot figure (.png)
│   ├── GEMMA output file (.txt)
│   ├── GEMMA log file (.txt)
│   ├── best_p-values
│   │   ├── top 1% variants (.csv)
│   │   ├── top 0.1% variants (.csv)
│   │   └── top 0.01 variants (.csv)
│   └── manhattan
│   └── manhattan plot figure (.png)
│ ├── QQ
│ │ └── QQ plot figure (.png)
│ ├── summary file (.txt)
│ ├── GEMMA output file (.txt)
│ ├── GEMMA log file (.txt)
│ ├── best_p-values
│ │ ├── top 1% variants (.csv)
│ │ ├── top 0.1% variants (.csv)
│ │ └── top 0.01 variants (.csv)
│ ├── manhattan
│ │ └── manhattan plot figure (.png)
│ └── significant SNP summary file (.csv)
├── files
   └── files_'file'
   ├── PLINK BED files (.bed, .bim, .fam, .nosex)
   ├── PLINK log file (.log)
   ├── GEMMA relatedness matrix (.txt)
   └── GEMMA log file (.log.txt)
└── files_'file'
├── PLINK BED files (.bed, .bim, .fam, .nosex)
├── PLINK log file (.log)
├── GEMMA relatedness matrix (.txt)
└── GEMMA log file (.log.txt)
├── logs
   └── analysis log file (.txt)
└── analysis log file (.txt)
├── QC
   ├── phenotype QC plot (.png)
├── phenotype QC plot (.png)
│ └── genotype QC plots (.png)
├── vcf2gwas log file (.txt)
── summary
── summarized top SNPs (.csv)
└── top_SNPs
└── phenotype top SNPs (.csv)
├── summary
│ ├── summarized top SNPs (.csv)
── top_SNPs
└── phenotype top SNPs (.csv)
└── vcf2gwas log file (.txt)
```

The names of the directories in quotes as well as the file names will vary based on the selected options and the file and phenotype names.
2 changes: 1 addition & 1 deletion setup.py
@@ -17,7 +17,7 @@

setup(
    name='vcf2gwas',
    version='0.8.4',
    version='0.8.5',
    description="Python API for comprehensive GWAS analysis using GEMMA",
    license="GNUv3",
    author="Frank Vogt",
Binary file modified vcf2gwas/README.pdf
Binary file not shown.
21 changes: 16 additions & 5 deletions vcf2gwas/__main__.py
@@ -31,7 +31,8 @@

def main(argvals=argvals):

    print("\nvcf2gwas v0.8.4 \n")
    version = set_version_number()
    print(f"\nvcf2gwas v{version} \n")
    print("Initialising..\n")
    P = Parser(argvals)
    args = sys.argv[1:]
@@ -51,11 +52,21 @@ def main(argvals=argvals):
    covar = P.set_covar()
    if lm == None and lmm == None:
        if covar != None:
            sys.exit(print("Error: A covariate file can only be added when using the linear model ('-lm') or the linear mixed model ('-lmm')"))
            msg = "A covariate file can only be added when using the linear model ('-lm') or the linear mixed model ('-lmm')"
            raise SyntaxError(msg)

    subprocess.run(args)
    process = subprocess.run(args)

    shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)
    if process.returncode != 0:
        shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)

    return process.returncode

if __name__ == '__main__':
    sys.exit(main())
    try:
        sys.exit(main())
    except KeyboardInterrupt as e:
        print("\nvcf2gwas interrupted")
        shutil.rmtree("_vcf2gwas_temp", ignore_errors=True)
        print("Cleaned up temporary files\n")
        sys.exit(1)
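
The pattern this change appears to introduce (propagate the child's return code, and remove the temporary directory when the child fails or the user interrupts) can be condensed into a single helper. The sketch below is an assumption-laden illustration, not code from the commit: `run_and_cleanup` and its parameters are hypothetical names, and the interrupt is handled inside the helper rather than in the `__main__` guard.

```python
import shutil
import subprocess
import sys


def run_and_cleanup(cmd, temp_dir="_vcf2gwas_temp"):
    """Run cmd, return its exit code, and remove temp_dir on failure or interrupt."""
    try:
        process = subprocess.run(cmd)
    except KeyboardInterrupt:
        # Ctrl+C while the child is running: clean up and report failure.
        shutil.rmtree(temp_dir, ignore_errors=True)
        return 1
    if process.returncode != 0:
        # The child failed, so remove whatever it left behind.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return process.returncode


if __name__ == "__main__":
    # Hypothetical child command that exits with status 3.
    code = run_and_cleanup([sys.executable, "-c", "import sys; sys.exit(3)"])
    sys.exit(code)
```

Returning the child's exit code instead of discarding it lets shell scripts and workflow managers detect a failed run. On a successful run the helper leaves the directory untouched, on the assumption that the child process removes it itself.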