added usage description to README. added more keywords to setup. adde…

…d back the de novo files and co-segregate variant file inputs.
ding-lab · Apr 9, 2017 · c472f97 · c472f97
1 parent 79228a5
commit c472f97
Show file tree

Hide file tree

Showing 4 changed files with 158 additions and 24 deletions.
diff --git a/README.md b/README.md
@@ -21,3 +21,121 @@ Run:
 For detailed help/support, email Adam:
 
 	[email protected]
+
+## Usage Details
+### Input data
+	-m Standard .maf
+	-f Standard .vcf
+	-T Custom .tsv
+Variant data may be input via at least one variant file. 
+This means that if variants are spread across several files, then you can input one of each type. 
+For the .maf and .tsv, use the custom columns to determine which columns to use. 
+Note that a standard .maf does not include protein annotations. 
+Use the custom column for the peptide change column. 
+If your .vcf has VEP annotations, then CharGer should be able to parse the information. 
+This information will be added to your variants when available.
+
+### Output
+	-o output file
+	-w output as HTML (flag)
+	-k annotate input (flag)
+	--run-url-test test url when creating links
+Name your output file; otherwise it will be called charger_summary.tsv. 
+You can opt to make the output into an HTML page, instead of a readable .tsv. 
+If you need to be assured of properly linked URL's, use the url test flag. 
+
+### Access data
+	-l ClinVar (flag)
+	-x ExAC (flag)
+	-E VEP (flag)
+	-t TCGA cancer types (flag)
+Using these flags turns on accession features built in. 
+For the ClinVar, ExAC, and VEP flags, if no local VEP or databse is provided, then BioMine will be used to access the ReST interface. 
+The TCGA flag allows disease determination from sample barcodes in a .maf when using a diseases file (see below). 
+
+### Suppress data or overrides
+	-O override with ClinVar description (flag)
+	-D suppress needing disease specific (flag)
+You can have CharGer override its pathogenic characterization with whatever ClinVar has. 
+Suppressing disease specific variants takes any variants in the diseases file (see below) and treats them as equally pathogenic without disease consideration.
+
+### Cross-reference data
+	-z pathogenic variants, .vcf
+	-e expression matrix file, .tsv
+	-g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt
+	-d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv
+	-n de novo file, standard .maf
+	-a assumed de novo file, standard .maf
+	-c co-segregation file, standard .maf
+	-H HotSpot3D clusters file, .clusters
+	-r recurrence threshold (default = 2)
+Variants or genes from each of these files can be used as additional known information. 
+An expression matrix file has columns for each sample, and its rows are genes. 
+The genes should be approved HUGO symbols. 
+HotSpot3D clusters can be used for versions v1.x.x. 
+The recurrence threshold will be pulled from the recurrence/weight column of the .clusters file when provided.
+
+### Local VEP
+	--vep-script Path to VEP
+	--vep-dir Path to VEP directory
+	--vep-cache Path to VEP cache directory
+	--vep-version VEP version (default = 87)
+	--vep-output VEP output file (default = charger.vep.vcf)
+	--grch assembly GRCh verion (default = 37)
+	--ensembl-release Ensembl release version (default = 75)
+	--reference-fasta VEP reference fasta
+	--fork Number of forked processes used in VEP (default = 0) 
+This currently only works with .vcf input only. 
+Annotations are run with the VEP everything flag, so any local plugins will be used. 
+The BioMine accession is also suppressed when using a local VEP installaltion. 
+The VEP directory is not the same as would be given to VEP's --dir option. 
+Instead it is the path to the directory with the VEP .pl file. 
+The VEP script is the .pl file only. 
+If not given, it will be /vep-dir/variant\_effect\_predictor.pl. 
+The VEP cache directory is the same as would be given to VEP's --dir-cache option. 
+If you have multiple VEP versions, then specify the version you want to use. 
+This can be different from the Ensembl release option. 
+VEP output is the same os would be given to VEP's -o option and should end with .vcf. 
+The default output file will be called charger.vep.vcf. 
+The GRCh reference genome can be set to either 37 or 38. 
+The reference Fasta file will be deteremined automatically if not specified. 
+If the reference Fasta file is constructed automatically, then if, for example, the VEP chache is ~/.vep/, the Ensembl release is 74, and the reference assembly is 37, then the reference Fasta file will be ~/.vep/homo\_sapiens/74\_GRCH37/Homo\_sapiens.GRCh37.74.dna.primary\_assembly.fa.gz.  
+
+### Local databases
+	--exac-vcf ExAC vcf.gz
+	--mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz)
+Using local databases suppresses the BioMine accession too. 
+These files can be downloaded from their respective sites.
+
+### Filters
+	--rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency):
+	--vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag)
+	--mutation-types Comma delimited list of types to allow
+Using filters will limit the variants processed. 
+The rare option takes variants with allele frequency less than the given value. 
+The .vcf any filter accepts only variants that have passed all filters. 
+If no .vcf pass filter status given, the .vcf null value will be taken as having passed. 
+Mutation types filtering requires a comma delimitted list (no spaces) using terms from Ensembl's consequence terms.
+
+### ReST batch sizes
+	-v VEP (#variants, default/max allowed = 150)
+	-b ClinVar summary (#variants, default/max allowed = 500)
+	-B ClinVar searchsize (#variants, default/max allowed = 50)
+ReST API's usually have limits on the amount of data sent or received. 
+Exceeding these batch sizes would normally lead to warnings and/or IP blockage, but CharGer and BioMine try to keep batches at safe sizes. Last updated limits February 2017.
+
+### Custom columns (0-based)
+	-G HUGO gene symbol
+	-X chromosome
+	-S start position
+	-P stop position
+	-R reference allele
+	-A alternate allele
+	-s strand
+	-M sample name
+	-C codon
+	-p peptide change
+	-L variant classification
+	-F allele frequency
+Use these for .tsv and/or .maf input variant files to specify columns of relevant data. 
+CharGer makes use of genomic and protein variant annotations, so the more data made available the better your results.
diff --git a/bin/charger b/bin/charger
@@ -1,15 +1,16 @@
 #!/bin/python
 # CharGer - Characterization of Germline variants
 # author: Adam D Scott ([email protected]) & Kuan-lin Huang ([email protected])
-# version: v0.1.0 - 2015*12
+# version: v0.1.0 - 2016*04
 
 import sys
 import getopt
 from charger import charger
 import time
 
 def parseArgs( argv ):
-	helpText = "Usage: "
+	helpText = "\nCharGer - v0.1.0\n\n"
+	helpText += "Usage: "
 	helpText += "charger <input file> [options]\n\n"
 	helpText += "Accepted input data files:\n"
 	helpText += "  -m Standard .maf\n"
@@ -29,37 +30,37 @@ def parseArgs( argv ):
 	helpText += "  -O override with ClinVar description (flag)\n"
 	helpText += "  -D suppress needing disease specific (flag)\n"
 	helpText += "Cross-reference data files:\n"
-	helpText += "  -z pathogenic variants .vcf\n"
-	helpText += "  -e expression matrix file .tsv\n"
-	helpText += "  -g gene list file .txt\n"
-	helpText += "  -d diseases file (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n"
-	helpText += "  -n de novo file .?\n"
-	helpText += "  -a assumed de novo file .?\n"
-	helpText += "  -c co-segregation file .?\n"
-	helpText += "  -H HotSpot3D clusters file .clusters\n"
-	helpText += "  -r recurrence threshold (default = )\n"
+	helpText += "  -z pathogenic variants, .vcf\n"
+	helpText += "  -e expression matrix file, .tsv\n"
+	helpText += "  -g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt\n"
+	helpText += "  -d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n"
+	helpText += "  -n de novo file, standard .maf\n"
+	helpText += "  -a assumed de novo file, standard .maf\n"
+	helpText += "  -c co-segregation file, standard .maf\n"
+	helpText += "  -H HotSpot3D clusters file, .clusters\n"
+	helpText += "  -r recurrence threshold (default = 2)\n"
 	helpText += "Local VEP (works with .vcf input only; suppresses ReST too):\n"
 	helpText += "  --vep-script Path to VEP\n"
 	helpText += "  --vep-dir Path to VEP directory\n"
 	helpText += "  --vep-cache Path to VEP cache directory\n"
 	helpText += "  --vep-version VEP version (default = 87)\n"
 	helpText += "  --vep-output VEP output file (default = charger.vep.vcf)\n"
 	helpText += "  --grch assembly GRCh verion (default = 37)\n"
-	helpText += "  --ensembl-release Ensembl release version (default = 74)\n"
+	helpText += "  --ensembl-release Ensembl release version (default = 75)\n"
 	helpText += "  --reference-fasta VEP reference fasta\n"
 	helpText += "  --fork Number of forked processes used in VEP (default = 0) \n"
 	helpText += "Local databases (suppresses ReST too):\n"
 	helpText += "  --exac-vcf ExAC vcf.gz\n"
 	#helpText += "  --clinvar-tsv ClinVar (.tsv.gz download)\n"
 	#helpText += "  --clinvar-vcf ClinVar (.vcf.gz download)\n"
 	helpText += "  --mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz)\n"
-	helpText += "  --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n"
+	#helpText += "  --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n"
 	helpText += "Filters:\n"
 	helpText += "  --rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency):\n"
 	helpText += "  --vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag)\n"
-	helpText += "  --mutation-types Comma delimited list of types to allow\n"
+	helpText += "  --mutation-types Comma delimited list (no spaces) of types to allow\n"
 	helpText += "ReST batch sizes:\n"
-	helpText += "  -v VEP (#variants, default/max allowed = )\n"
+	helpText += "  -v VEP (#variants, default/max allowed = 150)\n"
 	helpText += "  -b ClinVar summary (#variants, default/max allowed = 500)\n"
 	helpText += "  -B ClinVar searchsize (#variants, default/max allowed = 50)\n"
 	helpText += "Custom columns (0-based)\n"

diff --git a/charger/charger.py b/charger/charger.py
@@ -109,7 +109,6 @@ def getInputData( self  , **kwargs ):
 		deNovoFile = kwargs.get( 'deNovo' , "" )
 		assumedDeNovoFile = kwargs.get( 'assumedDeNovo' , "" )
 		coSegregateFile = kwargs.get( 'coSegregate' , "" )
-		tcga = kwargs.get( 'tcga' , True )
 		diseaseFile = kwargs.get( 'diseases' , "" )
 		specific = kwargs.get( 'specific' , False )
 		self.getDiseases( diseaseFile , **kwargs )
@@ -131,6 +130,12 @@ def getInputData( self  , **kwargs ):
 			exacDone = self.readTSV( tsvFile , **kwargs )
 		if pathogenicVariantsFile:
 			self.readVCF( pathogenicVariantsFile , appendTo="pathogenic" , **kwargs )
+		if deNovoFile:
+			self.readDeNovo( deNovoFile )
+		if assumedDeNovoFile:
+			self.readAssumedDeNovo( assumedDeNovoFile )
+		if coSegregateFile:
+			self.readCoSegregate( coSegregateFile )
 		if expressionFile:
 			self.readExpression( expressionFile )
 		else: 
@@ -733,10 +738,10 @@ def readGeneList( self , inputFile , **kwargs ): # gene list formatted "gene", "
 			print "CharGer::readGeneList Error: bad gene list file"
 	def readDeNovo( self , inputFile ):
 		self.readOtherMAF( inputFile , varDict = self.deNovoVariants )
-	def readCoSegregate( self , inputFile ):
-		self.readOtherMAF( inputFile , varDict = self.coSegregateVariants )
 	def readAssumedDeNovo( self , inputFile ):
 		self.readOtherMAF( inputFile , varDict = self.assumedDeNovoVariants )
+	def readCoSegregate( self , inputFile ):
+		self.readOtherMAF( inputFile , varDict = self.coSegregateVariants )
 	def readOtherMAF( self , inputFile, varDict ):
 		try:
 			inFile = self.safeOpen( inputFile , 'r' , warning=True )
@@ -772,7 +777,15 @@ def getClinVar( self , **kwargs ):
 		macClinVarVCF = kwargs.get( 'macClinVarVCF' , None )
 		doClinVar = kwargs.get( 'clinvar' , False )
 		summaryBatchSize = kwargs.get( 'summaryBatchSize' , 500 )
-		searchBatchSize = kwargs.get( 'searchBatchSize' , 50 )
+		maxSearchBatchSize = 50
+		searchBatchSize = kwargs.get( 'searchBatchSize' , maxSearchBatchSize )
+		if searchBatchSize > maxSearchBatchSize:
+			message = "warning: ClinVar ReST search batch size given is "
+			message += "greater than max allowed ("
+			message += str( maxSearchBatchSize ) + ")"
+			message += ". Overriding to max search batch size."
+			print( message )
+			searchBatchSize = maxSearchBatchSize
 		#print( '  '.join( [ str( doClinVar ) , str( macClinVarTSV ) , str( macClinVarVCF ) ] ) ) 
 		if doClinVar:
 			if macClinVarTSV:
@@ -1421,15 +1434,15 @@ def checkClinVarPC( self , var , mod , **kwargs ):
 			#if genomic change is the same, then PS1
 				if clin["description"] == clinvarvariant.pathogenic:
 					if mod == "PS1":
-						print( "also PS1 via samGenomicVariant" )
+						#print( "also PS1 via samGenomicVariant" )
 						var.PS1 = True # already pathogenic still suffices to be PS1
 						called = 1
 			elif consequence.sameGenomicReference( clinvarVar ):
 			#if genomic change is different, but the peptide change is the same, then PS1
 				if clinvarVar.alternatePeptide == consequence.alternatePeptide: #same amino acid change
 					if clin["description"] == clinvarvariant.pathogenic:
 						if mod == "PS1":
-							print( "also PS1 via sameGenomicReference" )
+							#print( "also PS1 via sameGenomicReference" )
 							var.PS1 = True
 							called = 1
 			if consequence.samePeptideReference( clinvarVar ):
@@ -1438,7 +1451,7 @@ def checkClinVarPC( self , var , mod , **kwargs ):
 					if consequence.plausibleCodonFrame( clinvarVar ):
 						if clin["description"] == clinvarvariant.pathogenic:
 							if mod == "PM5":
-								print( "also PS5 via plausibleCodonFrame" )
+								#print( "also PS5 via plausibleCodonFrame" )
 								var.PM5 = True # already pathogenic still suffices to be PS1
 								called = 1
 		return called

diff --git a/setup.py b/setup.py
@@ -15,11 +15,13 @@
 		classification, CharGer cross-checks user variants against \
 		several public databases for information. CharGer then \
 		provides a pathogenicity score according to ACMG or \
-		CharGers custom scale.' ,
+		CharGer\'s custom scale.' ,
 	download_url = 'https://github.com/ding-lab/CharGer/archive/v' + \
 		version + '.tar.gz' ,
 	scripts = ['bin/charger'] ,
-	keywords = ["bioinformatics" , "genomics"] ,
+	keywords = [ "bioinformatics" , "genomics" , "pathogenic" , "benign" , \
+				"ClinVar" , "ExAC" , "VEP" , "BioMine" , "germline" , \
+				"variant" , "classifier" ] ,
 	classifiers = [ \
 		"License :: OSI Approved :: MIT License " , 
 		"Programming Language :: Python" ,