Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cannot run phASER to my aligned data #57

Open
x811zou opened this issue Nov 1, 2019 · 5 comments
Open

cannot run phASER to my aligned data #57

x811zou opened this issue Nov 1, 2019 · 5 comments

Comments

@x811zou
Copy link

x811zou commented Nov 1, 2019

Hi!
I aligned HG00096 from 1000GP with STAR and Tophat separately. And then I ran phASER using the same parameter you gave in the tutorial.

The error message for running phASER with the BAM aligned with STAR is:

#2. Retrieving reads that overlap heterozygous sites...
     file: /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/Aligned.sortedByCoord.out.bam
          minimum mapq: 255
          mapping reads to variants...
[bam_parse_region] fail to determine the sequence name.
[main_samview] region "1:" specifies an unknown reference name. Continue anyway.
[samopen] SAM header is present: 25 sequences.
[sam_read1] reference 'user command line: STAR --runMode alignReads --genomeDir /data/reddylab/scarlett/1000G/data/STARIndex --runThreadN 24 --readFilesCommand zcat --readFilesIn /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R1.fastq.gz /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R2.fastq.gz --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/
AMtype BAM   Unsorted   SortedByCoordinate
' is recognized as '*'.
[main_samview] truncated file.
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/multiprocessing/pool.py", line 389, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())

The problem for using phASER with the BAM aligned with Tophat is that it stuck at first step forever ...

STARTED "Read backed phasing and ASE/haplotype analyses" ...
    DATE, TIME : 2019-11-01, 10:44:04
#1. Loading heterozygous variants into intervals...
Processing sample named HG00096
    using all the chromosomes ...
    processing VCF...

Does phASER work with Tophat aligned bam file? And what parameters do you specify for STAR alignment? I wonder whether that is the point where causing these errors.
Could you help me with this?
I'd appreciate your help!

Thanks,
Scarlett

@secastel
Copy link
Owner

secastel commented Nov 3, 2019

Hi Scarlett, I answered you in the other thread, but just to clarify for other people who may see this. It looks like you are using VCFs and BAMs with discordant chromosome naming. Please make sure that both the VCF and BAM have the same naming (e.g. "1" vs "chr1"). Once you have consistent naming between the two, make sure to use the appropriate annotation files with phASER (with or without "chr"), which can be downloaded from here: https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/

@x811zou
Copy link
Author

x811zou commented Nov 4, 2019

@secastel Thank you! I just figured it out! But my result gene_ae.txt columns (totalCount log2_aFC n_variants variants gw_phased) are all zeros. I am wondering whether it is because my input data is unphased, which only have 0/0,1/1,0/1,1/0 in VCF. Would it explain the "0" variants below? Does phASER work with unphased data?
Thanks!

#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...

     COMPLETED using 1264249 reads in 492 seconds using 4 threads
     PHASED  10111 of 30870 all variants (= 0.327535) with at least one other variant
     GENOME WIDE PHASED  0 of 34022 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 30870 variants (= 0.000000)
     Global maximum memory usage: 1262.21 (mb)

@secastel
Copy link
Owner

secastel commented Nov 5, 2019

phASER should work with an unphased VCF, although it will work much, much better if you start with a phased VCF. If you are working with human samples, I would strongly recommend phasing your VCF using a tool like the Sanger Imputation Server (https://imputation.sanger.ac.uk).

As for why you are seeing all zeros, I expect it may have something to do with a contig naming problem. When running phaser_gene_ae you need to make sure to use the right features file to match your chromosome naming. For example (assuming hg19), if your chromosomes are named “chr1” then you need to use this file: https://www.dropbox.com/s/am09zwpjhs01k8u/gencode.v19.GRCh37.genes.chr.bed.gz?dl=0 otherwise if they are named “1” you need to use this file: https://www.dropbox.com/s/1u9zo1kx61zx6ca/gencode.v19.GRCh37.genes.bed.gz?dl=0

Please let me know if this helps.

@x811zou
Copy link
Author

x811zou commented Nov 6, 2019

@secastel Thank you! I just have another question regarding to the "unphased" genotype from the output phased VCF file from phASER, I do not see the phased genotype for each variant.
I am attaching the first line from the input VCF and output VCF below:

##### Input VCF
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	20
chr1	14542	.	A	G	22.44	PASS	AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=22.44;SOR=1.609	GT:AD:DP:GQ:PL	1/1:0,1:1:3:32,3,0

###### Output VCF from phASER
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	20
chr1	14542	.	A	G	22.44	PASS	AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=22.44;SOR=1.609	GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM	1/1:0,1:1:3:32,3,0:1/1:.:.:1/1:.:.

I just checked haplotypic_counts.txt, and I found that instead of outputting the haplotype for each variant, the output files have listed the alleles for each (A/B) haplotype at each chromosome-position range. Please let me know if I miss anything! Thank you!

@secastel
Copy link
Owner

secastel commented Nov 8, 2019

The phase will be updated in the "GT" field of the outputted VCF if you have included the argument "--gw_phase_vcf 1". In the case of the first line that you've shown, first, that variant is not covered by any reads, thus phaser will not phase it, and second, it is a homozygous variant, so there is no phase to speak of. phASER runs best when your input VCF has been phased using population phasing. If you are working with human data, I would suggest using a tool like the Sanger Imputation Server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants