Default output type of vcf2phylip.py: too many ambiguous nucleotide sequences? #51

Open
LiuCanidk opened this issue Oct 20, 2024 · 7 comments

@LiuCanidk

Hi, thanks for developing this tool

I ran vcf2phylip.py successfully, but the output looks like amino acid sequences. My command and a screenshot of the output file are below:

python /work/share/acuwbf4fll/liucan/software/phylip/vcf2phylip-2.8/vcf2phylip.py -i /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/06.GVCF_filter/output/HND.SNV.recode.vcf --output-folder /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/09.phylotree --output-prefix HND_RNA_SNV

[image: screenshot of the output file]

I did not find any parameter for setting the output type, but I would prefer a nucleotide sequence alignment as output. How can I do this?

Any suggestions would be greatly appreciated!

@edgardomortiz
Owner

They are nucleotides; VCF doesn't support amino acids as far as I know. Your heterozygous genotypes are represented with IUPAC ambiguity codes, see here https://www.promega.com/resources/guides/nucleic-acid-analysis/restriction-enzyme-resource/restriction-enzyme-resource-tables/iupac-ambiguity-codes-for-nucleotide-degeneracy/

Edgardo
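
To make the explanation above concrete, here is a minimal Python sketch of how a diploid genotype can be collapsed into a single character using the IUPAC two-base codes. The helper `genotype_to_base` is hypothetical and is not taken from vcf2phylip's source. Writing missing genotypes as `N`, and falling back to `N` when more than two distinct bases are present, are simplifying assumptions made for the illustration.

```python
# IUPAC ambiguity codes for two-base combinations (standard nomenclature).
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_base(gt, ref, alts):
    """Collapse one GT field (e.g. '0/1') into a single character.

    Hypothetical helper for illustration; handles single-base alleles only.
    """
    alleles = [ref] + list(alts)
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "N"  # assumption: a missing genotype is written as N
    bases = frozenset(alleles[int(i)] for i in indices)
    if len(bases) == 1:
        return next(iter(bases))  # homozygous: the base itself
    # Heterozygous: one code stands for both bases (N if not a two-base set)
    return IUPAC_TWO_BASE.get(bases, "N")


print(genotype_to_base("0/0", "A", ["G"]))  # A
print(genotype_to_base("0/1", "A", ["G"]))  # R
print(genotype_to_base("./.", "A", ["G"]))  # N
```

Letters such as R, Y, K, M, S and W also occur in the amino acid alphabet, which is probably why an alignment with many heterozygous sites can look like a protein sequence at first glance.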

@LiuCanidk
Author

@edgardomortiz I see. Thanks for your reply! I then wonder whether it is normal that my translated phylip file is filled with ambiguity codes, and whether this would affect tree construction. If so, should I enable the --resolve-IUPAC option to force a single nucleotide to be chosen?

@edgardomortiz
Owner

I don't think it is a good idea to translate SNPs; they are not contiguous in the genome. Besides that, degenerate nucleotides will create degenerate amino acids as well during translation. The option --resolve-IUPAC will choose one nucleotide at random when you have an ambiguity. You may try that, but I think it won't fix your issue of trying to translate SNPs (unless I am missing something about your specific VCF).

I hope this makes sense,

Edgardo
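
As a rough illustration of what "choose one nucleotide at random" means, the following Python sketch replaces each two-base ambiguity code in a sequence with one of the bases it stands for. The function `resolve_ambiguities` is hypothetical and is not the script's actual implementation; it only mirrors the behavior described above, and the fixed seed is there so that the sketch is reproducible.

```python
import random

# Two-base IUPAC codes and the nucleotides each one stands for.
IUPAC_EXPANSION = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def resolve_ambiguities(seq, rng):
    """Replace each two-base ambiguity code with one of its bases at random.

    Any other character (A, C, G, T, N, gaps) is left unchanged.
    """
    return "".join(
        rng.choice(IUPAC_EXPANSION[c]) if c in IUPAC_EXPANSION else c
        for c in seq
    )


rng = random.Random(42)  # fixed seed, an assumption made for reproducibility
print(resolve_ambiguities("ARYNT", rng))
```

Resolving a code this way discards the information that the site was heterozygous, and the result can change between runs unless the random seed is fixed.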

@LiuCanidk
Author

Thanks for your reply. By "translating SNPs" I meant converting the VCF format to an alignment format, e.g. phylip, for tree construction. My wording was misleading: I did not mean translating nucleotides into amino acid sequences. Sorry about that.

I agree that SNPs are not contiguous in the genome. I am just wondering why I got so many ambiguous sites from the VCF and whether I should add the --resolve-IUPAC parameter to avoid this. That is, would too many ambiguity codes hamper the downstream tree construction?

Thanks in advance

@LiuCanidk LiuCanidk changed the title Default output type of vcf2phylip.py: Amino acid or nucleotide sequences Default output type of vcf2phylip.py: too many ambiguous nucleotide sequences? Oct 22, 2024
@edgardomortiz
Owner

edgardomortiz commented Oct 22, 2024

Ah I see, you meant converting VCF to another format (sorry for being pedantic, but translating has a biological meaning and I got confused). As I said above, you have heterozygous genotypes because I assume your organism is at least diploid. For phylogenetics it is common to use a single sequence per sample, and the way to achieve this is by representing both possible nucleotides with a single ambiguity code. As for the consequences of these ambiguities on your data, I can't predict them because I am obviously not familiar with the organisms you are analyzing, but in general I would say that the more ambiguities there are, the less resolved a tree ends up.

Maybe your SNP calling settings were set up incorrectly? Maybe your reference genome is too distant? I don't know, I am just speculating here...

Edgardo
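
One way to gauge how many ambiguities an alignment contains is to compute, per sample, the fraction of ambiguity codes and of missing characters. The sketch below assumes a sequential PHYLIP file with one whitespace-separated "name sequence" pair per line after the header; the function `summarize_phylip` and the example alignment are hypothetical.

```python
AMBIGUOUS = set("RYSWKM")  # two-base IUPAC codes
MISSING = set("N?-")       # characters treated as missing data here


def summarize_phylip(text):
    """Return {sample: (fraction_ambiguous, fraction_missing)}.

    Assumes sequential PHYLIP: a 'ntax nchar' header, then one
    'name sequence' pair per line.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    summary = {}
    for line in lines[1:]:  # skip the header line
        name, seq = line.split(None, 1)
        seq = seq.replace(" ", "").upper()
        n = len(seq)
        ambiguous = sum(c in AMBIGUOUS for c in seq) / n
        missing = sum(c in MISSING for c in seq) / n
        summary[name] = (ambiguous, missing)
    return summary


# Made-up alignment used only to exercise the function.
example = """\
2 8
sampleA ACGTRYNN
sampleB ACGTACGT
"""
for name, (amb, mis) in summarize_phylip(example).items():
    print(f"{name}\tambiguous={amb:.2f}\tmissing={mis:.2f}")
```

For a real file, the text could be read with `open(path).read()` and passed to the same function.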

@LiuCanidk
Author

LiuCanidk commented Oct 22, 2024

Oh, sorry for leaving that information out. The organism is human; more specifically, the material is a cancer cell line, with some treatments applied.

I checked the VCF file and did find something odd: some genotypes are missing. This may be the cause, and it may be due to hard genotype filtering.
[image: screenshot of the VCF file]
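
To quantify how many genotypes are missing, a small Python sketch such as the following could count missing GT calls per sample. The function `count_missing_genotypes` and the example records are hypothetical; the sketch assumes an uncompressed, tab-separated VCF in which GT is the first FORMAT key, as the VCF specification requires when GT is present.

```python
def count_missing_genotypes(vcf_lines):
    """Count missing GT calls (e.g. './.' or '.') per sample.

    Returns (missing_per_sample, number_of_records).
    """
    samples, missing, total = [], {}, 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            missing = {s: 0 for s in samples}
            continue
        total += 1
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]  # GT is the first FORMAT key
            if "." in gt.replace("|", "/").split("/"):
                missing[sample] += 1
    return missing, total


# Made-up records used only to exercise the function.
example = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "s1", "s2"]),
    "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", ".",
               "GT:DP", "0/1:12", "./.:."]),
    "\t".join(["chr1", "200", ".", "C", "T", ".", "PASS", ".",
               "GT:DP", "./.:.", "1/1:8"]),
]
missing, total = count_missing_genotypes(example)
for sample, n in missing.items():
    print(f"{sample}\t{n}/{total} genotypes missing")
```

An open file handle can be passed in place of the list, since the function only iterates over lines.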

However, I wonder whether what you said about representing both possible nucleotides with a single ambiguity code could work. How can I achieve this?

@edgardomortiz
Owner

edgardomortiz commented Oct 22, 2024

This is what the script does by default; it is the reason you have the ambiguity codes in the first place. No need to do anything additional...
