Default output type of vcf2phylip.py: too many ambiguous nucleotide sequences? #51

LiuCanidk · 2024-10-20T13:41:02Z

Hi, thanks for developing this tool

I ran vcf2phylip.py successfully, but the output seems to be amino acid sequences. My command and a screenshot of my output file are as follows:

python /work/share/acuwbf4fll/liucan/software/phylip/vcf2phylip-2.8/vcf2phylip.py -i /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/06.GVCF_filter/output/HND.SNV.recode.vcf --output-folder /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/09.phylotree --output-prefix HND_RNA_SNV

I did not find any parameter for setting the output type, but I would prefer a nucleotide sequence alignment as output. How can I do this?

Any suggestions would be greatly appreciated!

edgardomortiz · 2024-10-20T22:19:12Z

They are nucleotides; VCF doesn't support amino acids as far as I know. Your heterozygous genotypes are represented with ambiguity codes, see here https://www.promega.com/resources/guides/nucleic-acid-analysis/restriction-enzyme-resource/restriction-enzyme-resource-tables/iupac-ambiguity-codes-for-nucleotide-degeneracy/
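
To make the idea concrete, here is a minimal sketch of how a diploid SNP genotype can be collapsed into one IUPAC character. It is an illustration only, not the actual vcf2phylip code; the function name `genotype_to_base` and the choice of `N` for missing genotypes are assumptions made for this example.

```python
# Illustrative only: not the vcf2phylip source code.
# Standard IUPAC codes for two-base ambiguities.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_base(ref, alts, gt):
    """Collapse one diploid SNP genotype (e.g. '0/1') to a single character."""
    alleles = [ref] + alts                      # index 0 = REF, 1.. = ALT alleles
    indices = gt.replace("|", "/").split("/")   # treat phased and unphased alike
    if "." in indices:
        return "N"                              # missing call (assumed convention)
    bases = {alleles[int(i)] for i in indices}
    if len(bases) == 1:
        return bases.pop()                      # homozygous: plain nucleotide
    return IUPAC_PAIRS.get(frozenset(bases), "N")  # heterozygous: ambiguity code


het_call = genotype_to_base("A", ["G"], "0/1")   # heterozygous A/G
hom_call = genotype_to_base("A", ["G"], "1/1")   # homozygous G
```

A heterozygous A/G site therefore shows up as `R` in the alignment, which can look like a protein sequence at first glance because letters other than A, C, G and T appear.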

Edgardo

LiuCanidk · 2024-10-21T01:53:27Z

@edgardomortiz I see. Thanks for your reply! I then wonder whether it is normal that my translated phylip file is filled with ambiguity codes, and whether this would affect tree construction. If so, should I enable the --resolve-IUPAC parameter to force the choice of a single nucleotide?

edgardomortiz · 2024-10-21T14:53:18Z

I don't think it is a good idea to translate SNPs, since they are not contiguous in the genome. Besides that, degenerate nucleotides will create degenerate amino acids as well during translation. The option --resolve-IUPAC will choose one nucleotide at random when you have an ambiguity. You may try that, but I think it won't fix your issue of trying to translate SNPs (unless I am missing something about your specific VCF).
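
As a rough picture of what "choose one nucleotide at random" means, here is a minimal sketch. It is not the vcf2phylip implementation; the function name `resolve_ambiguities` is hypothetical, and only the two-base ambiguity codes are handled (anything else, such as `N` or gaps, is left unchanged).

```python
import random

# Bases represented by each two-base IUPAC ambiguity code.
IUPAC_EXPANSION = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def resolve_ambiguities(seq, rng):
    """Replace each two-base ambiguity code with one of its bases, picked at random."""
    return "".join(
        rng.choice(IUPAC_EXPANSION[c]) if c in IUPAC_EXPANSION else c
        for c in seq
    )


rng = random.Random(42)  # fixed seed so the sketch is reproducible
resolved = resolve_ambiguities("ARYN-C", rng)
```

Note that a random choice discards the information that the site was heterozygous, so two runs with different seeds can give different alignments.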

I hope this makes sense,

Edgardo

LiuCanidk · 2024-10-22T01:37:12Z

Thanks for your reply. By "translating SNPs" I meant converting the VCF format to an alignment format, e.g., phylip, for tree construction. My wording was misleading: I did not mean translating nucleotides into amino acid sequences. Sorry about that.

I agree that SNPs are discontinuous in the genome. I am just wondering why I got so many ambiguous sequences from the VCF and whether I should add the --resolve-IUPAC parameter to avoid this. That is, would too many ambiguous sequences hamper the downstream tree construction?

Thanks in advance

edgardomortiz · 2024-10-22T01:45:55Z

Ah I see, you meant converting VCF to another format (sorry for being pedantic, but translating has a biological meaning and I got confused). As I said above, you have heterozygous genotypes because I assume your organism is at least diploid. For phylogenetics it is common to use a single sequence per sample, and the way to achieve this is to represent both possible nucleotides with a single ambiguity code. As for the consequences of these ambiguities on your data, I can't predict them because I am not familiar with the organisms you are analyzing, but in general the more ambiguities there are, the less resolved a tree ends up.
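
To get a feel for how many ambiguities each sample carries, something like the following sketch could be used. It assumes a simple sequential PHYLIP layout (a header line, then one `name sequence` pair per line); the helper names `read_phylip` and `ambiguity_fraction` are made up for this example.

```python
# IUPAC codes that stand for more than one nucleotide (N included).
AMBIGUOUS = set("RYSWKMBDHVN")


def read_phylip(lines):
    """Parse sequential PHYLIP text: header line, then 'name sequence' per line."""
    rows = [ln for ln in lines if ln.strip()]
    records = {}
    for ln in rows[1:]:                 # rows[0] is the 'ntaxa nchar' header
        name, seq = ln.split(None, 1)
        records[name] = seq.strip()
    return records


def ambiguity_fraction(seq):
    """Fraction of sites in one sequence that are ambiguity codes."""
    seq = seq.upper()
    return sum(c in AMBIGUOUS for c in seq) / len(seq) if seq else 0.0


demo = ["2 4", "S1 ARYG", "S2 ACGT"]
fractions = {name: ambiguity_fraction(seq) for name, seq in read_phylip(demo).items()}
```

Samples with a much higher fraction than the rest may be worth a closer look before building a tree.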

Maybe your SNP calling settings were set up incorrectly? Maybe your reference genome is too distant? I don't know, I am just speculating here...

Edgardo

LiuCanidk · 2024-10-22T07:10:12Z

Oh, sorry for the missing information. The organism is human and, more specifically, the material is a cancer cell line that received some treatments.

I checked the VCF file and did find something odd: some genotypes are missing. That may be the cause, and it may be due to hard genotype filtering.
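
A check like this can be done with a few lines of plain Python. The sketch below assumes an uncompressed, tab-delimited VCF whose FORMAT column starts with GT; the function name `genotype_summary` is hypothetical.

```python
from collections import Counter


def genotype_summary(vcf_lines):
    """Count missing, heterozygous, and homozygous calls per sample."""
    samples = []
    counts = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                                  # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                      # sample names follow FORMAT
            counts = {s: Counter() for s in samples}
            continue
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[0].replace("|", "/")  # GT assumed to be first key
            alleles = gt.split("/")
            if "." in alleles:
                counts[sample]["missing"] += 1
            elif len(set(alleles)) > 1:
                counts[sample]["het"] += 1
            else:
                counts[sample]["hom"] += 1
    return counts


demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:0",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:9\t0|1:7",
]
summary = genotype_summary(demo)
```

For a real file, the same function can be fed an open file handle instead of the `demo` list.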

However, I wonder whether what you said about representing both possible nucleotides with a single ambiguity code could work. How can I achieve this?

edgardomortiz · 2024-10-22T14:12:59Z

Representing both possible nucleotides with a single ambiguity code is what the script does by default; it is the reason you have ambiguity codes in the first place. No need to do anything additional...

LiuCanidk changed the title ~~Default output type of vcf2phylip.py: Amino acid or nucleotide sequences~~ Default output type of vcf2phylip.py: too many ambiguous nucleotide sequences? Oct 22, 2024
