How can I find out number of patterns and number of partitions from my input file without running Beast2? #1134

Open
mzhuangsdsc opened this issue Nov 4, 2023 · 6 comments

Comments

@mzhuangsdsc

Hello,
I have a Beast2 input file. I know that if I run the Beast2 jar, I can find the number of patterns and the number of partitions in the log file.
Is there a script somewhere that can give me that information without running the Beast2 jar?
Or can somebody tell me how to write a simple script to get the number of patterns and the number of partitions from my input file?

Thank you very much!

@rbouckaert
Member

If you run beast -validate beast.xml from the command line, BEAST will parse the XML but will not start the MCMC. It prints out the pattern and site counts for each of the alignments in the XML, just like when starting any other BEAST run. Would that be sufficient for what you need?

@achourasia

@rbouckaert we would like to fetch two specific values, the pattern and site counts, from the input file in a PHP application. If you could provide any pointers or rules for parsing this information from the XML file, that would be great, as we don't know much about the structure of the input file or its biological interpretation.

@rbouckaert
Member

Not familiar with php, but I suppose it can launch an application and parse its output. If so, you could install the Babel package for BEAST 2 and run something like
applauncher Nexus2Fasta -in alignment.nex -out /dev/null | grep patterns
which converts the alignment to FASTA but, as a side effect, prints out the number of patterns (and taxa and sites).
You could also write your own package and start with the code for Nexus2Fasta, which is here: https://github.com/rbouckaert/Babel/blob/master/src/babel/tools/Nexus2Fasta.java and remove the parts for exporting fasta.

@achourasia

achourasia commented Nov 6, 2023

Thanks for the additional information. We are unable to install other tools and run them on our server, so we need to parse the input XML file, which is easy to do in PHP. However, we don't know what to look for in the XML file. If you or someone else could give us hints on which structures to pull out of the XML file and how to combine them to get the partition and pattern counts, that would do the trick for us.

@rbouckaert
Member

I see: what you are looking for are the alignments, which typically have the attribute spec="Alignment".
Each alignment contains sequences in sequence elements. The sequence data can be found in the value attribute of the sequence elements.
If there are partitions (like splits on codon positions, or for different genes), there may be elements with attribute spec="FilteredAlignment" and a reference to the main alignment. Further, there is a filter attribute that specifies which sites to select from the main alignment as follows:

            First site is 1.
            Filter specs are comma separated, either a singleton, a range [from]-[to] or iteration [from]:[to]:[step]; 
            1-100 defines a range, 
            1-100\3 or 1:100:3 defines every third in range 1-100, 
            1::3,2::3 removes every third site. 
            Default for range [1]-[last site], default for iterator [1]:[last site]:[1]
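The filter rules above could be turned into site indices roughly as follows. This is only a sketch in Python, not BEAST's own parser: the function name `parse_filter` and its `n_sites` parameter are made up for illustration, and returning the selected sites as a sorted, de-duplicated list of 0-based indices is an assumption about how overlapping specs should combine.

```python
def parse_filter(spec: str, n_sites: int) -> list[int]:
    """Return sorted 0-based site indices selected by a filter string.

    Hypothetical helper: handles singletons ("5"), ranges ("1-100"),
    ranges with a step ("1-100\\3") and iterators ("1:100:3"), with the
    defaults quoted above (from=1, to=last site, step=1).
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            # iterator form [from]:[to]:[step]
            fields = part.split(":")
            fields += [""] * (3 - len(fields))
            start_s, end_s, step_s = fields[:3]
        elif "-" in part:
            # range form [from]-[to], optionally followed by \step
            rng, _, step_s = part.partition("\\")
            start_s, _, end_s = rng.partition("-")
        else:
            # singleton site
            start_s, end_s, step_s = part, part, ""
        start = int(start_s) if start_s.strip() else 1
        end = int(end_s) if end_s.strip() else n_sites
        step = int(step_s) if step_s.strip() else 1
        # sites are 1-based in the filter, 0-based in the result
        selected.update(range(start - 1, min(end, n_sites), step))
    return sorted(selected)


if __name__ == "__main__":
    # codon positions 1 and 2 of a 9-site alignment
    print(parse_filter("1::3,2::3", 9))
    # every third site of the range 1-10
    print(parse_filter(r"1-10\3", 10))
```

The same logic ports directly to PHP with `explode` and `range`; the number of selected indices is the site count of that partition.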

When BEAST runs, it shows information from main alignments as well as filtered alignments.
Hope this helps.
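For reference, a rough Python sketch of how taxon, site and pattern counts might be read straight from the XML, based only on the structure described above. It makes several assumptions: alignments are marked with a `spec` attribute whose last dotted component is `Alignment`, sequence data sits in the `value` attribute of `sequence` elements, and a "pattern" is simply a distinct alignment column after upper-casing. The function names and the `site_indices` parameter are invented for illustration, and BEAST may count differently (for example when ambiguity codes map to the same state), so the numbers should be checked against a real `beast -validate` run.

```python
import xml.etree.ElementTree as ET


def count_patterns(sequences, site_indices=None):
    """Count distinct alignment columns (assumed definition of a pattern).

    site_indices is an optional iterable of 0-based sites, e.g. the sites
    selected by a partition filter; by default all sites are used.
    """
    if not sequences:
        return 0
    n_sites = min(len(s) for s in sequences)
    if site_indices is None:
        site_indices = range(n_sites)
    columns = {tuple(seq[i] for seq in sequences) for i in site_indices}
    return len(columns)


def alignment_summaries(xml_text):
    """Map alignment id -> (taxa, sites, patterns) for each main alignment."""
    root = ET.fromstring(xml_text)
    summaries = {}
    for elem in root.iter():
        spec = elem.get("spec", "")
        # accept "Alignment" or a fully qualified class name ending in it,
        # but not "FilteredAlignment"
        if spec.split(".")[-1] != "Alignment":
            continue
        seqs = [
            "".join(s.get("value", "").split()).upper()
            for s in elem.iter("sequence")
        ]
        if not seqs:
            continue
        n_sites = min(len(s) for s in seqs)
        summaries[elem.get("id", "")] = (len(seqs), n_sites, count_patterns(seqs))
    return summaries


if __name__ == "__main__":
    demo = """<beast>
      <data id="demo" spec="Alignment">
        <sequence taxon="a" value="ACGTAC"/>
        <sequence taxon="b" value="ACGTAC"/>
        <sequence taxon="c" value="ACGAAC"/>
      </data>
    </beast>"""
    print(alignment_summaries(demo))  # {'demo': (3, 6, 4)}
```

Each column is visited once and stored in a set, so time and memory grow with the alignment size (taxa × sites) and nothing beyond the alignment itself is kept in memory.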

@achourasia

achourasia commented Nov 20, 2023

@rbouckaert thanks for sharing the additional information. This helps us identify the partitions easily, but we also need to find the number of patterns. I discussed this with one of my colleagues who has significant experience with phylogeny codes. He explained that counting patterns is computationally intensive and complex, and is further complicated by input files being constructed in a few different formats. This would essentially mean recreating the entire parser in PHP and dealing with the memory and compute requirements of identifying the number of patterns. So we'll need to step back and use other existing tools like IQtree to calculate and provide this information. Thanks again though.

3 participants