
v2.0.0


@tschuelia tschuelia released this 05 Mar 12:42

Pythia v2.0.0

Since I started working on Pythia at the beginning of my PhD, I have learned a lot about phylogenetics and about coding. So this release, shortly before I finish my PhD, feels a little like a time capsule 🙂 Okay, enough sentiment, let's talk about changes: I did a complete refactoring of the code and I changed one of the most influential features of Pythia. The prediction accuracy is not affected by this, but Pythia is now about twice as fast as its predecessor, which is nice!

As a lot of things have changed in Pythia over time, we will soon publish a pre-print that explains all changes in detail, including new analyses of the Pythia predictive performance, so stay tuned! 🙂

Breaking changes:

  • We trained a new lightGBM Boosting model that uses only 24 instead of 100 maximum parsimony trees. The new model is as accurate as the previous version in predicting the difficulty while being faster (more on runtime below).
  • We substantially refactored the codebase, with the biggest changes in pypythia.msa. Instead of using the parsed Biopython MSA, we transform the sequences to NumPy byte character arrays (dtype S1). This lets us compute the number of patterns, the proportion of gaps, and the proportion of invariant sites directly, without using RAxML-NG as we previously did. It also improves the runtime of computing the Entropy, Pattern-Entropy, and Bollback Multinomial.

Thanks to the reduced number of maximum parsimony trees and our changes in MSA representation, Pythia 2.0 is about twice as fast as its predecessor 🚀
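To illustrate why the byte array representation makes these features cheap to compute, here is a minimal sketch on a toy alignment. It is not Pythia's actual implementation: the variable names and data are made up, only "-" is treated as a gap, and the invariant-site rule is deliberately simplified.

```python
import numpy as np

# Toy alignment (hypothetical data): 4 taxa x 6 sites.
sequences = ["ACGT-A", "ACGTTA", "ACCT-A", "ACGT-A"]

# One byte per character, reshaped to (n_taxa, n_sites) with dtype S1.
msa = np.frombuffer("".join(sequences).encode("ascii"), dtype="S1").reshape(
    len(sequences), len(sequences[0])
)

# Proportion of gaps: fraction of all cells that hold a gap character.
# Assumption: only "-" counts as a gap in this sketch.
proportion_gaps = float((msa == b"-").mean())

# A site pattern is a distinct alignment column, so count unique columns.
n_patterns = len({column.tobytes() for column in msa.T})

# Simplification: a site counts as invariant only if every taxon has exactly
# the same character. Real tools may treat gaps and ambiguity codes differently.
proportion_invariant = float((msa == msa[0]).all(axis=0).mean())

print(proportion_gaps, n_patterns, proportion_invariant)
```

All three features reduce to vectorized comparisons on the same array, so no external tool call is needed for them.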

  • This refactoring changed the interface of the Python API for loading MSAs. Instead of initializing an MSA object via pypythia.msa.MSA("path/to/msa.phy"), you now have to use pypythia.msa.parse_msa(pathlib.Path("path/to/msa.phy")).
  • We removed the legacy support for the older Pythia versions using a scikit-learn RandomForestRegressor or lightgbm.LGBMRegressor (Pythia versions 0.0 – 1.2.1). From this version on, we only support lightgbm.Booster models. This should resolve a lot of issues with broken dependencies leading to old versions being installed.
  • We changed the command line interface, see the extra section below. The most important change is that Pythia now removes duplicate and full-gap sequences by default. You have to explicitly disable this if you are sure you want to predict the difficulty for the full MSA.
  • We now provide an all-in-one convenience function for predicting the difficulty for an MSA via the Python API: pypythia.prediction.predict_difficulty. You only need to pass the filepath of the MSA (and maybe RAxML-NG depending on your setup) to get the predicted difficulty.
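The new API entry points could be used roughly as follows. This is a sketch under assumptions: the function names come from the notes above, but the helper load_and_predict is hypothetical, and passing the path positionally to predict_difficulty and the shape of its return value are assumptions, so please check the package documentation for the exact signatures.

```python
import pathlib


def load_and_predict(msa_file: str):
    """Hypothetical helper: parse an MSA and predict its difficulty."""
    # Imported lazily so this sketch can be read without pypythia installed.
    from pypythia.msa import parse_msa
    from pypythia.prediction import predict_difficulty

    path = pathlib.Path(msa_file)

    # Replaces the old pypythia.msa.MSA("path/to/msa.phy") constructor.
    msa = parse_msa(path)

    # Assumption: the MSA path is the first argument and RAxML-NG is found
    # automatically; pass it explicitly if your setup requires it.
    difficulty = predict_difficulty(path)
    return msa, difficulty
```

Calling load_and_predict("path/to/msa.phy") requires pypythia 2.0 and, depending on your setup, a RAxML-NG binary.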

Command line changes:

  • We removed the --removeDuplicates and --removeFullGaps flags. By default, Pythia will remove duplicate sequences and full-gap sequences. If you want to disable this behavior and predict the difficulty for your full MSA, you can use the new flags --forceDuplicates and --forceFullGaps.

  • We removed the --verbose and --benchmark flags. Pythia now always prints the computed features and the total runtime. We no longer print the runtime of the individual feature computations; please use the Python API if you want to benchmark this.

  • Pythia now writes more result files. By default, Pythia writes the following files:

    • A logfile containing the same information as printed to the terminal: {result_prefix}.pythia.log
    • The reduced MSA file in case the input MSA contained duplicate/full-gap sequences (and the reduction was not disabled): {result_prefix}.reduced.phy
    • The inferred parsimony trees in Newick format: {result_prefix}.pythia.trees
    • The Shapley values as a waterfall plot (if --shap is set): {result_prefix}.shap.pdf
    • The features and predicted difficulty as a CSV file: {result_prefix}.pythia.csv

    The result_prefix can be set using the --prefix command line option. If not set, Pythia uses the MSA file as prefix. You can prevent Pythia from writing any files via the flag --nofiles.

  • Pythia now provides a default for the --raxmlng / -r flag: Pythia searches for a raxml-ng binary in your $PATH, so you only have to pass a RAxML-NG path if raxml-ng is not available there.
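For scripts that post-process Pythia runs, the file naming scheme and the RAxML-NG default described above can be mirrored in a few lines. This is only an illustrative sketch: the helper names expected_result_files and find_raxmlng are hypothetical, and using shutil.which is an assumption about how such a $PATH lookup could be done, not a description of Pythia's internals.

```python
import pathlib
import shutil
from typing import Dict, Optional


def expected_result_files(prefix: str, shap: bool = False) -> Dict[str, pathlib.Path]:
    """Hypothetical helper: map a result prefix to the documented file names."""
    files = {
        "log": pathlib.Path(f"{prefix}.pythia.log"),
        # Only written if the input MSA was actually reduced.
        "reduced_msa": pathlib.Path(f"{prefix}.reduced.phy"),
        "trees": pathlib.Path(f"{prefix}.pythia.trees"),
        "csv": pathlib.Path(f"{prefix}.pythia.csv"),
    }
    if shap:
        # Only written if --shap is set.
        files["shap"] = pathlib.Path(f"{prefix}.shap.pdf")
    return files


def find_raxmlng(explicit_path: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: prefer an explicit path, else search $PATH."""
    if explicit_path is not None:
        return explicit_path
    # Returns None if no raxml-ng executable is found in $PATH.
    return shutil.which("raxml-ng")
```

For example, expected_result_files("run1", shap=True)["shap"] gives the path run1.shap.pdf, and find_raxmlng() returns None when no raxml-ng binary is on the $PATH.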

Other

Starting with Pythia 2.0, the package is also available on PyPI, allowing easy installation of Pythia via pip:
pip install pythiaphylopredictor