Releases: tschuelia/PyPythia

v2.0.0

05 Mar 12:42

Pythia v2.0.0

Ever since I started working on Pythia at the beginning of my PhD, I have learned a lot about phylogenetics and about coding. So this release, shortly before I finish my PhD, feels a little like a time capsule 🙂 Okay, enough sentiment, let's talk about changes: I completely refactored the code and changed one of the most influential features of Pythia. The prediction accuracy is not affected by this, but Pythia is now about twice as fast as its predecessor, which is nice!

As a lot of things have changed in Pythia over time, we will soon publish a pre-print that explains all changes in detail, including new analyses of the Pythia predictive performance, so stay tuned! 🙂

Breaking changes:

  • We trained a new LightGBM booster model that uses only 24 instead of 100 maximum parsimony trees. The new model predicts the difficulty as accurately as the previous version while being faster (more on runtime below).
  • We majorly refactored the codebase, with the biggest changes in pypythia.msa. Instead of using the parsed Biopython MSA, we transform the sequences to numpy byte char arrays (dtype S1). This enables us to easily compute the number of patterns, proportion of gaps and proportion of invariant sites directly without using RAxML-NG as we previously did. It also improves the runtime of computing the Entropy, Pattern-Entropy, and Bollback Multinomial.
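
To illustrate why a character array makes these features cheap to compute, here is a minimal sketch on a toy alignment. It is not the code in pypythia.msa: the variable names are made up for this example, and the definitions are simplified (a gap is treated as an ordinary character when deciding whether a site is invariant, which may differ from what Pythia or RAxML-NG do).

```python
import numpy as np

# Toy alignment: 4 sequences (rows) x 6 sites (columns); "-" marks a gap.
sequences = ["ACGT-A", "ACGTTA", "ACCT-A", "ACCTTA"]

# One byte per character (dtype S1), shape (n_sequences, n_sites).
msa = np.array([list(seq) for seq in sequences], dtype="S1")
n_sites = msa.shape[1]

# A pattern is a distinct alignment column; collect the raw bytes of each column.
patterns = {msa[:, i].tobytes() for i in range(n_sites)}
n_patterns = len(patterns)

# Proportion of gap characters over all cells of the alignment.
prop_gaps = float((msa == b"-").mean())

# Simplified assumption: a site is invariant if every row equals the first row.
prop_invariant = float((msa == msa[0]).all(axis=0).mean())

print(n_patterns, prop_gaps, prop_invariant)
```

All three values come from plain array operations on the alignment, so no external tool is needed for them in this sketch.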

Thanks to the reduced number of maximum parsimony trees and our changes in MSA representation, Pythia 2.0 is about 2 times faster compared to its predecessor 🚀

  • This refactoring changed the interface of the Python API for loading MSAs. Instead of initializing an MSA object via pypythia.msa.MSA("path/to/msa.phy"), you now have to use pypythia.msa.parse_msa(pathlib.Path("path/to/msa.phy")).
  • We removed the legacy support for the older Pythia versions using a scikit-learn RandomForestRegressor or lightgbm.LGBMRegressor (Pythia versions 0.0 – 1.2.1). From this version on, we only support lightgbm.Booster models. This should resolve a lot of issues with broken dependencies leading to old versions being installed.
  • We changed the command line interface, see the extra section below. The most important change is that Pythia now removes duplicate and full-gap sequences by default. You have to explicitly disable this behavior if you are sure you want to predict the difficulty for the full MSA.
  • We now provide an all-in-one convenience function for predicting the difficulty for an MSA via the Python API: pypythia.prediction.predict_difficulty. You only need to pass the filepath of the MSA (and maybe RAxML-NG depending on your setup) to get the predicted difficulty.
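
The default reduction mentioned above (dropping duplicate and full-gap sequences) can be pictured with the following sketch. It is only an illustration in plain Python, not Pythia's implementation: reduce_msa is a hypothetical helper, and both the set of characters counted as gaps and the choice to keep the first of several identical sequences are assumptions.

```python
def reduce_msa(records, gap_chars="-?"):
    """Drop full-gap sequences, then keep the first occurrence of each distinct sequence.

    records is a list of (name, sequence) tuples. gap_chars is an assumption
    made for this sketch, not necessarily the set Pythia uses.
    """
    seen = set()
    reduced = []
    for name, seq in records:
        # A full-gap sequence consists only of gap characters.
        if all(ch in gap_chars for ch in seq):
            continue
        # Skip sequences that are identical to one we already kept.
        if seq in seen:
            continue
        seen.add(seq)
        reduced.append((name, seq))
    return reduced


records = [("t1", "ACGT"), ("t2", "ACGT"), ("t3", "----"), ("t4", "AC-T")]
reduced = reduce_msa(records)
print(reduced)
```

In this toy input, t2 duplicates t1 and t3 contains only gaps, so only t1 and t4 remain.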

Command line changes:

  • We removed the --removeDuplicates and --removeFullGaps flags. By default, Pythia removes duplicate sequences and full-gap sequences. If you want to disable this behavior and predict the difficulty for your full MSA, you can use the new flags --forceDuplicates and --forceFullGaps.

  • We removed the --verbose and --benchmark flags. Pythia now always prints the computed features and the total runtime. We no longer print the runtime for the computation of individual features. Please use the Python API if you want to benchmark this.

  • Pythia now writes more result files. By default, Pythia writes the following files:

    • A logfile containing the same information as printed to the terminal: {result_prefix}.pythia.log
    • The reduced MSA file in case the input MSA contained duplicate/full-gap sequences (and the reduction was not disabled): {result_prefix}.reduced.phy
    • The inferred parsimony trees in Newick format: {result_prefix}.pythia.trees
    • The Shapley values as a waterfall plot (if --shap is set): {result_prefix}.shap.pdf
    • The features and predicted difficulty as a CSV file: {result_prefix}.pythia.csv

    The result_prefix can be set using the --prefix command line option. If not set, Pythia uses the MSA file as prefix. You can prevent Pythia from writing any files via the flag --nofiles.

  • Pythia now provides a default for the --raxmlng / -r flag: Pythia searches for a raxml-ng binary in your $PATH, so you only have to pass a RAxML-NG path if raxml-ng cannot be found there.
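
A $PATH lookup like the one described for the --raxmlng default can be sketched with the standard library. This is a hedged illustration and not Pythia's actual code: resolve_raxmlng and its parameters are hypothetical names invented for this example.

```python
import shutil
from typing import Optional


def resolve_raxmlng(user_path: Optional[str] = None, binary_name: str = "raxml-ng") -> str:
    """Return the user-provided path, or fall back to searching PATH for the binary."""
    if user_path is not None:
        # An explicitly passed path always wins over the PATH lookup.
        return user_path
    found = shutil.which(binary_name)
    if found is None:
        raise FileNotFoundError(
            f"{binary_name} not found in PATH; please pass its location explicitly"
        )
    return found


print(resolve_raxmlng("/opt/tools/raxml-ng"))
```

Failing early with a clear message when the binary is missing is a design choice of this sketch. How Pythia itself reports a missing RAxML-NG is not stated in these notes.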

Other

Starting with Pythia 2.0, the package is also available on PyPI, allowing easy installation of Pythia via pip:
pip install pythiaphylopredictor

v1.2.1

16 Oct 09:25

In my last release, I uploaded a predictor with an incorrect number of features (I missed the proportion of unique parsimony topologies). This release includes the predictor with this feature.

The updated (correct) performance is: MAE = 0.06, MAPE = 1.4%.

v1.2.0

15 Oct 11:35

This release includes the following changes:

New Features

Includes a new command line option --forceDuplicates that forces Pythia to predict the difficulty for an MSA that contains duplicate sequences (the default behavior is still to fail).

Updates

We retrained Pythia and optimized the parameters using Optuna. This slightly improves the performance to an MAE of 0.06 (previously 0.07) and a MAPE of 1.6% (previously 1.7%) 🥳
The new predictor is available as predictor_lgb_v1.2.0.pckl and latest.pckl.

Bug Fixes

Fixes a bug that caused the --shap option to fail due to an update in the shap package API (fixes #15), thanks @computations for the fix!

v1.1.4

25 Sep 16:28

This release fixes a bug when using the --removeDuplicates option: the RAxML-NG parsimony tree features were computed on the old MSA with duplicate sequences instead of the deduplicated MSA.

v1.1.3

19 Aug 20:04

The provided Pythia predictors are not compatible with the latest major release of LightGBM (v4.0.0) and the new minor scikit-learn release (v1.3). This release pins the versions of both packages to compatible releases.

v1.1.2

30 May 07:38

Fix issues with shap package

Importing the shap module takes some time, so it will now only be imported if --shap is called.
Also, shap raises NumbaDeprecationWarnings that clutter the output of Pythia, so I suppressed them for now until shap fixes this (see shap/shap#2909)

v1.1.1

26 May 16:40

Remove shap support for the PyPythia v0.0.0 scikit-learn predictor. This removes the need to pin the Python version to <3.11 and to fix the numpy version.

v1.1.0

26 May 12:44

We trained Pythia on even more data! Our new, way larger set of training data consists of:

  • 11 108 DNA MSAs
  • 979 Protein MSAs
  • 460 Morphological MSAs
    = 12 547 MSAs, all empirical data of course :-)

The new predictor shows an improved accuracy 🥳

  • Mean absolute error: 0.07 (previously 0.09)
  • Mean absolute percentage error: 1.7% (previously 2.5%)

This new Pythia predictor 1.1.0 is available as predictors/predictor_lgb_v1.1.0.pckl and will replace the last version in predictors/latest.pckl.

Changes

  • The new retrained predictor will be the default predictor, so predictors/latest.pckl is identical to predictors/predictor_lgb_v1.1.0.pckl. The previous predictors of Pythia < 1.1.0 are still available and fully supported.
  • Pythia is trained on two additional features: the patterns-over-sites ratio and an entropy-like measurement based on the number and frequency of patterns in the MSA.
  • Pythia now supports parallel inference of the parsimony trees with RAxML-NG. You can set the number of threads using the new command line parameter --threads. Note that you need RAxML-NG version ≥ 1.2.0 to use the --threads option.

Introducing Shapley Values (experimental feature)

To allow more detailed insights into the prediction of Pythia, we include Shapley values with this version. To get more information on what Shapley values are and how to interpret them, refer to the wiki. The new command line parameter --shap will create a so-called waterfall plot and save it as {msa_name}.shap.pdf. Please make sure you understand what Shapley values are and what you can infer based on this plot before drawing conclusions!
This new feature is fully backwards compatible with all previous predictors.

v1.0.1

03 Nov 14:58

New features:

  • allow manual setting of MSA file format
  • include difficulty prediction script that requires no installation

Minor Bug fixes:

  • fix LightGBM issue when using Python multiprocessing
  • use the user-defined precision for printing features in verbose mode
  • fix issues with logging when using PyPythia from code

v1.0.0

10 Oct 15:02

Release Summary

We retrained Pythia using additional data and now include full support of morphological data 🎉
Our new set of training data consists of:

  • 3250 empirical DNA and Protein datasets obtained from TreeBase (same as in version 0.0.1)
  • 538 additional empirical DNA and Protein datasets obtained via our RAxML-Grove
  • 474 additional morphological datasets obtained from TreeBase
    = 4262 datasets in total

The resulting predictor has about the same accuracy as the previous predictor, with a slight improvement of the mean absolute percentage error:

  • Mean absolute error: 0.09
  • Mean absolute percentage error: 2.5%

We are now using LightGBM’s boosted trees instead of scikit-learn’s random forest

  • Pythia 1.0.0 is backwards compatible with the scikit-learn random forest predictor of Pythia version 0.0.1. This predictor is still available in predictors/predictor_sklearn_rf_v0.0.1.pckl

Breaking Changes

  • The default predictor changed to the new LightGBM predictor (predictors/predictor_lgb_v1.0.0.pckl), which introduces an additional dependency: LightGBM. Since this predictor was retrained using additional data, the predictions of this version will likely differ from those of previous versions.
  • Identical sequences in the MSA:
    • by default: Pythia refuses to predict the difficulty for MSAs that contain identical sequences
    • new --removeDuplicates option: if the MSA contains duplicate sequences Pythia stores a reduced alignment and predicts the difficulty for this reduced alignment
  • The exceptions in msa.py changed: instead of ValueError, Pythia now raises a custom PyPythiaException.
  • We changed the DataType type definition to an Enum instead of a string, see custom_types.py for more details.
  • We renamed the predictor_path parameter in predictor.DifficultyPredictor to predictor_handle.

Minor Changes

  • Improved logging for command line interface
  • new --quiet mode to suppress intermediate information
  • predictor.DifficultyPredictor now accepts a set of features in its constructor, allowing predictions with experimental difficulty predictors that were trained using a different set of features than ours

Full Changelog: 0.0.1...1.0.0