Release Summary

We retrained Pythia using additional data and now include full support for morphological data 🎉
Our new set of training data consists of:

3250 empirical DNA and Protein datasets obtained from TreeBase (same as in version 0.0.1)
538 additional empirical DNA and Protein datasets obtained via our RAxML-Grove
474 additional morphological datasets obtained from TreeBase
= 4262 datasets in total

The resulting predictor has about the same accuracy as the previous predictor, with a slight improvement in the mean absolute percentage error.

We now use LightGBM’s gradient-boosted trees instead of scikit-learn’s random forest.

Pythia 1.0.0 is backward compatible with the scikit-learn random forest predictor of Pythia version 0.0.1. This predictor is still available in predictors/predictor_sklearn_rf_v0.0.1.pckl.

Breaking Changes

The default predictor changed to the new LightGBM predictor (predictors/predictor_lgb_v1.0.0.pckl). Since this predictor was retrained using additional data, predictions from this version will likely differ from those of previous versions. The new predictor also introduces an additional dependency: LightGBM.
Identical sequences in the MSA:
- by default: Pythia refuses to predict the difficulty for MSAs that contain identical sequences
- new --removeDuplicates option: if the MSA contains duplicate sequences, Pythia stores a reduced alignment and predicts the difficulty for this reduced alignment
The exceptions in msa.py changed: instead of ValueError, Pythia now raises a custom PyPythiaException.
We changed the DataType type definition from a string to an Enum; see custom_types.py for more details.
We renamed the predictor_path parameter in predictor.DifficultyPredictor to predictor_handle.
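
The handling of identical sequences described above can be sketched as follows. This is a minimal illustration and not the code in msa.py: the function check_or_reduce, the dict-based alignment, and the SketchPythiaException class are hypothetical stand-ins for Pythia's own MSA handling and its custom exception. Note that the stand-in exception does not derive from ValueError; if the real exception behaves the same way, callers that previously caught ValueError need to be updated.

```python
class SketchPythiaException(Exception):
    """Hypothetical stand-in for the custom exception raised by Pythia."""


def check_or_reduce(sequences, remove_duplicates=False):
    """Mimic the two behaviours for MSAs with identical sequences.

    sequences: dict mapping taxon name -> aligned sequence string
    (an assumption made for this sketch, not Pythia's MSA representation).
    """
    first_taxon = {}
    for name, seq in sequences.items():
        # Remember only the first taxon seen for each distinct sequence.
        first_taxon.setdefault(seq, name)

    if len(first_taxon) == len(sequences):
        # No duplicates: the alignment is used unchanged.
        return dict(sequences)

    if not remove_duplicates:
        # Default behaviour: refuse to predict.
        raise SketchPythiaException(
            "MSA contains identical sequences; rerun with duplicates removed."
        )

    # Opt-in behaviour: continue with the reduced alignment.
    return {name: seq for seq, name in first_taxon.items()}


msa = {"t1": "ACGT", "t2": "ACGT", "t3": "ACGA"}
reduced = check_or_reduce(msa, remove_duplicates=True)
```

With remove_duplicates=True the sketch keeps the first taxon of each group of identical sequences; without it, the same input raises the stand-in exception.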

Improved logging for the command line interface
New --quiet mode to suppress intermediate information
predictor.DifficultyPredictor now accepts a set of features in its constructor, allowing predictions with experimental difficulty predictors that were trained using a different set of features than the predictors shipped with PyPythia
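
To illustrate why a configurable feature set is useful, here is a rough sketch of a predictor wrapper that selects only the features its model was trained on. Everything in it is an assumption made for illustration: the SketchPredictor class, its arguments, the feature names, and the toy model are hypothetical and do not reflect the actual constructor signature in predictor.py.

```python
class SketchPredictor:
    """Hypothetical wrapper pairing a model with the features it expects."""

    def __init__(self, model, features):
        # model: any callable mapping a list of feature values to a score.
        # features: ordered feature names the model was trained on.
        self.model = model
        self.features = list(features)

    def predict(self, msa_features):
        # Fail early if the computed features do not cover the model's needs.
        missing = [f for f in self.features if f not in msa_features]
        if missing:
            raise ValueError(f"Missing features: {missing}")
        # Pass values in training order and ignore unrelated features.
        row = [msa_features[f] for f in self.features]
        return self.model(row)


# Toy model for the sketch: the mean of the selected feature values.
def toy_model(row):
    return sum(row) / len(row)


experimental = SketchPredictor(toy_model, ["feature_a", "feature_b"])
score = experimental.predict(
    {"feature_a": 0.25, "feature_b": 0.75, "feature_c": 9.0}
)
```

Keeping the feature names next to the model means an experimental predictor trained on a different feature set can be swapped in without changing the code that computes the features.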

Full Changelog: 0.0.1...1.0.0