README

This repo provides a pipeline for analyzing the stability of taxa in all given alignments. We add the option of first subsampling the given set of alignments to generate a dataset of alignment that contain different numbers of sequences and also vary in sequence length.

Data

The path to the directory containing the data on which the pipeline is to be run needs to be provided in config.yaml as data_folder. This directory should contain subdirectories which themselves contain alignments as nexus or fasta files, or symlinks to such files.

Prerequisites

Python packages that are required for running this pipeline can be found in environment.yml. A conda environment called phylostability for our analysis can be installed using conda env create -f environment.yml.

The non-python utility CONSEL can be downloaded and installed separately using the instructions included on the homepage.

The line defining path_to_consel in the file config.yaml should be updated to reflect the local build of CONSEL

It also requires a local build of raxml-ng, which is extensively documented.

The line defining path_to_raxml in the file config.yaml should be updated to reflect the local build of RAxML

Running the pipeline

To run the workflow, execute:

snakemake -c36

Note that this will use 36 cores, you can adjust accordingly. The number of cores to be used also needs to be updated in config.yaml.

Subsampling datasets

By default, our pipeline subsamples the dataset you input in config.yaml so that num_samples (in config.yaml) alignments are chosen for the analysis. The num_samples alignments are chosen randomly so that different alignment lengths and number of taxa are included (aligmnents are binned according to these parameters and drawn uniformly from the bins). If you want to run the analysis on all provided alignments, set num_samples in config.yaml to be the total number of datasets. Alternatively, you could delete selected_data/ from the data_folder path in Analysis_snakefile and run snakemake --snakefile Analysis_snakefile -cX.

If the pipeline has run successfully, rerunning snakemake -c36 will add num_samples to the already existing run and will perform random forest regression and classification on all datasets together. The default stability_measure used in the analysis is the normalized Taxon Influence Index (normalised_tii), but this can be set to be rf_radius in config.yaml.

Output

The names of datasets that are subsampled and used in the stability analysis are saved in the file {data_folder}/selected_datasets.csv, where data_folder is the name of the directory containing all data as set in config.yaml. All data produced in the analysis is saved in the directory {data_folder}/selected_data/. This includes a number of csv files as well as alignments and trees, which are saved in directories named by the names of the alignment files as provided in the input data. The most important csv file is probably {data_folder}/selected_data/rf_regression_combined_statistics.csv, which contains all summary statistics computed for all taxa, which are then used for training random forests (or a balanced subset of these taxa is used). The following plots summarize results of our stability analysis, including results of random forest regression and classification, and can be found in {data_folder}/selected_data/plots:

Filename	Plot
au_pie_chart.pdf	Pie chart summarizing number of taxa for which AU-test is significant/non-significant and stability_measure is zero/non-zero
au_test_classifier_features	Feature importances of random forest classifier trained to predict AU-test output (significant vs non-significant)
au_test_classifier_results.pdf	ROC curve for classifier predicting AU-test significance vs non-significance
combined_random_forest_features.pdf	Feature importances of random forest classifier and regressor trained to predict stability_measure (classifier predicts TII zero vs non-zero)
discrete_random_forest_model_features.pdf	Feature importances of random forest classifier predicting stability_measure
dist_instability_to_low_bootstrap.pdf	Average normalized topological distance of unstable edges to closest low bootstrap support node ($<70$)
normalised_tii.pdf	Histogram showing number of times each normalized TII value is observed
random_forest_classifier_results.pdf	ROC curve of random forest classifier predicting stability_measure zero vs non-zero
random_forest_model_features.pdf	Feature importances of random forest regression predicting stability_measure
random_forest_results.pdf	Scatterplot of predicted vs true values of stability_measure
rf_radius_num_taxa.pdf	Scatterplot of RF radius vs number of taxa of alignment of the taxon for which we observe that RF radius
rf_radius.pdf	Histogram showing number of times each normalized RF radius is observed
tii.pdf	Histogram showing number of times each TII is observed (not normalized)

The random forests classifiers require a minimum amount of data in each class. If there are not enough datasets for all classes in the classifier, there will be an error message. The remaining analysis will be performed, but no random forest will be built and therefore not all output plots will be computed.

Name		Name	Last commit message	Last commit date
Latest commit History 426 Commits
scripts		scripts
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
Analysis_snakefile		Analysis_snakefile
README.md		README.md
Snakefile		Snakefile
Subsampling_snakefile		Subsampling_snakefile
config.yaml		config.yaml
environment.yml		environment.yml
test_input_alignment.fasta		test_input_alignment.fasta

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README

Data

Prerequisites

Running the pipeline

Subsampling datasets

Output

About

Releases

Packages

Contributors 3

Languages

matsengrp/phylostability

Folders and files

Latest commit

History

Repository files navigation

README

Data

Prerequisites

Running the pipeline

Subsampling datasets

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages