Sempervirens: A Fast Reconstruction Algorithm for Noisy and Incomplete Matrix Representations of Trees
This archive is distributed in association with the INFORMS Journal on Computing under the MIT License.
The software and data in this repository are a snapshot of the software and data that were used in the research reported on in the paper Sempervirens: A Fast Reconstruction Algorithm for Noisy and Incomplete Matrix Representations of Trees. The snapshot is based on this SHA in the development repository.
**Important: This code is being developed on an on-going basis at https://github.com/nevenag/sempervirens. Please go there if you would like to get a more recent version or would like support.
To cite the contents of this repository, please cite both the paper and this repo, using their respective DOIs.
https://doi.org/10.1287/ijoc.2023.0373
https://doi.org/10.1287/ijoc.2023.0373.cd
Below is the BibTex for citing this snapshot of the repository.
@misc{Sempervirens
author = {N. Junnarkar, C. Kizilkale, N. Golubovic, M. Arcak, A. Buluc},
publisher = {INFORMS Journal on Computing},
title = {{Sempervirens: A Fast Reconstruction Algorithm for Noisy and Incomplete Matrix Representations of Treest}},
year = {2023},
doi = {10.1287/ijoc.2023.0373.cd},
url = {https://github.com/INFORMSJoC/2023.0373},
note = {Available for download at https://github.com/INFORMSJoC/2023.0373},
}
- Quick Start
- Installation
- Commandline Usage
- Library Usage
- Input Output Format
- Using the Rust Implementation
- Files and Directories
- Contact Information
Assuming Python and dependencies are present you can run:
git clone [email protected]:nevenag/sempervirens.git
cd sempervirens
python sempervirens/reconstructor.py sample_data/noisy_data.SC 0.001 0.2 0.05 -o reconstructed_data.SC.CFMatrix
The file reconstructor.py
is a standalone implementation; the file can be moved to wherever necessary.
Sempervirens requires Python and only two Python packages: NumPy and Pandas. Pandas can be omitted if only using Sempervirens as a library.
Python dependencies can be installed as follows:
brew install [email protected]
brew install pip
pip install numpy
pip install pandas
Follow the steps from Quick Start or below to continue.
All code has been tested with Python 3.10.
The following reconstructs the noisy matrix in noisy_matrix_filename
given false positive probability fpp
, false negative probability fnp
, and missing entry probability mep
. The reconstructed matrix is written to noisy_matrix_filename.CFMatrix
.
python reconstructor.py noisy_matrix_filename fpp fnp mep
The output file can be specified with the optional -o
flag:
python reconstructor.py noisy_matrix_filename fpp fnp mep -o output_filename
Help information can be found by running python reconstructor.py --help
.
The reconstructor.py
file can be imported and used as a library in other Python code.
Example usage is as follows.
from reconstructor import reconstruct
...
reconstruction = reconstruct(noisy_mat, fpp, fnp, mep)
...
An example of using Pandas to read and write matrices of files can be seen at the bottom of reconstructor.py
.
The input and output to the reconstruct
function is a NumPy matrix with n
rows and m
columns. This describes n
objects in terms of m
characters (for example, cells in terms of mutations). Element (i, j)
of the matrix can be either 0
, 1
, or 3
. If it is 0
, then object i
does not have character j
. If it is 1
, then object i
has character j
. If it is 3
, then it is unknown whether object i
has character j
(the output of reconstruct
will not have any 3
s).
Below is an example file used for input. The input file must be a tab separated value (TSV) file. The first row and column of the file are used as labels of the rows and columns respectively. The rest of the TSV must represent a matrix with the format described above. The output file will be of the same format as the input file, reusing the input file's row and column labels.
cellID/mutID mut0 mut1 mut2 mut3 mut4 mut5 mut6 mut7
cell0 0 0 3 0 0 0 0 0
cell1 0 3 1 0 0 0 1 1
cell2 0 0 1 0 0 0 1 1
cell3 1 1 0 0 0 0 0 0
cell4 0 0 1 0 3 0 0 0
cell5 1 0 0 0 0 0 0 0
cell6 0 0 1 0 0 0 1 1
cell7 0 0 1 0 0 0 0 0
cell8 3 0 0 0 3 0 3 1
cell9 0 1 0 0 0 0 0 0
A faster implementation is provided in sempervirens-rs
. This implementation must be compiled from source.
Once compiled, it is used similarly to the commandline Python implementation. Help information can be found by running ./sempevirens-rs --help
.
First make sure Rust is installed. Then, in the sempervirens-rs
directory, run the following. The binary will be target/release/sempervirens-rs
and can be moved to wherever needed.
cargo build --release
By default, a version of OpenBLAS is compiled into sempervirens-rs
.
For best performance, the Basic Linear Algebra Subprograms (BLAS) installation should be configured.
Configuring BLAS requires changing the following lines in Cargo.toml
:
blas-src = { version = "0.9", default-features = false, features = ["openblas"] }
openblas-src = { version = "0.10", default-features = false, features = ["cblas", "static"] }
as well as the following line at the top of sempervirens-rs/src/lib.rs
:
extern crate openblas_src;
Information on this can be found in blas-src. Examples for using system OpenBLAS and Intel MKL are below.
To use the system OpenBLAS installation (if available), change the lines in Cargo.toml
to:
blas-src = { version = "0.9", default-features = false, features = ["openblas"] }
openblas-src = { version = "0.10", default-features = false, features = ["cblas", "system"] }
and the line in lib.rs
to:
extern crate openblas_src;
To use the system Intel Math Kernel Library (Intel MKL) installation (if available), change the lines in Cargo.toml
to:
blas-src = { version = "*", default-features = false, features = ["intel-mkl"] }
intel-mkl-src = { version = "0.8.1", features = ["ENTER_INSTALLATION_HERE"] }
and the line in lib.rs
to:
extern crate intel_mkl_src;
The text "ENTER_INSTALLATION_HERE" should be configured according to intel-mkl-src.
sempervirens
:
reconstructor.py
: The reconstruction code and the commandline interface code.metrics.py
: Functions for metrics; e.g. ancestor-descendant score.
sempervirens-rs
: The Rust implementation.
src/lib.rs
: The reconstruction code.src/main.rs
: The commandline interface code.
tests
: Code for evaluating algorithms on datasets and comparing them.
data
: Datasets for testing algorithms.
- Get input data from
data.zip
. noisy_data.SC
sample input file.
metrics
: Results of running algorithms on various datasets.
For questions, please contact Neelay Junnarkar and Can Kızılkale.