CanSRMaPP is a modeling tool for identifying a minimal feature set describing the metagenome of a cancer cohort.
- Free software: BSD license
- Source code: https://github.com/idekerlab/cansrmapp
- PyTorch 2.5+ with torchaudio, torchvision (tested on 2.5.0)
- tables
- matplotlib
- numpy
- pandas
- scikit-learn
- scikit-image
- scipy
- Python 3.11+
- CUDA 12.1 _only_ if using GPU
- Note
CUDA is only required for implementations using GPUs; feel free to ignore if not using GPU.
The root CanSRMaPP module automatically detects whether CUDA is set up; cmbuilder and in particular cmsolver will configure themselves to use the GPU if available.
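As an illustration of the kind of check described above, the following is a minimal sketch of GPU detection with PyTorch. It is not CanSRMaPP's own code: `pick_device` is a hypothetical helper name, and the only library call assumed is `torch.cuda.is_available()`. The sketch falls back to the CPU when PyTorch is not importable.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper for illustration; not part of the cansrmapp API.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so assume CPU.
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

Passing `prefer_gpu=False` forces the CPU path even on a machine where CUDA is set up, which can be handy when comparing runtimes.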
This tool depends on PyTorch, and the easiest way to get a clean installation is via Anaconda:
    conda create -n cansrmapp python=3.11 -y
    conda activate cansrmapp
    # install pytorch
    conda install pytorch torchvision -c pytorch
Building and installing the cansrmapp package:
    git clone https://github.com/idekerlab/cansrmapp
    cd cansrmapp
    pip install -r requirements_dev.txt
    make dist
    pip install dist/cansrmapp*whl
To fit CanSRMaPP models, two scripts are provided in demo/; the simplest invocation is:
    cd demo
    ./build.sh
    ./test-solve.sh
build.sh creates the CanSRMaPP input matrices; test-solve.sh solves them. To keep runtime low and ease debugging, some parameters in test-solve.sh are set such that the solver may not converge on an optimal solution; those in full-solve.sh are set to produce an optimal solution.
- Note
- Anecdotally, you can expect a single cycle of cmsolver to take about 1 minute on a GPU and up to 20 minutes when parallelized over multiple CPUs. Parallelization is largely handled by the backends of numpy, scipy, and pytorch, so if you wish to limit it, follow their documentation for setting the relevant environment variables.
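A minimal sketch of limiting CPU parallelism from Python, under stated assumptions: `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `OPENBLAS_NUM_THREADS` are the variables commonly read by the OpenMP, MKL, and OpenBLAS backends, but which one takes effect depends on how numpy, scipy, and PyTorch were built on your system. `limit_threads` is a hypothetical helper, not part of cansrmapp.

```python
import os


def limit_threads(n):
    """Set common thread-count variables to n.

    Call this before importing numpy, scipy, or torch, since the
    backends typically read these variables once at load time.
    """
    value = str(int(n))
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = value
    return value


limit_threads(4)

# Import the numerical libraries only after the variables are set.
# PyTorch additionally offers torch.set_num_threads(4) for its own
# intra-op thread pool.
```

The same variables can be exported in the shell before running the demo scripts, which avoids any dependence on import order.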
CanSRMaPP relies on a number of third-party files for reference and reconciling multiple data sources. This document describes the provenance of all such files, and hosts frozen copies since some may be updated in-place by the maintainers.
Homo_sapiens.gene_info was downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz on November 3, 2024. This file is unrestricted as described here.
GCF_000001405.40_GRCh38.p14_genomic.gff.gz was downloaded from this FTP directory on November 12, 2024. This file is unrestricted according to these terms.
The reduced file gff_reduced.gff.gz, derived from this one, is the result of running the command:
gunzip -c GCF_000001405.40_GRCh38.p14_genomic.gff.gz | awk -F' ' '$0 !~ /^#/ && $3 == "gene" && $9 ~/GeneID/ ' | gzip -c > gff_reduced.gff.gz
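For readers who prefer to reproduce the filter without awk, the following is a rough Python sketch of the same three conditions: skip comment lines, keep rows whose third column is `gene`, and require `GeneID` in the ninth column. It assumes the nine columns are tab-separated, as the GFF3 format specifies. The function names are hypothetical, and this is not the command used to produce the frozen file.

```python
import gzip


def is_gene_with_geneid(line):
    """Mirror the awk conditions for one GFF line."""
    if line.startswith("#"):
        return False
    fields = line.rstrip("\n").split("\t")
    # Column 3 is the feature type; column 9 holds the attributes.
    return len(fields) >= 9 and fields[2] == "gene" and "GeneID" in fields[8]


def reduce_gff(src_path, dst_path):
    """Copy matching lines from one gzipped GFF to another; return the count."""
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if is_gene_with_geneid(line):
                dst.write(line)
                kept += 1
    return kept


# Example call (paths as named in this document):
# reduce_gff("GCF_000001405.40_GRCh38.p14_genomic.gff.gz", "gff_reduced.gff.gz")
```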
"NeSTv0" is a precursor of the interaction map found in Zheng, Kelly, et al., 2021, prior to filtering for mutation-enriched systems. It is distributed here as nest.pickle with permission from the authors, and is subject to the license governing this repository. The file contains a dict object mapping each system to a set of member gene Entrez IDs. Because systems in this file are named Clusterx-y, an additional file, NeST_map_1.5_default_node_Nov20.csv, is incorporated to map these to their NEST IDs as published.
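The structure described above can be sketched as follows: load the pickled dict, then rename its Clusterx-y keys using a two-column lookup read from the CSV. This is an illustration only. The column names of NeST_map_1.5_default_node_Nov20.csv are not given in this document, so `cluster_col` and `nest_col` are placeholders you would need to fill in after inspecting the file, and all function names are hypothetical. As with any pickle, only load files from a source you trust.

```python
import csv
import pickle


def load_systems(pickle_path):
    """Load the dict mapping system name -> set of Entrez gene IDs."""
    with open(pickle_path, "rb") as fh:
        return pickle.load(fh)


def load_name_map(csv_path, cluster_col, nest_col):
    """Build {cluster name: NEST ID} from two CSV columns.

    cluster_col and nest_col are assumptions supplied by the caller;
    check the CSV header for the real column names.
    """
    with open(csv_path, newline="") as fh:
        return {row[cluster_col]: row[nest_col] for row in csv.DictReader(fh)}


def rename_systems(systems, name_map):
    """Return a new dict keyed by NEST ID; unmapped names are kept as-is."""
    return {name_map.get(name, name): set(genes)
            for name, genes in systems.items()}
```

Keeping unmapped names unchanged, rather than dropping them, makes it easy to spot systems that have no published NEST ID.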
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.