This repository contains an implementation of the multi-sample prize-collecting Steiner forest (Multi-PCSF) algorithm described in Gitter et al 2014. This code is provided for reproducibility of the results in the manuscript but is no longer under active development. The Omics Integrator website describes how to install the msgsteiner dependency required by Multi-PCSF.
Omics Integrator 2 from the Fraenkel laboratory contains a re-implementation of Multi-PCSF with additional features, such as support for a hierarchical clustering of samples.
BreastCancer.sh in the scripts subdirectory provides an example of how to run Multi-PCSF. Before running the script, the msgpath variable must be set to the location of the msgsteiner executable, including the file name.
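Because msgpath must include the file name and not only the directory, it can help to verify the value before launching a long run. The following is a minimal sketch of such a check; the helper name check_msgsteiner and the example path are hypothetical and are not part of the repository.

```python
import os


def check_msgsteiner(msgpath):
    """Return True if msgpath names an existing file that can be executed.

    Hypothetical helper for illustration; it is not part of Multi-PCSF.
    A directory fails the check because msgpath must include the file name.
    """
    return os.path.isfile(msgpath) and os.access(msgpath, os.X_OK)


if __name__ == "__main__":
    # Placeholder location; replace with wherever msgsteiner was installed.
    candidate = "/path/to/msgsteiner"
    if check_msgsteiner(candidate):
        print("msgpath looks usable:", candidate)
    else:
        print("msgpath is not an executable file:", candidate)
```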
The breast cancer tumor sample data and protein-protein interaction network data described in the Multi-PCSF manuscript are provided as an example dataset. If you use these data in a manuscript, cite TCGA 2012 for the breast cancer data and Szklarczyk et al 2011 for the STRING protein-protein interaction network and see their respective websites (TCGA, STRING) for the terms of use.
The Science Signaling Database of Cell Signaling EGFR pathway that was used to simulate samples is also provided in the data subdirectory. If you use this pathway in a manuscript, cite Gough 2002 and see the Database of Cell Signaling website for the terms of use.
Only the most commonly used options are described below. Use python ConstrainedMultiSample.py -h to view the complete usage message. See the provided example data for file formatting guidelines. Please open an issue with any usage questions.
Usage: ConstrainedMultiSample.py [options]
Options:
-h, --help show this help message and exit
--interactomepath=INTERACTOMEPATH
This path points to the directory that contains the
interaction network files
--terminalpath=TERMINALPATH
This path points to the directory that contains the
terminal (node prize) files
--resultpath=RESULTPATH
This path points to the directory where the output
files will be written.
--undirectedfile=UNDIRECTEDFILE
The name of the protein-protein interaction file in
the interactomepath directory. The file is expected
to contain undirected interactions with probabilistic
weights (e.g in [0,1]). Columns should be ordered
[prot1 prot2 weight].
--terminalfile=MASTERTERMINALFILE
A file in the terminalpath directory that lists the
files that give the node prizes for each sample. All
listed filenames should be relative to terminal path.
If gene penalties are given in the terminal files,
gene names should end with '_MRNA'. Optionally can
include a tab-separated second column that assigns
each sample to a group so the forests are only
constrained to be similar to other samples in the same
group.
--msgpath=MSGPATH The path and file name of the msgsteiner executable
--depth=DEPTH Depth parameter that limits the maximum depth from the
Steiner tree root to the leaves
--W=W The cost of the edges from the artificial root node to
its neighbors.
--beta=BETA The scaling factor applied to the node prizes, which
is used to control the relative strength of node
prizes and edge costs. This scaling is only performed
once when the initial stp files are created.
--lambda=LAMBDA1 The tradeoff coefficient for the penalty incurred by
nodes in the Steiner forests that are not in the set
of common nodes.
--alpha=LAMBDA2 The tradeoff coefficient for the reward on the size of
the set of common nodes when using unweighted
artificial prizes or the power to which the node
frequency is taken for weighted prizes.
--mu=MU A parameter used to penalize high-degree nodes from
being selected as Steiner nodes. Does not affect
prize nodes but does affect artificial prizes. The
penalty is -mu*degree. Set mu <= 0 to disable the
penalty (default).
--iterations=ITERATIONS
The number of iterations to run
--workers=WORKERS The number of worker processes to use in the
multiprocessing pool or threads to use in multi-
threaded belief propagation. Should not exceed the
number of cores available. Defaults to the number of
CPUs.
--artificialprizes=ARTIFICIALPRIZES
Use 'positive' or 'negative' prizes to encourage trees
to include common set proteins. Use
'positiveWeighted' or 'negativeWeighted' (default)
prizes to construct weighted artificial prizes based
on the node frequency in the most recent forests.
--dummyneighbors=DUMMYNEIGHBORS
Connect the dummy node to all 'prizes' (default) or
'nonprizes'.
--itermode=ITERMODE Learn forests simultaneously in 'batch' (default) or
sequentially in 'random' order. Batch mode computes
artificial prizes with respect to all forests at the
previous iteration. Random mode computes prizes for a
specific sample given the most recent forests for all
other samples.
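As an illustration of how the options above fit together, here is a minimal Python sketch that assembles a command line from a dictionary of option values. Only the option names come from the usage message. The helper build_command, every path and file name, and every numeric value are placeholders chosen for the example and are not recommended settings.

```python
def build_command(options, script="ConstrainedMultiSample.py"):
    """Turn a dict of option names and values into an argument list.

    Hypothetical helper for illustration. Each entry becomes --name=value,
    matching the option style shown in the usage message.
    """
    command = ["python", script]
    for name, value in options.items():
        command.append("--{}={}".format(name, value))
    return command


# Placeholder values only; see the provided example data and BreastCancer.sh
# for settings that were actually used.
example_options = {
    "interactomepath": "path/to/interactome_dir",
    "terminalpath": "path/to/terminal_dir",
    "resultpath": "path/to/results",
    "undirectedfile": "interactome.txt",
    "terminalfile": "samples.txt",
    "msgpath": "/path/to/msgsteiner",
    "depth": 10,
    "W": 0.1,
    "beta": 1.0,
    "lambda": 0.3,
    "alpha": 2,
    "iterations": 5,
    "artificialprizes": "negativeWeighted",
    "itermode": "batch",
}

command = build_command(example_options)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the directory that contains ConstrainedMultiSample.py.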
Several subdirectories are created in the directory specified by the --resultpath argument. The initial and itr* directories (one for each of the iterations specified by the --iterations argument) provide detailed information about intermediate results. Except for the last itr* directory, these can typically be deleted after Multi-PCSF terminates.
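The cleanup described above can be sketched as follows. This example only lists candidate directories and deletes nothing. It assumes that the itr* directories are named itr followed by an integer (for example itr1, itr2); that naming pattern and the helper name intermediate_dirs are assumptions, so check them against an actual result directory first.

```python
import os
import re


def intermediate_dirs(resultpath):
    """Return (last_itr, removable) for a Multi-PCSF result directory.

    last_itr is the name of the highest-numbered itr* directory, or None.
    removable lists the initial directory and all earlier itr* directories.
    Assumes itr* directories are named 'itr' plus an integer (an assumption).
    """
    numbered = []
    removable = []
    for name in os.listdir(resultpath):
        if not os.path.isdir(os.path.join(resultpath, name)):
            continue
        match = re.fullmatch(r"itr(\d+)", name)
        if match:
            numbered.append((int(match.group(1)), name))
        elif name == "initial":
            removable.append(name)
    numbered.sort()  # numeric order, so itr10 sorts after itr2
    last_itr = numbered[-1][1] if numbered else None
    removable.extend(name for _, name in numbered[:-1])
    return last_itr, removable
```

Any other subdirectory, including final, is left out of the removable list.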
The location of the final Multi-PCSF networks depends on the settings. If --artificialprizes was set to one of the negative prize options or only one iteration was run, the output networks are in the last itr* directory. If positive artificial prizes were used, a post-processing pruning step is executed. This runs the Steiner forest algorithm once more for each sample to prune nodes in the network that do not connect real prize nodes to the forest but rather were included only due to their positive artificial prizes. In this case, the output networks are in the final directory.
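The rule above can be written as a small decision function. This is a sketch of the rule as stated in this section, not code taken from ConstrainedMultiSample.py; the function name output_location and its return strings are hypothetical.

```python
NEGATIVE_OPTIONS = ("negative", "negativeWeighted")
POSITIVE_OPTIONS = ("positive", "positiveWeighted")


def output_location(artificialprizes, iterations):
    """Return which directory holds the final networks.

    Follows the rule described in this section: negative prize options or a
    single iteration use the last itr* directory; otherwise positive prizes
    trigger the pruning step and the networks are in the final directory.
    """
    if artificialprizes not in NEGATIVE_OPTIONS + POSITIVE_OPTIONS:
        raise ValueError("unknown --artificialprizes value: %r" % artificialprizes)
    if artificialprizes in NEGATIVE_OPTIONS or iterations == 1:
        return "last itr* directory"
    return "final directory"


print(output_location("negativeWeighted", 5))
print(output_location("positive", 5))
```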
The output directory contains intermediate files and the following files that are most useful for interpreting and visualizing the networks. For each input file <sample> listed in the --terminalfile input file, several output files will be created:
- symbol_<sample>_<options>.txt: <sample> is the input sample name and <options> are the values of the W, beta, and depth arguments. This space-separated file contains a line for each edge in the output network, where each line provides the names of the interacting proteins. The artificial root node DUMMY is still present. This is typically the most relevant representation of the output network. The edges are the same as the edges in the msgsteiner output file <sample>_<options>.txt.
- symbol_fullnetwork_<sample>_<options>.txt: <sample> is the input sample name and <options> are the values of the W, beta, and depth arguments. This tab-separated file contains a line for each edge in the output network. The artificial root node has been removed. The steiner edges are the edges from the optimal Steiner forest. The intra edges are additional edges that have been added back to the Steiner forest, which are sometimes useful for identifying alternative pathway connections.
- <sample>_<options>.output: Summary statistics of the Steiner forest produced.
- <sample>_<options>.objective: Output messages from the msgsteiner program, including optimization progress.
The other files are intermediate files used to create the input for msgsteiner or prepare the output network file from the msgsteiner output.
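Since the symbol_<sample>_<options>.txt file still contains the artificial root node, a common first step is to read the edges and drop those that touch DUMMY. The sketch below assumes that the first two whitespace-separated fields on each line are the two protein names; that column layout and the function name read_symbol_edges are assumptions, so compare against an actual output file before relying on it.

```python
def read_symbol_edges(path, root="DUMMY"):
    """Read edges from a symbol_<sample>_<options>.txt file.

    Returns a list of (protein1, protein2) tuples, skipping blank lines and
    any edge that involves the artificial root node. Assumes the first two
    whitespace-separated fields are the protein names (an assumption).
    """
    edges = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # blank or malformed line
            first, second = fields[0], fields[1]
            if root in (first, second):
                continue  # edge to the artificial root node
            edges.append((first, second))
    return edges
```

The returned list can be loaded into a graph library or written out for a network visualization tool.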
The simulation subdirectory contains the code that was used to simulate input samples from synthetic or real pathways. This code currently serves as extended documentation and is not runnable. It uses an old version of ConstrainedMultiSample.py and needs to be updated to use the refactored version, which accepts different command line arguments.
- Implement multi-sample functionality in Omics Integrator
- Refactor simulation code to generate prizes from known or synthetic pathways
- Document support for distinct groups of samples
- Nurcan Tuncbag
- Anthony Gitter
Portions of the Multi-PCSF software were developed with support from Microsoft Research while Anthony Gitter was a postdoctoral researcher there. We thank Microsoft for granting permission to release the code as open source under the Sample Code Exception and Paul Oka in particular for coordinating the release. We acknowledge all authors of Gitter et al 2014 for their role in the algorithm development.