
Experiments

This file is intended to serve as a guide for reproducing the results presented in our paper.

Contents

  • Run-time Environment
  • Initial Setup
  • Data sets
  • Measuring Performance
  • Scalability Experiments

Run-time Environment

We provide a summary of the hardware and software required for running the experiments. The run-time environment for the experiments (collected using the script collect_environment.sh) is available in hive_environment.log.
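If you want to record the same information for your own system, one way to do so (assuming collect_environment.sh writes its report to standard output, which we have not verified) is:

./collect_environment.sh > my_environment.log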

Hardware

We used the Hive cluster at Georgia Tech for our experiments. Each node in the cluster has a 2.7 GHz 24-core Intel Xeon 6226 processor and main memory of 192 GB or more. The nodes are connected via EDR (100 Gbps) InfiniBand. More details on the cluster resources can be found in the cluster documentation.

Software

We conducted our experiments on nodes running the Linux RHEL v7.6 operating system. We used the following versions of the compiler and other libraries for experimenting with ramBLe.

The purpose of all the libraries is explained in more detail in README.md.

Initial Setup

Cloning

ramBLe can be downloaded by cloning this GitHub repository along with all its submodules. To clone via SSH, execute:

git clone --recurse-submodules [email protected]:asrivast28/ramBLe.git

or, to clone via HTTPS:

git clone --recurse-submodules https://github.com/asrivast28/ramBLe.git
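If the repository was cloned without --recurse-submodules, the submodules can still be fetched afterwards using standard git commands:

git submodule update --init --recursive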

Similarly, the tree at a particular revision can be checked out along with the corresponding version of all the submodules by executing the following:

git checkout --recurse-submodules <tree-ish>

Building

The simplest way to build the code for measuring performance is to execute the following:

scons TIMER=1
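SCons also honors the standard -j flag for parallel builds, which can shorten compile times on the cluster's multi-core nodes; for example, a build with eight parallel jobs:

scons -j8 TIMER=1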

More information on building can be found in README.md.

Data sets

We used the following three gene expression data sets from two model organisms, Saccharomyces cerevisiae and Arabidopsis thaliana, for our experiments.

  • A data set created by Tchourine et al. from multiple RNA-seq studies of S. cerevisiae: 2,577 observations for 5,716 genes; can be downloaded from Zenodo.
  • An unpublished data set created from multiple microarray studies of A. thaliana: 16,838 observations for 18,380 genes; will be made available soon.
  • A manually curated subset of the above data set, focusing only on the studies of the development process in A. thaliana: 5,102 observations for 18,373 genes; can also be downloaded from Zenodo.
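The Zenodo-hosted files can also be fetched from the command line; a sketch using wget, where <record-id> is a placeholder for the actual Zenodo record of the corresponding data set:

wget https://zenodo.org/record/<record-id>/files/yeast_microarray_expression.tsv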

Discretization

We discretize the expression levels in the data sets using the methodology suggested by Friedman et al.
For example, the S. cerevisiae data set can be discretized (using discretize.py) as follows:

common/scripts/discretize.py -f yeast_microarray_expression.tsv -s '\t' -c -v -i -o yeast_microarray_expression_discretized.tsv
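The script's flags mirror the data-set layout options used by ramBLe itself (described under Parallel Execution below). An annotated version of the same command, with our reading of each flag given as comments:

# -f: input expression file; -s: field separator (tab)
# -c: observations arranged in columns; -v: variable names present; -i: observation identifiers present
# -o: output file for the discretized data
common/scripts/discretize.py -f yeast_microarray_expression.tsv -s '\t' -c -v -i \
    -o yeast_microarray_expression_discretized.tsv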

Measuring Performance

Sequential Execution

ramBLe can be used for learning a Bayesian network, using any of the supported algorithms, as described in README.md.
For example, in order to measure the performance of ramBLe in learning the network from the S. cerevisiae data set using the GS algorithm, the following can be executed:

./ramble -n 5716 -m 2577 -f yeast_microarray_expression_discretized.tsv -s '\t' -c -v -i -a gs -o yeast_network.dot -d
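Since the timings are printed to standard output (see Parallel Execution below), a simple way to keep them for later comparison is to duplicate the output into a log file; the log name here is our own choice:

./ramble -n 5716 -m 2577 -f yeast_microarray_expression_discretized.tsv -s '\t' -c -v -i -a gs -o yeast_network.dot -d | tee sequential_gs_yeast.log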

Running bnlearn

We have also provided an R script, ramble_bnlearn.R, for running bnlearn with the same arguments as ramBLe.
For example, the performance of bnlearn in learning the network from the S. cerevisiae data set using the GS algorithm can be measured by executing:

common/scripts/ramble_bnlearn.R -n 5716 -m 2577 -f yeast_microarray_expression_discretized.tsv -s '\t' -c -v -i -a gs -o yeast_network.dot -d

Parallel Execution

The performance of ramBLe when run in parallel using MPI can be measured as follows:

mpirun -np 16 ./ramble -a gs -f yeast_microarray_expression_discretized.tsv -n 5716 -m 2577 -s '\t' -c -v -i -o yeast_network.dot -d

The above command will run ramBLe using 16 MPI processes for the discretized yeast data set and learn a Bayesian network using the GS algorithm (a list of the algorithms supported by ramBLe for learning can be found in README.md). ramBLe also requires the following details about the layout of the data set in the file in order to read it correctly (a small example of this layout follows the list):

  • number of variables (-n),
  • number of observations (-m),
  • file delimiter (separator) (-s),
  • if the observations are arranged in columns in the file (-c; default mode assumes observations are arranged in rows),
  • if the first row/column of variables provides variable names (-v), and
  • if the first row/column of observations provides observation identifiers (-i).
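As an illustration, a tiny hypothetical file in the layout selected by -c -v -i would have one column per observation, variable names in the first column, and observation identifiers in the first row (fields separated by the delimiter given with -s, a tab here):

       obs1   obs2   obs3
gene1  0      1      1
gene2  2      0      1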

The learned network will be written to the file yeast_network.dot, and the command will print the time taken in learning the network, as well as the time taken by its different components, to standard output.
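On a cluster like Hive, runs spanning multiple nodes typically go through the job scheduler or a host file. A sketch using Open MPI-style flags, where hosts is a hypothetical file listing one node per line (flag names differ across MPI implementations, so consult your mpirun documentation):

mpirun -np 48 --hostfile hosts ./ramble -a gs -f yeast_microarray_expression_discretized.tsv \
    -n 5716 -m 2577 -s '\t' -c -v -i -o yeast_network.dot -d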

Scalability Experiments

We have provided a utility script, ramble_experiments.py, for running the scalability experiments using ramBLe. The script can be used for experimenting with data sets on different numbers of processors and recording the measured run-times in a CSV file.

As an example, the script can be used to run ramBLe 5 times on 16 processes with the discretized yeast data set, automatically parsing the measured run-times, by executing the following:

common/scripts/ramble_experiments.py -p 16 -r 5 -a gs -d yeast_microarray_expression_discretized.tsv -s '\t' -c -v -i --results performance_yeast_p16.csv

This will generate a file named performance_yeast_p16.csv with the run-times for the different runs. The runs can be customized using different arguments to the script, which can be seen by executing:

common/scripts/ramble_experiments.py -h
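To collect a full scaling curve, the script can be wrapped in a small shell loop. A sketch using only the flags shown above; the CSV file names are our own convention:

# run 5 repetitions each on 1, 2, 4, 8, and 16 processes
for p in 1 2 4 8 16; do
  common/scripts/ramble_experiments.py -p $p -r 5 -a gs \
      -d yeast_microarray_expression_discretized.tsv -s '\t' -c -v -i \
      --results performance_yeast_p${p}.csv
done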