Experiments

This file is intended to serve as a guide for reproducing the results presented in our paper.


Artifact Description

We have developed a parallel algorithm for learning module networks, based on the sequential Lemon-Tree algorithm by Bonnet et al. In the experiments described below, we compare the performance of the original Lemon-Tree implementation with that of our software and also measure the parallel performance of our algorithm. We developed two main artifacts for this purpose:

  1. ParsiMoNe
    This software implements the original sequential algorithm and our parallel algorithm for generating module networks.
  2. Modified Lemon-Tree
    In order to compare the performance of ParsiMoNe with that of Lemon-Tree, we modified Lemon-Tree to use the same PRNG as ParsiMoNe and also applied to Lemon-Tree some of the other optimizations that we had implemented in ParsiMoNe. These changes were made so that the two implementations produce the same output for the same input data set and parameters.

Run-time Environment

We provide a summary of the hardware and the software required for running the experiments. The details of the run-time environment that we used for our experiments (collected using the file collect_environment.sh) are available in phoenix_environment.log.
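
The environment log can be regenerated on a compute node with something like the following (a minimal sketch; it assumes that collect_environment.sh writes its report to standard output):

./collect_environment.sh > phoenix_environment.log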

Hardware

We used the Phoenix cluster at Georgia Tech for our experiments. Each node in the cluster has a 2.7 GHz 24-core Intel Xeon Gold 6226 processor and main memory of 192 GB or more. The nodes are connected via HDR100 (100 Gbps) InfiniBand. More details on the cluster resources can be found in the cluster documentation.

Software

We conducted our experiments on nodes running the RHEL v7.6 Linux operating system. We used the following versions of the compiler and other libraries for experimenting with ParsiMoNe.

The purpose of all the libraries is explained in more detail in README.md. We also used OpenJDK v1.8.0_262 and the corresponding server VM for executing the original Lemon-Tree implementation.
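
The Java runtime that will be picked up for running Lemon-Tree can be checked, for example, with:

java -version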

Initial Setup

Cloning and Building

ParsiMoNe

ParsiMoNe can be downloaded by cloning this GitHub repo, along with all of its submodules, by executing the following:

git clone --recurse-submodules git@github.com:asrivast28/ParsiMoNe.git

Once all the requirements have been installed, the executable for measuring performance can be built by executing the following:

scons TIMER=1

More information on the different build options can be found in README.md.

Lemon-Tree

The modified Lemon-Tree can be downloaded and built by executing the following:

git clone git@github.com:asrivast28/lemon-tree.git
cd lemon-tree/LemonTree
git checkout -b MatchOutput origin/MatchOutput
ant jar

The modified Lemon-Tree uses the random number generators from the TRNG library, made available to Java via the Java Native Interface (JNI). The corresponding library needs to be built separately by executing the following:

cd lemon-tree/LemonTree/src/lemontree/utils
make
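
Depending on how the JVM locates native libraries on a given system, the shared library built above may need to be added to the library search path before running Lemon-Tree; a hedged sketch (the exact path and mechanism may differ):

export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH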

Data sets

We used the following two gene expression data sets from two model organisms for our experiments: Saccharomyces cerevisiae, or Baker's yeast, and the Arabidopsis thaliana plant.

  • A data set created by Tchourine et al. from multiple RNA-seq studies of S. cerevisiae. The data set contains 2,577 observations for 5,716 genes and can be downloaded from Zenodo.
  • A manually curated data set created from multiple microarray studies of A. thaliana, focusing only on studies of the plant's development process. The data set contains 5,102 observations for 18,373 genes and can also be downloaded from Zenodo.

Validating Setup

The experimental setup can be validated using smaller data sets as described below.

Generating Smaller Data sets

Smaller data sets can be generated from the complete yeast or A. thaliana data set for validation purposes, using common Linux command-line utilities. For example, a data set with the first 100 observations for the first 100 variables in the yeast data set can be obtained by executing the following:

n=100;m=100;head -$(($n+1)) yeast_microarray_expression.tsv | cut -d $'\t' -f 1-$(($m+1)) > yeast_n${n}_m${m}.tsv

The values of n and m in the command can be varied to generate multiple smaller data sets.
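
For example, several smaller data sets can be generated at once with a loop (a sketch with illustrative values of n and m, assuming yeast_microarray_expression.tsv is in the current directory):

for n in 100 500 1000; do
  for m in 100 500 1000; do
    head -$(($n+1)) yeast_microarray_expression.tsv | cut -d $'\t' -f 1-$(($m+1)) > yeast_n${n}_m${m}.tsv
  done
done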

Running ParsiMoNe

Using the smaller data set generated above, a module network can be generated sequentially with ParsiMoNe by executing the following:

./parsimone -f yeast_n100_m100.tsv -n 100 -m 100 -c -v -i -s $'\t' -g experiment_configs_seed0.json -o sequential_parsimone

This will create a directory called sequential_parsimone containing all the output files. Similarly, ParsiMoNe can be executed in parallel to generate module networks as follows:

mpirun -np 4 ./parsimone -f yeast_n100_m100.tsv -n 100 -m 100 -c -v -i -s $'\t' -g experiment_configs_seed0.json -o parallel_parsimone

The generated networks are expected to be the same, irrespective of the number of processors used. This can be verified using the script compare_lemontree.py, which we created for this purpose, by executing the following:

common/scripts/compare_lemontree.py sequential_parsimone parallel_parsimone

Running Lemon-Tree

Lemon-Tree, built as described above, can be used to generate module networks via the script parsimone_lemontree.py. This script accepts the same arguments as ParsiMoNe and can be executed as:

common/scripts/parsimone_lemontree.py -f yeast_n100_m100.tsv -n 100 -m 100 -c -v -i -s $'\t' -g experiment_configs_seed0.json -o lemontree_parsimone

Again, given the same input data set and parameters, we expect Lemon-Tree to generate the same network as ParsiMoNe. This can be verified as:

common/scripts/compare_lemontree.py sequential_parsimone lemontree_parsimone

Measuring Performance

We provide a Python script, parsimone_experiments.py, for easily experimenting with ParsiMoNe as well as Lemon-Tree and measuring their performance. The commands below expect the script to be executed from the ParsiMoNe directory cloned above, with the two data sets available at the following paths within that directory: data/yeast/yeast_microarray_expression.tsv and data/athaliana/athaliana_development_exp.tsv.
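
Assuming the two files have been downloaded from Zenodo into the ParsiMoNe directory under the file names shown above, they can be placed at the expected paths as follows:

mkdir -p data/yeast data/athaliana
mv yeast_microarray_expression.tsv data/yeast/
mv athaliana_development_exp.tsv data/athaliana/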

Sequential Performance

We compared the sequential performance of ParsiMoNe with that of Lemon-Tree for learning module networks.
We obtained the run-times of our implementation for 15 different subsamples of the yeast data set using three different random seeds by executing the following:

for seed in {0,1,2}; do
  common/scripts/parsimone_experiments.py -r 1 -p 1 -d yeast -n 1000 2000 3000 -m 125 250 500 750 1000 -g "\-g experiment_configs_seed${seed}.json" --results ours_yeast_sequential_seed${seed}.csv -b . -s . --output-suffix _seed${seed}
done

The above run is expected to take about six days and will generate three CSV files with the run-times of different components in our implementation: ours_yeast_sequential_seed0.csv, ours_yeast_sequential_seed1.csv, and ours_yeast_sequential_seed2.csv.

Using the script parsimone_lemontree.py described earlier, the sequential performance of Lemon-Tree for the 15 subsampled data sets can be measured, similarly to that of our implementation, by executing the following:

for seed in {0,1,2}; do
  common/scripts/parsimone_experiments.py -r 1 -p 1 -d yeast -n 1000 2000 3000 -m 125 250 500 750 1000 -g "\-g experiment_configs_seed${seed}.json" --results lemontree_yeast_sequential_seed${seed}.csv -b . -s . --output-suffix _seed${seed} --lemontree
done

Again, this will generate three CSV files with the run-times of different Lemon-Tree components: lemontree_yeast_sequential_seed0.csv, lemontree_yeast_sequential_seed1.csv, and lemontree_yeast_sequential_seed2.csv. This run is expected to take about 21 days.

Then, the outputs generated by Lemon-Tree and our implementation in the above runs for different subsampled data sets and different PRNG seeds can be compared, using compare_lemontree.py, by executing the following:

for n in {1000,2000,3000}; do
  for m in {125,250,500,750,1000}; do
    for seed in {0,1,2}; do
      common/scripts/compare_lemontree.py yeast_n${n}_m${m}_lemontree_seed${seed} yeast_n${n}_m${m}_seed${seed}
    done
  done
done

Parallel Performance

Smaller Data sets

First, we measured the strong scaling parallel performance of ParsiMoNe for learning the network for all the variables in the yeast data set using subsets of the observations in the data set. The following can be executed for this purpose:

for seed in {0,1,2}; do
  common/scripts/parsimone_experiments.py -r 1 --ppn 24 -p 1 2 4 8 16 32 64 128 256 512 1024 -d yeast -n 5716 -m 125 250 500 750 1000 -g "\-g experiment_configs_seed${seed}.json" --results parallel_yeast_small_seed${seed}.csv -b . -s . --output-suffix _seed${seed}
done

As with the sequential experiments, this will generate seed-specific CSV files (parallel_yeast_small_seed0.csv, etc.) with the run-times of ParsiMoNe when using different numbers of cores. All these runs are expected to take about 24 days.

This command automatically compares the output generated when using different numbers of processors with the first output generated for every combination of n and m. Therefore, in this case, it compares against the output generated with p=1 and errors out in case of any mismatches.

Big Data sets

Then, we measured the parallel run-times for learning module networks from the two complete data sets.
For the yeast data set, we conducted strong scaling experiments by executing the following:

for seed in {0,1,2}; do
  common/scripts/parsimone_experiments.py -r 1 --ppn 24 -p 4 8 16 32 64 128 256 512 1024 2048 4096 -d yeast -g "\-g experiment_configs_seed${seed}.json" --results parallel_yeast_complete_seed${seed}.csv -b . -s . --output-suffix _seed${seed}
done

This run is expected to take approximately 26 days.

Since learning a network from the A. thaliana development data set requires a lot of time, we experimented with this data set only on larger numbers of cores, as follows:

for seed in {0,1,2}; do
  common/scripts/parsimone_experiments.py -r 1 --ppn 24 -p 1024 2048 4096 -d development -g "\-g experiment_configs_seed${seed}.json" --results parallel_development_complete_seed${seed}.csv -b . -s . --output-suffix _seed${seed}
done

This run is expected to take approximately 13 days.