./install_fsm.sh
- Installs fsm-lite, if not already installed.
- Installs Eigen 3.3.4, if not already installed.
- Installs googletest, if not already installed.
Scripts that can be used to run C++ tools locally
./prepare_data.sh <zip-file> <data-name>
- Unzips the data from zip-file (give the full path of the zip file), preprocesses it for fsm-lite, and saves the files to data/data-name.
- Removes spades.fa files, and renames the remaining fasta files to f0001, f0002, ...
- Outputs the names of the original files as data_name_file_list into the data directory.
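The renaming step described above can be sketched in Python as follows. This is a minimal illustration, not the code of prepare_data.sh: the helper name rename_fasta_files is hypothetical, and it assumes that the files to drop are recognized by the suffix spades.fa and that the remaining files are numbered in sorted order.

```python
import os
import tempfile


def rename_fasta_files(data_dir, list_path):
    """Drop spades.fa files, rename the rest to f0001, f0002, ...,
    and write the original names to list_path (hypothetical helper)."""
    kept = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith("spades.fa"):
            # Assumption: spades files are identified by this suffix.
            os.remove(os.path.join(data_dir, name))
        else:
            kept.append(name)
    # Assumption: files are numbered in sorted order of their original names.
    for i, name in enumerate(kept, start=1):
        os.rename(os.path.join(data_dir, name),
                  os.path.join(data_dir, "f%04d" % i))
    # Line i of the list holds the original name of file number i.
    with open(list_path, "w") as out:
        out.write("\n".join(kept) + "\n")
    return kept


# Small demonstration in temporary directories.
demo_dir = tempfile.mkdtemp()
for demo_name in ["b.fa", "a.fa", "x_spades.fa"]:
    open(os.path.join(demo_dir, demo_name), "w").close()
demo_list = os.path.join(tempfile.mkdtemp(), "demo_file_list")
print(rename_fasta_files(demo_dir, demo_list))  # ['a.fa', 'b.fa']
print(sorted(os.listdir(demo_dir)))             # ['f0001', 'f0002']
```

Keeping the list of original names makes it possible to map a point index back to the genome file it came from.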
./read_kmers.sh <data-name> <n-points>
- Wrapper for fsm-lite; reads k-mer counts into the sparse matrix data/<data-name><n_points>/<data-name><n_points>.mat from the specified fasta files, which reside in the directory data/data-name.
- Argument data-name is the name of the data set, for example Ecol.
- Argument n_points controls how many of the first points are read, for example 250.
- Creates, for example, the file data/Ecol250/Ecol250.mat.
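The output naming convention can be written out explicitly. The function name kmer_matrix_path below is hypothetical and only reproduces the path pattern stated above.

```python
def kmer_matrix_path(data_name, n_points):
    """Build the path data/<data-name><n_points>/<data-name><n_points>.mat."""
    dataset = "%s%d" % (data_name, n_points)
    return "data/%s/%s.mat" % (dataset, dataset)


print(kmer_matrix_path("Ecol", 250))  # data/Ecol250/Ecol250.mat
```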
./writer.sh <data-name> <n_train> <n_test> <counts>
- Divides the data set into a training set with n_train points and a test set with n_test points, and writes these to the directory data/data-name/ as files train.bin and test.bin. Dimensions of the data set are written to dimensions.sh. Wrapper for binary_writer/binary_writer.
- Assumes that the data set with n_points points has been written by read_kmers.
- Argument counts controls what is written for each sample: the k-mer counts (counts=1), or only binary yes/no values (counts=0) telling whether a k-mer is in the sample or not.
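The effect of the counts argument and of the train/test division can be illustrated with a small Python sketch. It is not the code of binary_writer: the function names are hypothetical, and the sketch assumes that the first n_train samples form the training set and the next n_test samples form the test set.

```python
def to_output_values(kmer_counts, counts):
    """Return the values written for one sample.

    counts=1 keeps the k-mer counts; counts=0 keeps only
    presence (1) or absence (0) of each k-mer.
    """
    if counts == 1:
        return list(kmer_counts)
    return [1 if c > 0 else 0 for c in kmer_counts]


def split_train_test(samples, n_train, n_test):
    # Assumption: the first n_train samples form the training set and
    # the next n_test samples form the test set.
    if n_train + n_test > len(samples):
        raise ValueError("not enough samples for the requested split")
    return samples[:n_train], samples[n_train:n_train + n_test]


samples = [[0, 3, 1], [2, 0, 0], [0, 0, 5]]
binary = [to_output_values(s, counts=0) for s in samples]
train, test = split_train_test(binary, n_train=2, n_test=1)
print(train)  # [[0, 1, 1], [1, 0, 0]]
print(test)   # [[0, 0, 1]]
```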
./comparison.sh <data-name> <n> <postfix>
- Runs exact k-NN search and approximate k-NN search with the MRPT algorithm.
- Wrapper for exact/tester and mrpt/mrpt_comparison.
- Assumes that the parameters of the test run are saved in the file parameters/<data-name><n><postfix>.sh or parameters/<data-name><n>.sh (parameter <postfix> is optional).
- If the name of the data set (such as mnist) has no sample size, you can give an empty string ("") as the second argument n.
- Saves results into the directory results/<data-name><n><postfix> (or into parameters/<data-name><n>, respectively).
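How the three arguments combine into the parameter file name can be sketched as below. The function parameter_file is hypothetical and only mirrors the pattern given above; the postfix value _sparse in the example is made up for illustration.

```python
def parameter_file(data_name, n="", postfix=""):
    """Build parameters/<data-name><n><postfix>.sh; n and postfix may be empty."""
    return "parameters/%s%s%s.sh" % (data_name, n, postfix)


print(parameter_file("Ecol", "250"))             # parameters/Ecol250.sh
print(parameter_file("Ecol", "250", "_sparse"))  # parameters/Ecol250_sparse.sh
print(parameter_file("mnist", ""))               # parameters/mnist.sh
```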
./file_finder.sh <data_name> <k>
- Writes the nearest neighbors of the test set to files.
- k is the number of nearest neighbors written.
- Exact results should exist in the file results/data-name-exact/truth_k.
- Results (one file for each point of the test set) are written into the directory results/data-name-exact/file_names.
For SLURM scripts remember to set
- upper limit for memory, for example 5 gigabytes: #SBATCH --mem=5G
- upper limit for computing time, for example one hour: #SBATCH --time=01:00:00
Scripts that can be used to run the C++ tools in SLURM system are in the directory wrapper-SLURM:
prepare_data_slurm.sh <zip-file> <data-name>
- SLURM wrapper for prepare_data.sh.
read_kmers_slurm.sh <data-name> <n-points>
- Wrapper for fsm-lite, has the same arguments as read_kmers.sh.
- Set the variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test
writer_slurm.sh <data-set-name> <n_train> <n_test> <counts>
- Wrapper for binary_writer/binary_writer, same functionality as writer.sh.
- Set the variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test
- For the Ecol data set with 1500 points, #SBATCH --mem=150G and #SBATCH --time=02:00:00 are good values.
comparison_slurm.sh <data-name> <n> <postfix>
- Wrapper for exact/tester and mrpt/mrpt_comparison, same functionality as comparison.sh.
- Set the variable BASE_DIR to your local clone of this repo, for example BASE_DIR=/home/mydir/genome_test
python plot.py <k> results/<result-name1>/mrpt.txt results/<result-name2>/mrpt.txt
- Plots running time vs. accuracy for k-NN queries.
- Draws one line for each results file.
- Uses sparsity values (expected proportion of the non-zero components in the random vectors) in the legend.
- Configuration is done directly in the script:
  - n_test: test set size.
  - legend: whether to draw the legend.
  - save: whether the plot is saved into a file called file_name or shown on screen.
  - log: whether the scale of the y-axis is logarithmic or linear.
  - set_ylim: whether the limit of the y-axis is set to ylim, or all data points are shown.
  - legend_label: which attribute is used for the legend; current choices are sparsity, depth, and filename.
  - show_title: add the title given by the argument title to the plots.
  - exact_time: time of exact search for one query point.
get_mnist.sh
- Loads the MNIST data set into data/mnist/ for testing.
- Converts it into binary form (a float array saved in column-major form; the dimension of the data is d = 784).
- Loads the whole data set (data.bin), and divides it into a training set (train.bin) and a test set (test.bin); the test set has TEST_N = 100 points and the training set has 59900 points with this value of TEST_N.
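The column-major layout and the split can be illustrated with a short sketch that uses only the Python standard library. It makes several assumptions that are not confirmed by the text above: values are 32-bit floats, each point is one column of a d x n matrix (so the d values of a point are contiguous), and the first test_n points form the test set. The function name split_column_major is hypothetical. With d = 784, 60000 points, and TEST_N = 100, such a split leaves 60000 - 100 = 59900 training points.

```python
import array


def split_column_major(raw, d, test_n):
    """Split raw float32 bytes of a column-major d x n matrix into
    (train, test) lists of points (hypothetical helper)."""
    values = array.array("f")
    values.frombytes(raw)
    n = len(values) // d
    # In column-major order the d values of point j are contiguous.
    points = [values[j * d:(j + 1) * d].tolist() for j in range(n)]
    # Assumption: the first test_n points form the test set.
    return points[test_n:], points[:test_n]


# Tiny stand-in data: 5 points of dimension 3 instead of MNIST.
raw = array.array("f", range(15)).tobytes()
train, test = split_column_major(raw, d=3, test_n=2)
print(len(train), len(test))  # 3 2
print(test[1])                # [3.0, 4.0, 5.0]
```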
Link to the automatically generated documentation
