Skip to content

Latest commit

 

History

History
98 lines (80 loc) · 3.63 KB

README.md

File metadata and controls

98 lines (80 loc) · 3.63 KB

k-mer counting using Chapel

Directory structure

src
├── kmer
│   ├── benchmarking.chpl
│   └── kmer.chpl
└── wordcount
    ├── benchmarking.chpl
    ├── word_counting_parallel.chpl
    └── word_counting_serial.chpl

In the wordcount directory, we are trying to count how many times each unique word of a random generated text (e.g. Lorem Ipsum) is appeared within the text. The goal is to be familiar with Chapel and the Map standard module. We implemented a naive version, without considering performance.

In the kmer directory, in analogous way with the wordcount, we are trying to implement the kmer-counting problem. Since it is computationally demanding for large sequences, we strive to find an efficient parallel solution. So far we have implemented a standalone parallel iterator taking into account memory limitations by using yield statements. Currently, we exploit only shared memory parallel scheme without any distribution technique. There are also many different approaches on this. Jellyfish, a kmer-counting project, leverage pthreads and an efficient lock-free mechanism (compare and swap named as CAS) to avoid the heavy usage of mutexes that kill performance. It should be noted that the Map standard module of Chapel is using also locks. As an optimization, a lock-free structure can be potentially implemented in Chapel with an efficient hash table.

The input of this program is intended to be files with the standardized fasta format. In order to parse these multiline files, we need to extract the valuable information of the genome sequence neglecting the newline characters. To create such a friendly, linear format, we use awk. This helped us to balance the work between threads while not missing any possible kmer when changing lines.

To linearise the fasta file, then:

make fasta FASTA=<file.fasta>

This will create a linear.fasta ready to be parsed for our code.

Expected fasta files should have the following structure. Until now, we don't support multiple fasta format:

>text
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG..
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT..
.
.
.

Usage

For the kmer solution:

git clone https://github.com/thodkatz/chapel-kmer.git
cd chapel-kmer/
make fasta FASTA=<file.fasta>
make kmer
./bin/kmer --k=<number of kmers>

Optionally we can override the linear.fasta generated by make fasta when running the program with ./bin/kmer --pathFasta=<new path file>.

The kmer table is generated as a text file with different versions, each one representing a different implementation but with the same output format. Currently we have one serial and one parallel version. For validation, we can check the differences between these generated files. The serial ones is used mainly for validation.

Output format

(file.fasta)

>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTT
(kmer_parallel.txt)

kmer:count
CGA:2
GTT:1
AGT:2
GAA:1
TGG:1
GAG:1
TTG:1
TTC:1
GCC:1
CTT:2
GTC:1
CCG:1
AAG:1
TTT:1
AAA:1
TCT:1
GGG:1
TCG:1
GGC:1

Related kmer-projects