Avx Windowed fm-index in Rust? Yes!
Generates an Fm-Index of a given biological sequence text (Fasta or Fastq file), and implements Locate() and Search() functionalities.
AWRY is a port of a state-of-the-art, fastest in its class FM-index implementation (https://doi.org/10.1186/s13015-021-00204-6). AWRY supports parallelized searching, with parallel_count() and parallel_locate() functions. As AWRY is powered by custom SIMD operations, the library currently supports x86_64 and aarch64 machines with AVX2 and ARM Neon.
AWRY supports DNA, RNA, and protein alphabets, and is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries. These alphabets are case-insensitive- A
and a
are considered the same symbol under the hood.
to build an fm-index, create an FmBuildArgs struct, and call FmIndex::new()
let buildArgs = FmBuildArgs {
input_file_src: "my_input.fa", //sets what the input file for the database text will be
suffix_array_output_src: None, //will build to a default location
suffix_array_compression_ratio: None, // ratio of suffix array compression, 8 by default
lookup_table_kmer_len: None, //by default, chooses reasonable table sizes (Dna=13, Amino=5)
alphabet: SymbolAlphabet::Nucleotide, //alphabet to build
max_query_len: None, //if set, only sort suffix array up to n positions
remove_intermediate_suffix_array_file: true,//deletes the suffix array file if true
}
let fm_index = FmIndex::new(&buildArgs);
If you only intend to use the count function, you can set the suffix array compression to a high value like 255 to reduce memory usage.
To search for a query, use to count_string and locate_string functions.
pub fn count_string(&self, query: &String) -> u64 {
...
}
/// Finds the locations in the original text of all isntances of the given query.
pub fn locate_string(&self, query: &String) -> Vec<u64> {
...
}
To find a large number of queries, searching can be parallelized easily with the parallel_count and parallel_locate functions
pub fn parallel_count(&self, queries: &Vec<String>) -> Vec<u64> {
...
}
// Finds the locations for each query in the query list. This function uses rayon's into_par_iter() for parallelism.
pub fn parallel_locate(&self, queries: &Vec<String>) -> Vec<Vec<u64>> {
...
}