suffixient-array
is a software providing two main functionalities: 1) construction of a suffixient set of smallest cardinality
git clone https://github.com/regindex/suffixient-array
cd suffixient-array
mkdir build
cd build
cmake ..
make
The \texttt{suffixient-array} tool requires
- A Linux 64-bit operating system.
- A modern Python 3 release version 3.7 or higher.
- A modern C++11\14 compiler such as
g++
version 4.9 or higher. - The zlib library installed on your system.
You can construct a smallest suffixient set of your input using the suffixient-set.py interface. This interface calls the C++ binaries that process the input text file and generate a suffixient set based on the chosen algorithm. Note that the input file must not contain the characters 0, 1, or 2, as they are reserved for internal use.
usage: suffixient-set.py [-h] [-a ALGORITHM] [-o OUTPUT] [-w WSIZE] [-p MOD] [-t THREADS] input
positional arguments:
input input file name
options:
-h, --help show this help message and exit
-a, --algorithm ALGORITHM
suffixient set construction algorithm (one-pass,linear,fm,PFP)
-o, --output OUTPUT output files basepath (def. input)
-w, --wsize WSIZE PFP: sliding window size (def. 10)
-p, --mod MOD PFP: hash modulus (def. 100)
-t, --threads THREADS
PFP: number of helper threads (def. 0)
You can use the suffixient-array-index.py interface to build and query the suffixient-array index (sA-index). The sA-index currently supports two types of queries: locating one occurrence and finding MEMs. Like the previous tool, it requires an input file that must not contain the characters 0, 1, or 2. In addition, if the opt-sA
index variant or the rlz|bitpacked-text
random access text oracles are selected the input must contain DNA characters (A, C, G, T) only.
All output files including the index files are stored using filenames prefixed by the input text filename.
usage: suffixient-array-index.py [-h] [-v INDEX_VARIANT] [-o ORACLE_VARIANT] [-b] [-e EXTRA_EF_SPACE] [-l] [-p PATTERN_FILE] input
positional arguments:
input input files basepath
options:
-h, --help show this help message and exit
-v, --index-variant INDEX_VARIANT
suffixient-array index variant: sA(def.)|opt-sA|PA
-o, --oracle-variant ORACLE_VARIANT
random access text oracle variant: lz77(def.)|rlz|bitpacked-text
-b, --build-index build suffixient-array index
-e, --extra-ef-space EXTRA_EF_SPACE
opt-sA: maximum allowed extra space (percentage) on top of the suffixient-array: Def. 30
-l, --locate-one-occ run locate one occurrence queries
-m, --find-mems run find MEMs queries
-p, --pattern-file PATTERN_FILE
file containing the patterns to locate (fasta format required)
You can construct the sA-index with different parameter combinations. Below is a list of the main configurations:
Best Performance (DNA) : For the fastest index configuration on DNA sequences (A, C, G, T characters only), use: -v opt-sA -o rlz
. The -v opt-sA
flag adds an Elias-Fano predecessor data structure to speed up binary search steps on the Suffixient Array, requiring only -e
flag. The -o rlz
option optimizes random access to the text by constructing a cache-efficient data structure based on the relative Lempel-Ziv (RLZ) parsing.
Best performance (any text) + minimal space: For an efficient sA-index that works with any ASCII text, use: -v sA -o lz77
. With this configuration the index will execute the classic binary search on the suffixient array. The random access text oracle will be implemented as a LZ77-compressed text. This configuration is also the one providing the minimal space usage for this implementation.
Best performance (DNA) + high space: If you are working with a small DNA texts and you are interested in the best performance you can construct the sA-index with the -v opt-sA -o bitpacked-text -e 100
configuration. The random access text oracle will be implemented as a simple bitpacked representation of the text stored in the internal memory, minimizing the random access times.
// Construct a smallest suffixient set using the PFP compressed-space one-pass algorithm
python3 build/suffixient-set.py -a PFP data/yeast.txt
// Construct the baseline suffixient-array index (sA-index)
python3 build/suffixient-array-index.py --build-index data/yeast.txt
// Run locate one occurrence and find MEMs queries
python3 build/suffixient-array-index.py --locate-one-occ -p data/yeast_patterns.fasta data/yeast.txt
python3 build/suffixient-array-index.py --find-mems -p data/paper_patterns.fasta data/paper_example.txt
The suffixient-set.py
interface generates two output files: prefix.suff
and prefix.suffixient-array.log
.
You can modify the prefix using the --output
flag.
prefix.suff
: Contains the suffixient array stored in binary format, using 5 bytes per element.prefix.suffixient-array.log
: Contains the standard output of the C++ binaries in plain text format.
The suffixient-array-index.py
interface generates a log file .suffixient-array.log
and three types of output files, all prefixed by input
:
-
Suffixient-array files:
Three files containing the binary serialized version of the suffixient-array search data structures:.sA
: binary search on the suffixient array (using-v sA
parameter)..opt_sA
: Elias-Fano optimized binary search on the sA (using-v opt-sA
parameter, DNA only)..pa
: binary search on the full Prefix Array (using-v PA
parameter).
-
Oracle files:
Two files containing the binary serialized versions of the random-access text oracles:.lz77
: LZ77-compressed text (using-o lz77
parameter)..rlz
: Relative LZ-compressed text (using-o rlz
parameter, DNA only)..bitpacked
: Bitpacked text using 2 bits per character (using-o rlz
parameter, DNA only).
-
Query files:
Output of thelocate-one-occurrence
(.occs
) andfind-mems
(.mems
) queries in text format:.occs
: stores pairs containing a position in the text and the length of the longest matching pattern prefix..mems
: stores triples containing a starting position in the pattern, the MEM length, and the corresponding position in the text.
You can download the datasets we used to test and evaluate the sA-index at the following link: https://github.com/regindex/suffixient-array/releases/download/datasets-bio-v1.0/biological.7z. You can download the Pizza&Chili datasets from: https://pizzachili.dcc.uchile.cl/repcorpus/real/.
Below is a list of external software resources used in this software.
[1] Davide Cenzato, Francisco Olivares, Nicola Prezza: On Computing the Smallest Suffixient Set. SPIRE 2024: 73-87 (go to the paper)
[2] Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza: Suffixient Arrays: a New Efficient Suffix Array Compression Technique. CoRR abs/2407.18753 (2025) (go to the paper)
If you use \texttt{suffixient-array} in an academic setting, please cite this work as follows:
@article{suffixient-array-2025,
author = {Davide Cenzato and
Lore Depuydt and
Travis Gagie and
Sung-Hwan Kim and
Giovanni Manzini and
Francisco Olivares and
Nicola Prezza},
title = {Suffixient Arrays: a New Efficient Suffix Array Compression Technique},
journal = {CoRR},
volume = {abs/2407.18753},
year = {2025},
doi = {10.48550/ARXIV.2407.18753}
}
@inproceedings{suffixient-set-2024,
author = {Davide Cenzato and
Francisco Olivares
and Prezza, Nicola},
title = {On Computing the Smallest Suffixient Set},
year = {2024},
doi = {10.1007/978-3-031-72200-4_6},
booktitle = {Proceedings of the 31st International Symposium on String Processing and Information Retrieval, SPIRE 2024},
pages = {73–87}
}
If you notice any bugs, please feel free to report them by opening a Git issue or by contacting us at davidecenzato Unive email.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon Europe research and innovation programme, project REGINDEX, grant agreement No. 101039208.