Merge pull request #256 from pirovc/dev

ganon v1.7.0
pirovc · Aug 1, 2023 · 670d003 · 670d003
2 parents 941874c + ee24008
commit 670d003
Show file tree

Hide file tree

Showing 33 changed files with 787 additions and 230 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -3,7 +3,7 @@
 # =============================================================================
 
 cmake_minimum_required( VERSION 3.10 FATAL_ERROR )
-project( ganon VERSION 1.6.0 LANGUAGES CXX )
+project( ganon VERSION 1.7.0 LANGUAGES CXX )
 
 # -----------------------------------------------------------------------------
 # build setup

diff --git a/docs/classification.md b/docs/classification.md
@@ -1,33 +1,46 @@
 # Classification
 
-`ganon classify` will match single and/or paired-end reads against one or [more databases](#multiple-and-hierarchical-classification), for example:
+`ganon classify` will match single and/or paired-end sets of reads against one or [more databases](#multiple-and-hierarchical-classification). 
+By default, parameters are optimized for **taxonomic profiling**. 
+
+Example:
 
 ```bash
 ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32
 ```
 
-`ganon report` will be automatically executed after classification and a report will be created (`.tre`).
+`ganon report` will be automatically executed after `ganon classify` and a [report will be created `.tre`](../outputfiles/#ganon-report).
 
-ganon can generate both taxonomic profiling and binning results with `ganon classify` + `ganon report`. Please choose the parameters according to your application.
+ganon can perform **taxonomic profiling** and/or **binning** (one tax. assignment for each read) at a taxonomic, strain or sequence level with `ganon classify` + `ganon report`. Some guidelines are listed below, please choose the parameters according to your application:
 
 ### Profiling
 
 `ganon classify` is set-up by default to perform taxonomic profiling. It uses:
 
- - strict `--rel-cutoff` and `--rel-filter` values (`0.75` and `0`, respectively)
- - `--min-count 0.0001` (0.01%) on `ganon report` to exclude low abundant groups
- - `--report-type abundance` on `ganon report` to generate taxonomic abundances, re-distributing read counts and correcting for genome sizes
+ - strict thresholds: `--rel-cutoff 0.75` and `--rel-filter 0`
+
+`ganon report` will automatically run after classification with:
+
+ - `--min-count 0.005` (0.5%) to exclude low abundant taxa
+ - `--report-type abundance` to generate taxonomic abundances, re-distributing read counts and correcting for genome sizes
+
+!!! Note
+    `ganon report` can be used independently from `ganon classify` with the output file `.rep`
 
 ### Binning
 
-To achieve better results for binning reads to specific references, ganon can be configured with:
+To achieve better results for taxonomic binning or sequence classification, ganon can be configured with:
 
  - `--output-all` and `--output-lca` to write `.all` `.lca` files for binning results
  - less strict `--rel-cutoff` and `--rel-filter` values (e.g. `0.25` and `0.1`, respectively)
- - activate the `--reassign` on `ganon classify` (or use the `ganon reassign` procedure) to apply a EM algorithm, re-assigning reads with LCA matches to most probable target (`--level` the database was built)
+ - activate the `--reassign` to apply an EM algorithm, re-assigning reads with LCA matches to one most probable target (defined by `--level` in the build procedure). In this case, the `.all` file will be re-generated with one assignment per read.
+
+!!! Note
+    `ganon reassign` can be used independently from `ganon classify` with the output file `.rep` and `.all`
 
 !!! tip
-    Higher `--kmer-size` values on `ganon build` can also improve read binning sensitivity
+    Database parameters can also influence your results. Lower `--max-fp` (e.g. 0.1, 0.001) and higher `--kmer-size` (e.g. `23`, `27`) will improve sensitivity of your results at cost of a larger database and memory usage
+
 
 ## Multiple and Hierarchical classification
 
@@ -128,4 +141,13 @@ For databases built with `--window-size`, the relative values are not based on t
 
 A different `cutoff` can be set for every database in a multiple or hierarchical database classification. A different `filter` can be set for every level of a hierarchical database classification.
 
-Note that reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.
+Note that reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.
+
+### False positive of a query (--fpr-query)
+
+ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by `--max-fp` when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a way smaller false positive. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The `--fpr-query` will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduce by the properties of the data structure. 
+
+By default, `--fpr-query 1e-5` is used and it is applied after the `--rel-cutoff` and `--rel-filter`. Values between `1e-3` and `1e-10` are recommended. This threshold becomes more important when building smaller databases with higher `--max-fp`, assuring that the false positive is under control. In this case however, you may notice a in sensitivity of your results.
+
+!!! Note
+    The false positive of a query was first propose in: Solomon, Brad, and Carl Kingsford. “Fast Search of Thousands of Short-Read Sequencing Experiments.” Nature Biotechnology 34, no. 3 (2016): 1–6. https://doi.org/10.1038/nbt.3442.
diff --git a/docs/custom_databases.md b/docs/custom_databases.md
@@ -182,7 +182,7 @@ ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --nc
 !!! note
     The example above uses [genome_updater](https://github.com/pirovc/genome_updater) to download files
 
-### BLAST databases (nt, env_nt, nt_prok, ...)
+### BLAST databases (nt env_nt nt_prok ...)
 
 BLAST databases. [Website](https://blast.ncbi.nlm.nih.gov/Blast.cgi)/[FTP](https://ftp.ncbi.nlm.nih.gov/blast/db/).
 
@@ -199,24 +199,36 @@ The example below extracts sequences and information from a BLAST db to build a
 ```bash
 # Define BLAST db
 db="16S_ribosomal_RNA"
+threads=8
 
-# Download BLAST db
-wget -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}*.tar.gz"
-# Using 12 threads: curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "${db}.*.tar.gz$" | xargs -P 12 -I{} wget -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/{}"
+# Download BLAST db - re-run this command many times until all finish (no more output)
+curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "^${db}\..*tar.gz$" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/{}"
 
-# Merge and extract BLAST db files
-cat "${db}"*.tar.gz | tar xvfz - > ex_files.txt
-ex_file=$(head -n 1 ex_files.txt)
-dbprefix="${ex_file%.*}"
+# OPTIONAL Download and check MD5
+wget -O - -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\.*tar.gz.md5" > "${db}.md5"
+md5sum "${db}".*tar.gz > "${db}_downloaded.md5"
+diff -s <(sort -k 2,2 "${db}.md5") <(sort -k 2,2 "${db}_downloaded.md5")  # Should print "Files /dev/fd/xx and /dev/fd/xx are identical"
 
-# Generate sequences from BLAST db
-blastdbcmd -entry all -db "${dbprefix}" -out "${db}.fna"
+# Extract BLAST db files, if successful, remove .tar.gz
+ls "${db}"*.tar.gz | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > "${db}_extracted_files.txt"
 
-# Generate ganon input file
-blastdbcmd -entry all -db "${dbprefix}" -outfmt "%a %X" | awk -v file="${db}.fna" '{print file"\t"$1"\t"$2 }' > "${db}_ganon_input_file.tsv"
+# Create folder to write sequence files (split into 10 sub-folders)
+seq 0 9 | xargs -i mkdir -p "${db}"/{}
+
+# This command extracts sequences from the blastdb and writes them into taxid specific files
+# It also generates the --input-file for ganon
+blastdbcmd -entry all -db "${db}" -outfmt "%a %T %s" | \
+awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1"\n"$3 >> file; print file"\t"$2"\t"$2}' | \
+sort | uniq > "${db}_ganon_input_file.tsv"
 
 # Build ganon database
-ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --input-target sequence --level leaves --threads 32
+ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --level species --threads 12
+
+# Delete extracted files and auxiliary files
+cat "${db}_extracted_files.txt" | xargs rm
+rm "${db}_extracted_files.txt" "${db}.md5" "${db}_downloaded.md5"
+# Delete sequences
+rm -rf "${db}" "${db}_ganon_input_file.tsv"
 ```
 
 !!! note
@@ -234,7 +246,7 @@ ganon build-custom --input output_folder_genome_updater/version/ --input-recursi
 
 ### False positive and size (--max-fp, --filter-size)
 
-ganon indices are based on bloom filters and can have false positive matches. This can be controlled with `--max-fp` parameter. The lower the `--max-fp`, the less chances of false positives matches on classification, but the larger the database size will be. For example, with `--max-fp 0.01` the database will be build so any target (defined by `--level`) will have 1 in a 100 change of reporting a false k-mer match. The false positive of the query (all k-mers of the reads) is higher but directly affected.
+ganon indices are based on bloom filters and can have false positive matches. This can be controlled with `--max-fp` parameter. The lower the `--max-fp`, the less chances of false positives matches on classification, but the larger the database size will be. For example, with `--max-fp 0.01` the database will be build so any target (defined by `--level`) will have 1 in a 100 change of reporting a false k-mer match. [The false positive of the query](../classification/#false-positive-of-a-query-fpr-query) (all k-mers of a read) will be way lower, but directly affected by this value.
 
 Alternatively, one can set a specific size for the final index with `--filter-size`. When using this option, please observe the theoretic false positive of the index reported at the end of the building process.
 

diff --git a/docs/default_databases.md b/docs/default_databases.md
@@ -31,15 +31,15 @@ NCBI RefSeq and GenBank repositories are common resources to obtain reference se
 
 | RefSeq | # assemblies | Size (GB) * |  |
 |---|---|---|---|
-| Complete | 295219 | 350 - 500 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs`</details> |
+| Complete | 295219 | 160 - 500 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs`</details> |
 | One assembly per species | 52779 | 40 - 98 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_rs_t1s`</details> |
 | Complete genomes (higher quality) | 44121 | 19 - 64 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg`</details> |
 | One assembly per species of complete genomes | 19713 | 8 - 27 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_rs_cg_t1s`</details> |
-| One representative assembly per species | 18073 | 21 - 35 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-c 'representative genome'" --db-prefix abfv_rs_rg`</details> |
+| One representative assembly per species | 18073 | 21 - 35 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg`</details> |
 
 | GenBank | # assemblies | Size (GB) * |  |
 |---|---|---|---|
-| Complete | 1595845 | | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb`</details> |
+| Complete | 1595845 | - | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb`</details> |
 | One assembly per species | 99505 | 91 - 420 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_gb_t1s`</details> |
 | Complete genomes (higher quality) | 92917 | 24 - 132 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg`</details> |
 | One assembly per species of complete genomes | 34497 | 10 - 34 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_gb_cg_t1s`</details> |
@@ -124,18 +124,27 @@ genome_updater.sh -e assembly_summary.txt -f "genomic.fna.gz" -o recovered_files
 
 ## Reducing database size
 
+### False positive
+
+A higher `--max-fp` value will generate a smaller database but with a higher number of false positive matches on classification. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size). Values between `0.001` (0.1%) and `0.3` (30%) are generally used. 
+
+!!! hint
+    When using higher `--max-fp` values, more false positive results may be generated. This can be filtered with the `--fpr-query` parameter in `ganon classify` 
+
+### Fixed size
+
+A fixed size for the database filter can be defined with `--filter-size`. The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size).
+
 ### HIBF
 
-The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the Interleaved Bloom Filter (IBF) and provides *smaller* databases with *faster* query times ([pre-print](https://www.biorxiv.org/content/10.1101/2022.08.01.502266v1)). However, the HIBF takes longer to build and has less flexibility regarding size control. The HIBF can be generated in `ganon build` and `ganon build-custom` with the `--hibf` parameter.
+The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates *smaller* databases with *faster* query times ([article](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02971-4)). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. The HIBF can be generated in `ganon build` and `ganon build-custom` with the `--hibf` parameter.
 
-**This is the best method to reduce large database sizes and to achieve faster classification speed.**
+Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive when using the HIBF. A recommended value for high sensitivity is 1% (`--hibf --max-fp 0.001`).
 
 !!! hint
-    - For larger reference sets, huge amount of reads to query or production level analysis -> HIBF
-    - For quick build and analysis with smaller data -> IBF (default)
+    - For large unbalanced reference sets, lots of reads to query -> HIBF
+    - For quick build and more flexibility -> IBF (default)
 
-!!! warning
-    [raptor (v3.0.0)](https://github.com/seqan/raptor/releases/tag/raptor-v3.0.0) has to be installed to build databases with `--hibf`
 
 ### Top assemblies
 
@@ -146,14 +155,6 @@ RefSeq and GenBank are highly biased toward some few organisms. This means that
     - `ganon build --genome-updater "-A 'species:1'"` will select one assembly for each species
     - `ganon build --genome-updater "-A 'genus:3'"` will select three assemblies for each genus
 
-### False positive
-
-A higher `--max-fp` value will generate a smaller database but with a higher number of false positive matches on classification. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size). Values between `0.01` and `0.3` are generally used.
-
-### Fixed size
-
-A fixed size for the database filter can be defined with `--filter-size`. The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size).
-
 ### Mode
 
 `--mode` offers 5 different categories to build a database controlling the trade-off between size and classification speed.
@@ -163,7 +164,7 @@ A fixed size for the database filter can be defined with `--filter-size`. The sm
 - `fast` or `fastest`: create bigger databases with faster classification speed
 
 !!! Warning
-    If `--filter-size` is used, `smaller` and `smallest` to the false positive and not to the database size which is fixed.
+    If `--filter-size` is used, `smaller` and `smallest` refers to the false positive and not to the database size (which is fixed). 
 
 ### Split databases
 
@@ -183,7 +184,7 @@ Define how much unique information is stored in the database. [More details](../
 
 ### Example
 
-Besides the huge benefits of using [HIBF](#hibf) and specific sub-sets of big repositories shown on the [default databases table](#commonly-used-sub-sets), examples of other reduction strategies (without `--hibf`) can be seen below:
+Besides the benefits of using [HIBF](#hibf) and specific sub-sets of big repositories shown on the [default databases table](#commonly-used-sub-sets), examples of other reduction strategies (without `--hibf`) can be seen below:
 
 *RefSeq archaeal complete genomes from 20230505*