Skip to content

Commit

Permalink
Merge pull request #256 from pirovc/dev
Browse files Browse the repository at this point in the history
ganon v1.7.0
  • Loading branch information
pirovc authored Aug 1, 2023
2 parents 941874c + ee24008 commit 670d003
Show file tree
Hide file tree
Showing 33 changed files with 787 additions and 230 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
# =============================================================================

cmake_minimum_required( VERSION 3.10 FATAL_ERROR )
project( ganon VERSION 1.6.0 LANGUAGES CXX )
project( ganon VERSION 1.7.0 LANGUAGES CXX )

# -----------------------------------------------------------------------------
# build setup
Expand Down
42 changes: 32 additions & 10 deletions docs/classification.md
Original file line number Diff line number Diff line change
@@ -1,33 +1,46 @@
# Classification

`ganon classify` will match single and/or paired-end reads against one or [more databases](#multiple-and-hierarchical-classification), for example:
`ganon classify` will match single and/or paired-end sets of reads against one or [more databases](#multiple-and-hierarchical-classification).
By default, parameters are optimized for **taxonomic profiling**.

Example:

```bash
ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32
```

`ganon report` will be automatically executed after classification and a report will be created (`.tre`).
`ganon report` will be automatically executed after `ganon classify` and a [report will be created `.tre`](../outputfiles/#ganon-report).

ganon can generate both taxonomic profiling and binning results with `ganon classify` + `ganon report`. Please choose the parameters according to your application.
ganon can perform **taxonomic profiling** and/or **binning** (one tax. assignment for each read) at a taxonomic, strain or sequence level with `ganon classify` + `ganon report`. Some guidelines are listed below, please choose the parameters according to your application:

### Profiling

`ganon classify` is set-up by default to perform taxonomic profiling. It uses:

- strict `--rel-cutoff` and `--rel-filter` values (`0.75` and `0`, respectively)
- `--min-count 0.0001` (0.01%) on `ganon report` to exclude low abundant groups
- `--report-type abundance` on `ganon report` to generate taxonomic abundances, re-distributing read counts and correcting for genome sizes
- strict thresholds: `--rel-cutoff 0.75` and `--rel-filter 0`

`ganon report` will automatically run after classification with:

- `--min-count 0.005` (0.5%) to exclude low abundant taxa
- `--report-type abundance` to generate taxonomic abundances, re-distributing read counts and correcting for genome sizes

!!! Note
`ganon report` can be used independently from `ganon classify` with the output file `.rep`

### Binning

To achieve better results for binning reads to specific references, ganon can be configured with:
To achieve better results for taxonomic binning or sequence classification, ganon can be configured with:

- `--output-all` and `--output-lca` to write `.all` `.lca` files for binning results
- less strict `--rel-cutoff` and `--rel-filter` values (e.g. `0.25` and `0.1`, respectively)
- activate the `--reassign` on `ganon classify` (or use the `ganon reassign` procedure) to apply a EM algorithm, re-assigning reads with LCA matches to most probable target (`--level` the database was built)
- activate the `--reassign` to apply an EM algorithm, re-assigning reads with LCA matches to one most probable target (defined by `--level` in the build procedure). In this case, the `.all` file will be re-generated with one assignment per read.

!!! Note
`ganon reassign` can be used independently from `ganon classify` with the output file `.rep` and `.all`

!!! tip
Higher `--kmer-size` values on `ganon build` can also improve read binning sensitivity
Database parameters can also influence your results. Lower `--max-fp` (e.g. 0.1, 0.001) and higher `--kmer-size` (e.g. `23`, `27`) will improve sensitivity of your results at cost of a larger database and memory usage


## Multiple and Hierarchical classification

Expand Down Expand Up @@ -128,4 +141,13 @@ For databases built with `--window-size`, the relative values are not based on t

A different `cutoff` can be set for every database in a multiple or hierarchical database classification. A different `filter` can be set for every level of a hierarchical database classification.

Note that reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.
Note that reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.

### False positive of a query (--fpr-query)

ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by `--max-fp` when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a way smaller false positive. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The `--fpr-query` will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduce by the properties of the data structure.

By default, `--fpr-query 1e-5` is used and it is applied after the `--rel-cutoff` and `--rel-filter`. Values between `1e-3` and `1e-10` are recommended. This threshold becomes more important when building smaller databases with higher `--max-fp`, assuring that the false positive is under control. In this case however, you may notice a in sensitivity of your results.

!!! Note
The false positive of a query was first propose in: Solomon, Brad, and Carl Kingsford. “Fast Search of Thousands of Short-Read Sequencing Experiments.” Nature Biotechnology 34, no. 3 (2016): 1–6. https://doi.org/10.1038/nbt.3442.
40 changes: 26 additions & 14 deletions docs/custom_databases.md
Original file line number Diff line number Diff line change
Expand Up @@ -182,7 +182,7 @@ ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --nc
!!! note
The example above uses [genome_updater](https://github.com/pirovc/genome_updater) to download files

### BLAST databases (nt, env_nt, nt_prok, ...)
### BLAST databases (nt env_nt nt_prok ...)

BLAST databases. [Website](https://blast.ncbi.nlm.nih.gov/Blast.cgi)/[FTP](https://ftp.ncbi.nlm.nih.gov/blast/db/).

Expand All @@ -199,24 +199,36 @@ The example below extracts sequences and information from a BLAST db to build a
```bash
# Define BLAST db
db="16S_ribosomal_RNA"
threads=8

# Download BLAST db
wget -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}*.tar.gz"
# Using 12 threads: curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "${db}.*.tar.gz$" | xargs -P 12 -I{} wget -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/{}"
# Download BLAST db - re-run this command many times until all finish (no more output)
curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "^${db}\..*tar.gz$" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/{}"

# Merge and extract BLAST db files
cat "${db}"*.tar.gz | tar xvfz - > ex_files.txt
ex_file=$(head -n 1 ex_files.txt)
dbprefix="${ex_file%.*}"
# OPTIONAL Download and check MD5
wget -O - -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\.*tar.gz.md5" > "${db}.md5"
md5sum "${db}".*tar.gz > "${db}_downloaded.md5"
diff -s <(sort -k 2,2 "${db}.md5") <(sort -k 2,2 "${db}_downloaded.md5") # Should print "Files /dev/fd/xx and /dev/fd/xx are identical"

# Generate sequences from BLAST db
blastdbcmd -entry all -db "${dbprefix}" -out "${db}.fna"
# Extract BLAST db files, if successful, remove .tar.gz
ls "${db}"*.tar.gz | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > "${db}_extracted_files.txt"

# Generate ganon input file
blastdbcmd -entry all -db "${dbprefix}" -outfmt "%a %X" | awk -v file="${db}.fna" '{print file"\t"$1"\t"$2 }' > "${db}_ganon_input_file.tsv"
# Create folder to write sequence files (split into 10 sub-folders)
seq 0 9 | xargs -i mkdir -p "${db}"/{}

# This command extracts sequences from the blastdb and writes them into taxid specific files
# It also generates the --input-file for ganon
blastdbcmd -entry all -db "${db}" -outfmt "%a %T %s" | \
awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1"\n"$3 >> file; print file"\t"$2"\t"$2}' | \
sort | uniq > "${db}_ganon_input_file.tsv"

# Build ganon database
ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --input-target sequence --level leaves --threads 32
ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --level species --threads 12

# Delete extracted files and auxiliary files
cat "${db}_extracted_files.txt" | xargs rm
rm "${db}_extracted_files.txt" "${db}.md5" "${db}_downloaded.md5"
# Delete sequences
rm -rf "${db}" "${db}_ganon_input_file.tsv"
```

!!! note
Expand All @@ -234,7 +246,7 @@ ganon build-custom --input output_folder_genome_updater/version/ --input-recursi

### False positive and size (--max-fp, --filter-size)

ganon indices are based on bloom filters and can have false positive matches. This can be controlled with `--max-fp` parameter. The lower the `--max-fp`, the less chances of false positives matches on classification, but the larger the database size will be. For example, with `--max-fp 0.01` the database will be build so any target (defined by `--level`) will have 1 in a 100 change of reporting a false k-mer match. The false positive of the query (all k-mers of the reads) is higher but directly affected.
ganon indices are based on bloom filters and can have false positive matches. This can be controlled with `--max-fp` parameter. The lower the `--max-fp`, the less chances of false positives matches on classification, but the larger the database size will be. For example, with `--max-fp 0.01` the database will be build so any target (defined by `--level`) will have 1 in a 100 change of reporting a false k-mer match. [The false positive of the query](../classification/#false-positive-of-a-query-fpr-query) (all k-mers of a read) will be way lower, but directly affected by this value.

Alternatively, one can set a specific size for the final index with `--filter-size`. When using this option, please observe the theoretic false positive of the index reported at the end of the building process.

Expand Down
39 changes: 20 additions & 19 deletions docs/default_databases.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,15 +31,15 @@ NCBI RefSeq and GenBank repositories are common resources to obtain reference se

| RefSeq | # assemblies | Size (GB) * | |
|---|---|---|---|
| Complete | 295219 | 350 - 500 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs`</details> |
| Complete | 295219 | 160 - 500 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs`</details> |
| One assembly per species | 52779 | 40 - 98 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_rs_t1s`</details> |
| Complete genomes (higher quality) | 44121 | 19 - 64 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg`</details> |
| One assembly per species of complete genomes | 19713 | 8 - 27 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_rs_cg_t1s`</details> |
| One representative assembly per species | 18073 | 21 - 35 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-c 'representative genome'" --db-prefix abfv_rs_rg`</details> |
| One representative assembly per species | 18073 | 21 - 35 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg`</details> |

| GenBank | # assemblies | Size (GB) * | |
|---|---|---|---|
| Complete | 1595845 | | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb`</details> |
| Complete | 1595845 | - | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb`</details> |
| One assembly per species | 99505 | 91 - 420 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_gb_t1s`</details> |
| Complete genomes (higher quality) | 92917 | 24 - 132 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg`</details> |
| One assembly per species of complete genomes | 34497 | 10 - 34 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_gb_cg_t1s`</details> |
Expand Down Expand Up @@ -124,18 +124,27 @@ genome_updater.sh -e assembly_summary.txt -f "genomic.fna.gz" -o recovered_files

## Reducing database size

### False positive

A higher `--max-fp` value will generate a smaller database but with a higher number of false positive matches on classification. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size). Values between `0.001` (0.1%) and `0.3` (30%) are generally used.

!!! hint
When using higher `--max-fp` values, more false positive results may be generated. This can be filtered with the `--fpr-query` parameter in `ganon classify`

### Fixed size

A fixed size for the database filter can be defined with `--filter-size`. The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size).

### HIBF

The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the Interleaved Bloom Filter (IBF) and provides *smaller* databases with *faster* query times ([pre-print](https://www.biorxiv.org/content/10.1101/2022.08.01.502266v1)). However, the HIBF takes longer to build and has less flexibility regarding size control. The HIBF can be generated in `ganon build` and `ganon build-custom` with the `--hibf` parameter.
The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates *smaller* databases with *faster* query times ([article](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02971-4)). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. The HIBF can be generated in `ganon build` and `ganon build-custom` with the `--hibf` parameter.

**This is the best method to reduce large database sizes and to achieve faster classification speed.**
Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive when using the HIBF. A recommended value for high sensitivity is 1% (`--hibf --max-fp 0.001`).

!!! hint
- For larger reference sets, huge amount of reads to query or production level analysis -> HIBF
- For quick build and analysis with smaller data -> IBF (default)
- For large unbalanced reference sets, lots of reads to query -> HIBF
- For quick build and more flexibility -> IBF (default)

!!! warning
[raptor (v3.0.0)](https://github.com/seqan/raptor/releases/tag/raptor-v3.0.0) has to be installed to build databases with `--hibf`

### Top assemblies

Expand All @@ -146,14 +155,6 @@ RefSeq and GenBank are highly biased toward some few organisms. This means that
- `ganon build --genome-updater "-A 'species:1'"` will select one assembly for each species
- `ganon build --genome-updater "-A 'genus:3'"` will select three assemblies for each genus

### False positive

A higher `--max-fp` value will generate a smaller database but with a higher number of false positive matches on classification. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size). Values between `0.01` and `0.3` are generally used.

### Fixed size

A fixed size for the database filter can be defined with `--filter-size`. The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. [More details](../custom_databases/#false-positive-and-size-max-fp-filter-size).

### Mode

`--mode` offers 5 different categories to build a database controlling the trade-off between size and classification speed.
Expand All @@ -163,7 +164,7 @@ A fixed size for the database filter can be defined with `--filter-size`. The sm
- `fast` or `fastest`: create bigger databases with faster classification speed

!!! Warning
If `--filter-size` is used, `smaller` and `smallest` to the false positive and not to the database size which is fixed.
If `--filter-size` is used, `smaller` and `smallest` refers to the false positive and not to the database size (which is fixed).

### Split databases

Expand All @@ -183,7 +184,7 @@ Define how much unique information is stored in the database. [More details](../

### Example

Besides the huge benefits of using [HIBF](#hibf) and specific sub-sets of big repositories shown on the [default databases table](#commonly-used-sub-sets), examples of other reduction strategies (without `--hibf`) can be seen below:
Besides the benefits of using [HIBF](#hibf) and specific sub-sets of big repositories shown on the [default databases table](#commonly-used-sub-sets), examples of other reduction strategies (without `--hibf`) can be seen below:

*RefSeq archaeal complete genomes from 20230505*

Expand Down
Loading

0 comments on commit 670d003

Please sign in to comment.