FOI-Bioinformatics/taxbencher

Introduction

FOI-Bioinformatics/taxbencher is a bioinformatics pipeline that benchmarks taxonomic classifiers by evaluating their predictions against known ground truth. It accepts standardized taxpasta profiles (typically generated by nf-core/taxprofiler) and uses the CAMI OPAL evaluation framework to compute comprehensive performance metrics. The pipeline produces HTML reports with precision, recall, F1 scores, UniFrac distances, and other metrics to help researchers compare and select the best taxonomic classification tools for their data.

Pipeline steps:

  1. Convert taxpasta profiles to CAMI Bioboxes format (TAXPASTA_TO_BIOBOXES)
  2. Evaluate predictions against gold standard (OPAL)
  3. Aggregate results (MultiQC)

Features

  • Standardized Format Conversion: Converts taxpasta TSV to CAMI Bioboxes format
  • Comprehensive Metrics: Precision, recall, F1, UniFrac, Shannon diversity, Bray-Curtis, and more
  • Built-in Validation: Pre-flight validation tools for taxpasta and bioboxes formats
  • nf-core Compliance: Built using nf-core template with best practices
  • Full Container Support: 100% Docker coverage via Seqera Wave, plus Singularity and Conda
  • Extensive Testing: Full nf-test suite with validated test data
  • Integration Ready: Works seamlessly with nf-core/taxprofiler outputs
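To make the metric names above concrete, here is a minimal Python sketch of how precision, recall, F1, and Bray-Curtis dissimilarity can be computed for one predicted profile against a gold standard at a single rank. It is an illustration only and not OPAL's implementation; the function names `presence_metrics` and `bray_curtis`, and the toy taxon IDs and abundances, are assumptions made for this example.

```python
def presence_metrics(predicted, gold):
    """Precision, recall and F1 based on which taxa are present (abundance > 0)."""
    pred_taxa = {taxon for taxon, abundance in predicted.items() if abundance > 0}
    gold_taxa = {taxon for taxon, abundance in gold.items() if abundance > 0}
    true_positives = len(pred_taxa & gold_taxa)
    precision = true_positives / len(pred_taxa) if pred_taxa else 0.0
    recall = true_positives / len(gold_taxa) if gold_taxa else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bray_curtis(predicted, gold):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    taxa = set(predicted) | set(gold)
    difference = sum(abs(predicted.get(t, 0.0) - gold.get(t, 0.0)) for t in taxa)
    total = sum(predicted.get(t, 0.0) + gold.get(t, 0.0) for t in taxa)
    return difference / total if total else 0.0


# Toy profiles (hypothetical values): taxon ID -> relative abundance in percent.
gold = {"562": 50.0, "1280": 30.0, "1423": 20.0}
predicted = {"562": 60.0, "1280": 20.0, "287": 20.0}

precision, recall, f1 = presence_metrics(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"bray_curtis={bray_curtis(predicted, gold):.3f}")
```

In this toy case two of the three predicted taxa are in the gold standard, so precision, recall, and F1 are all 2/3, and the Bray-Curtis dissimilarity is 0.3.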

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample_id,label,classifier,taxpasta_file,taxonomy_db
sample1,sample1_kraken2,kraken2,results/taxprofiler/sample1_kraken2.tsv,NCBI
sample1,sample1_metaphlan,metaphlan,results/taxprofiler/sample1_metaphlan.tsv,NCBI
sample2,sample2_kraken2,kraken2,results/taxprofiler/sample2_kraken2.tsv,NCBI
sample2,sample2_metaphlan,metaphlan,results/taxprofiler/sample2_metaphlan.tsv,NCBI

Each row represents a taxpasta profile from a specific classifier.

  • sample_id: Biological sample identifier (groups profiles from the same biological sample)
  • label: Unique identifier for this taxonomic profile
  • classifier: Taxonomic classifier tool name
  • taxpasta_file: Path to taxpasta profile or raw profiler output
  • taxonomy_db: Taxonomy database (optional, default: NCBI)

Profiles with the same sample_id are evaluated together in a single OPAL run, enabling per-sample comparative analysis.
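The grouping rule can be checked before a run with a short script. The following Python sketch reads a samplesheet with the columns described above, rejects duplicate labels, and groups rows by sample_id. It only illustrates the grouping; it is not the pipeline's own parsing code, and the helper name `group_by_sample` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Same content as the example samplesheet.csv above.
SAMPLESHEET = """sample_id,label,classifier,taxpasta_file,taxonomy_db
sample1,sample1_kraken2,kraken2,results/taxprofiler/sample1_kraken2.tsv,NCBI
sample1,sample1_metaphlan,metaphlan,results/taxprofiler/sample1_metaphlan.tsv,NCBI
sample2,sample2_kraken2,kraken2,results/taxprofiler/sample2_kraken2.tsv,NCBI
sample2,sample2_metaphlan,metaphlan,results/taxprofiler/sample2_metaphlan.tsv,NCBI
"""


def group_by_sample(handle):
    """Group samplesheet rows by sample_id and reject duplicate labels."""
    groups = defaultdict(list)
    seen_labels = set()
    for row in csv.DictReader(handle):
        label = row["label"]
        if label in seen_labels:
            raise ValueError(f"duplicate label: {label}")
        seen_labels.add(label)
        # taxonomy_db is optional; fall back to the documented default.
        row["taxonomy_db"] = row.get("taxonomy_db") or "NCBI"
        groups[row["sample_id"]].append(row)
    return dict(groups)


groups = group_by_sample(io.StringIO(SAMPLESHEET))
for sample_id, rows in groups.items():
    print(sample_id, [row["label"] for row in rows])
```

Each key in `groups` corresponds to the set of profiles that would be compared together for one biological sample.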

Note

Taxpasta files are generated by nf-core/taxprofiler. Run taxprofiler first to generate taxonomic profiles in standardized format.

Tip

Always validate your input files before running the pipeline:

# Validate taxpasta files
python3 bin/validate_taxpasta.py sample1_kraken2.tsv

# Validate gold standard (CRITICAL - catches column mismatches and unsupported ranks)
python3 bin/validate_bioboxes.py gold_standard.bioboxes

# If validation fails, automatically fix common issues:
python3 bin/fix_gold_standard.py \
  -i gold_standard.bioboxes \
  -o gold_standard_fixed.bioboxes \
  -s sample_id

See Gold Standard Troubleshooting for a detailed validation and fixing guide.
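For intuition about what a column mismatch or an undeclared rank looks like, here is a rough Python sketch of such a check. It assumes the usual CAMI profiling layout (`@`-prefixed header lines, an `@Ranks:` line, and an `@@`-prefixed tab-separated column header). It is not the logic of bin/validate_bioboxes.py; the function name `check_bioboxes` and the toy rows are made up for illustration.

```python
def check_bioboxes(text):
    """Return a list of problems found in CAMI profiling (Bioboxes) text."""
    problems = []
    ranks = []
    columns = None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if line.startswith("@@"):
            # Column header, e.g. @@TAXID<TAB>RANK<TAB>TAXPATH<TAB>...
            columns = line[2:].split("\t")
        elif line.startswith("@"):
            key, _, value = line[1:].partition(":")
            if key.strip().lower() == "ranks":
                ranks = value.strip().split("|")
        else:
            fields = line.split("\t")
            if columns is None:
                problems.append(f"line {number}: data row before @@ column header")
                continue
            if len(fields) != len(columns):
                problems.append(
                    f"line {number}: expected {len(columns)} columns, found {len(fields)}"
                )
                continue
            if ranks and "RANK" in columns:
                rank = fields[columns.index("RANK")]
                if rank not in ranks:
                    problems.append(f"line {number}: rank '{rank}' not declared in @Ranks")
    return problems


# Toy gold standard with two deliberate errors (hypothetical values).
example = "\n".join([
    "@SampleID:sample1",
    "@Version:0.9.1",
    "@Ranks:superkingdom|phylum|class|order|family|genus|species",
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE",
    "2\tsuperkingdom\t2\tBacteria\t100.0",
    "1224\tphylum\t2|1224\tBacteria|Proteobacteria",      # PERCENTAGE column missing
    "83333\tsubspecies\t2|83333\tBacteria|toy taxon\t5.0",  # rank not in @Ranks
])

for problem in check_bioboxes(example):
    print(problem)
```

The bundled validators remain the reference; a sketch like this is only useful for understanding the kind of error they report.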

Now, you can run the pipeline using:

nextflow run FOI-Bioinformatics/taxbencher \
   -profile <docker/singularity/conda/.../institute> \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir <OUTDIR>

Recommended Profiles

| Platform | Recommended Profile | Notes |
|---|---|---|
| Linux x86_64 | docker,wave | Best performance, full functionality, 100% module coverage |
| Linux ARM64 | conda | Docker/Singularity containers are AMD64 only |
| macOS (Intel) | docker,wave | Full functionality with Wave containers |
| macOS (Apple Silicon) | conda | ⚠️ Docker has limitations (see below) |
| HPC/Cluster | singularity,wave or conda | Depends on cluster configuration |

Warning

Apple Silicon (M1/M2/M3) Limitations: When using -profile docker,wave on Apple Silicon Macs, the MultiQC step may fail with "Illegal instruction" errors due to AMD64/ARM64 architecture incompatibility. The core benchmarking processes (TAXPASTA_STANDARDISE, TAXPASTA_TO_BIOBOXES, OPAL) work correctly and produce all evaluation metrics. Recommended solution: Use -profile conda on Apple Silicon for full compatibility.
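The platform recommendations above can be turned into a small helper that suggests a profile for the current machine. This is a convenience sketch, not part of the pipeline; the function name `suggest_profile` is hypothetical. It cannot detect an HPC environment, so on a cluster the choice between singularity,wave and conda still has to be made manually.

```python
import platform


def suggest_profile(system, machine):
    """Map an OS name and CPU architecture to the recommended -profile value.

    Returns None for platforms the recommendations do not cover.
    """
    is_arm = machine.lower() in {"arm64", "aarch64"}
    if system == "Darwin":
        # Apple Silicon: conda avoids the AMD64/ARM64 "Illegal instruction" issue.
        return "conda" if is_arm else "docker,wave"
    if system == "Linux":
        # Docker/Singularity containers are AMD64 only, so ARM64 falls back to conda.
        return "conda" if is_arm else "docker,wave"
    return None


profile = suggest_profile(platform.system(), platform.machine())
print(f"Suggested profile: {profile}")
```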

Profile Details

Conda (Recommended for macOS):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile conda \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Docker with Wave (Recommended for Linux/Intel Mac):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile docker,wave \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Wave automatically builds containers for the modules that require scientific Python packages (TAXPASTA_TO_BIOBOXES, COMPARATIVE_ANALYSIS).

Singularity with Wave (HPC/Linux):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile singularity,wave \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
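As one way to follow this advice, the parameters from the commands above can be written to a JSON file and passed with -params-file, which accepts JSON or YAML. The Python sketch below is only an illustration; the file name params.json and the helper `write_params` are assumptions, and the parameter names are the ones used in the example commands.

```python
import json
from pathlib import Path

# Pipeline parameters taken from the example commands in this README.
params = {
    "input": "samplesheet.csv",
    "gold_standard": "gold_standard.bioboxes",
    "outdir": "results",
}


def write_params(path, values):
    """Write pipeline parameters as JSON for Nextflow's -params-file option."""
    target = Path(path)
    target.write_text(json.dumps(values, indent=2) + "\n")
    return target


params_file = write_params("params.json", params)
print(
    "nextflow run FOI-Bioinformatics/taxbencher "
    f"-profile docker,wave -params-file {params_file}"
)
```

Keeping parameters in a file makes a run easier to repeat, while executor or resource settings stay in custom config files passed with -c.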

Documentation

For detailed information, see:

Credits

FOI-Bioinformatics/taxbencher was originally written by Andreas Sjödin.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Key tools used in this pipeline:

  • OPAL (CAMI taxonomic profiling evaluation)

    Meyer F, Bremges A, Belmann P, Janssen S, McHardy AC, Koslicki D. Assessing taxonomic metagenome profilers with OPAL. Genome Biology. 2019;20:51. doi: 10.1186/s13059-019-1646-y

  • taxpasta (Taxonomic profile standardization)

    Beber ME, Borry M, Stamouli S, Fellows Yates JA. taxpasta: TAXonomic Profile Aggregation and STAndardisation. Journal of Open Source Software. 2023;8(87):5627. doi: 10.21105/joss.05627

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
