Gracken is a tool for creating phylogenetic trees from Bracken/Kraken2 reports. Both NCBI and GTDB taxonomies (see Struo2) are supported. It accomplishes this by pruning the NCBI/GTDB-provided tree using the species information from the report files. Gracken outputs a Newick-formatted tree and an OTU file with a format suitable for use with phyloseq.
Gracken was inspired by this comment on the KrakenTools repository.
If you have uv
installed, you can run gracken directly using uvx
:
uvx gracken --help
or install it using either uv
or pipx
:
# with uv
uv tool install gracken
# with pipx
pipx install gracken
# test that it works
gracken --help
Gracken requires Bracken or Kraken2 report files. For GTDB taxonomy, you'll also need the taxonomy and tree files matching the database you used.
usage: gracken [-h] [--version] --input_dir INPUT_DIR [--bac_taxonomy BAC_TAXONOMY] [--ar_taxonomy AR_TAXONOMY]
[--bac_tree BAC_TREE] [--ar_tree AR_TREE] [--out_prefix OUT_PREFIX] [--mode {bracken,kraken2}]
[--keep-spaces] [--taxonomy {gtdb,ncbi}] [--full-taxonomy]
Creates a phylogenetic tree and OTU table from Bracken/Kraken2 reports by pruning GTDB/NCBI trees
options:
-h, --help show this help message and exit
--version, -v show program version and exit
--input_dir INPUT_DIR, -i INPUT_DIR
directory containing Bracken/Kraken2 report files
--bac_taxonomy BAC_TAXONOMY
path to GTDB bacterial taxonomy file
--ar_taxonomy AR_TAXONOMY
path to GTDB archaeal taxonomy file
--bac_tree BAC_TREE path to GTDB bacterial tree file
--ar_tree AR_TREE path to GTDB archaeal tree file
--out_prefix OUT_PREFIX, -o OUT_PREFIX
prefix for output files (default: output). Creates <prefix>.tree and <prefix>.otu.csv
--mode {bracken,kraken2}, -m {bracken,kraken2}
input file format (default: bracken)
--keep-spaces keep spaces in species names (default: False)
--taxonomy {gtdb,ncbi}, -t {gtdb,ncbi}
taxonomy source to use (default: ncbi).
--full-taxonomy, -f include full taxonomy info in OTU table (default: False)
First, download the taxonomy and tree files for the GTDB release corresponding to your Bracken reports from the GTDB website. For example, if you used GTDB release 95, you would download these files: bac120_taxonomy_r95.tsv
, ar122_taxonomy_r95.tsv
, bac120_r95.tree
, and ar122_r95.tree
.
gracken --input_dir bracken_reports --bac_taxonomy bac120_taxonomy_r95.tsv --ar_taxonomy ar122_taxonomy_r95.tsv --bac_tree bac120_r95.tree --ar_tree ar122_r95.tree --out_prefix my_tree --taxonomy gtdb
This command will:
- Read Bracken reports from the
bracken_reports
directory. - Use the specified GTDB taxonomy and tree files.
- Create two output files:
my_tree.tree
(the phylogenetic tree) andmy_tree.otu.csv
(the OTU table).
In NCBI mode, the taxdump is automatically downloaded. Simply run:
gracken --input_dir bracken_reports --out_prefix my_ncbi_tree --mode bracken
This command will:
- Read Bracken reports from the
bracken_reports
directory. - Automatically download the necessary NCBI taxdump.
- Create two output files:
my_ncbi_tree.tree
(the phylogenetic tree) andmy_ncbi_tree.otu.csv
(the OTU table).
Gracken will output two files:
-
<prefix>.otu.csv
: A CSV file containing the OTU table. The first column contains the species names, and the remaining columns contain the counts for each species in each sample (smple names are derived from report file names). If you used the--full-taxonomy
option, the first columns will contain the full taxonomy for each species. -
<prefix>.tree
: A Newick-formatted tree file.
You can use the output OTU file and tree with phyloseq and ape. Here's an example R script which works regardless of whether you used --full-taxonomy
or not:
library(ape)
library(phyloseq)
# read the otu table and convert it to matrix, using the species column as row names
otu_tbl <- read.csv("output.otu.csv", stringsAsFactors = FALSE)
otu_mat <- as.matrix(otu_tbl[, (which(names(otu_tbl) == "species") + 1):ncol(otu_tbl)])
rownames(otu_mat) <- otu_tbl$species
otu_mat[is.na(otu_mat)] <- 0
otu_table_ps <- otu_table(otu_mat, taxa_are_rows = TRUE)
# load the tree
tree_ps <- phy_tree(read.tree('output.tree'))
# create the phyloseq object
ps <- phyloseq(otu_table_ps, tree_ps)