Rapid & standardized annotation of bacterial genomes, MAGs & plasmids using protein structural information
Baktfold is a sensitive annotation tool for bacterial genomes, MAGs & plasmids genomes using protein structural homology.
Baktfold is very similar to Phold but goes beyond phages to bacterial annotation. Baktfold takes all hypothetical proteins from Bakta's output and uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a series of databases (SwissProt, AlphaFold Database non-singleton clusters, PDB and CATH).
Additionally, instead of using ProstT5, you can specify protein structures that you have pre-computed for your hypothetical proteins.
You can also specify custom databases to search against using --custom-db
Baktfold is currently under active development. We would welcome any and all feedback (especially bugs) via Issues
If you don't want to install Baktfold locally, you can run it without any code using the Google Colab notebook
The best way to install Baktfold is using conda, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies We would highly recommend installing Conda via Miniforge:
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold To utilise phold with GPU, a GPU compatible version of pytorch must be installed (default: CPU-only version):
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold pytorch=*=cuda*If you have a Mac with M-series Apple Silicon, you may need to install a particular version of Pytorch to utilise GPU-acceleration. The same is true if you use other non-NVIDIA e.g. AMD GPUs. See this link for some more detail and further links
You can also install Baktfold using Pip:
pip install baktfoldYou will need to have Foldseek (ideally v10.941cd33) installed and available in the $PATH.
You can install the latest version of Baktfold with potentially untested and unreleased changes into a conda environment using conda as follows:
conda create -n baktfoldENV foldseek
conda activate baktfoldENV
git clone https://github.com/gbouras13/baktfold.git
cd baktfold
pip install .
baktfold --helpTo download and install Baktfold's databases (use as many threads with -t as you can to speed up downloading):
baktfold install -d baktfold_db -t 8If you have an NVIDIA GPU, you will need to format the database to allow it to use Foldseek-GPU with --foldseek-gpu.
Note: you can do this after downloading the database with the above command (it won't redownload the database, only do the relevant Foldseek database padding)
baktfold install -d baktfold_db --foldseek-gpuFirst, you need to run Bakta and use the resulting .json file as input for Baktfold.
We will add other input formats eventually, but we would always recommend running Bakta first, for a comprehensive genome annotation.
Running Baktfold on Bakta results (using a dummy test example JSON assembly.json file:
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db
# with Nvidia GPU
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db --foldseek-gpuRunning Baktfold on protein sequences (using a dummy test example Fasta .faa file):
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db
# with Nvidia GPU
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db --foldseek-gpuNote that this can be any .faa. It does not have to be the output of Bakta.
The two most useful commands are baktfold run and baktfold proteins
baktfold runaccepts a Bakta json file as input, and by default, it will annotate all hypothetical CDS and return a variety of Bakta-like compliant output formats. All other annotations will be inherited from the Bakta outputbaktfold proteinsaccepts a protein FASTA.faaformat file as input. It will annotate all protein sequences and return a variety ofbakta_proteins-like output formatsbaktfold predictandbaktfold comparesplitbaktfold runinto the ProstT5 and Foldseek modules, whilebaktfold proteins-predictandbaktfold proteins-comparedo the same forbaktfold proteins(useful if you have non-NVIDIA GPUs)
It is recommend you run Baktfold with a GPU if you can. If you do not have a GPU, Baktfold will still run, but the ProstT5 step will be fairly slow. If you have a NVIDIA GPU, you can also use the --foldseek-gpu parameter to accelerate Foldseek further
Usage: baktfold [OPTIONS] COMMAND [ARGS]...
Main command line interface for baktfold.
Returns: None
Examples: >>> main_cli() None
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
Commands:
autotune Determines optimal batch size for 3Di prediction with...
citation Print the citation(s) for this tool
compare Runs Foldseek vs baktfold db
convert-euk (Experimental) Converts eukaryotic GenBank to Bakta...
convert-prokka Converts Prokka GenBank to Bakta format json
createdb Creates foldseek DB from AA FASTA and 3Di FASTA input...
install Installs ProstT5 model and baktfold database
predict Uses ProstT5 to predict 3Di tokens - GPU recommended
proteins baktfold proteins-predict then comapare all in one -...
proteins-compare Runs Foldseek vs baktfold db on proteins input
proteins-predict Runs ProstT5 on a multiFASTA input - GPU recommended
run baktfold predict then comapare all in one - GPU...Usage: baktfold run [OPTIONS]
baktfold predict then comapare all in one - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Bakta Genbank format or
Bakta JSON format [required]
-o, --output PATH Output directory [default: output_baktfold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: baktfold]
-d, --database TEXT Specific path to installed baktfold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch-size INTEGER batch size for ProstT5. 1 is usually fastest.
[default: 1]
--cpu Use cpus only.
--omit-probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save-per-residue-embeddings Save the ProstT5 embeddings per resuide in a
h5 file
--save-per-protein-embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask-threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep-tmp-files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--max-seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra-sensitive Runs baktfold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra-foldseek-params TEXT Extra foldseek search params
--custom-db TEXT Path to custom database
--foldseek-gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--custom-annotations PATH Custom Foldseek DB annotations, 2 column tsv.
Column 1 matches the Foldseek headers, column
2 is the description.
--euk Eukaryotic input genome.
--fast Skips Foldseek search against AFDB Clusters.
-a, --all-proteins annotate all proteins (not just
hypotheticals)The majority of outputs match Bakta. Specifically, all the format compliant outputs match Bakta's.
The differences are:
<prefix>.inference.tsvis different compared to Bakta.- In Baktfold, this file gives a quick overview of the different Baktfold databases for which the query protein has a hit (if any)
For example:
ID Length Product Swissprot AFDBClusters PDB CATH
MEGJMNBEGN_27 162 HTH-type quorum-sensing regulator RhlR swissprot_P54292 afdbclusters_A0A9E1VSB0 pdb_5l09 cath_3sztB01
MEGJMNBEGN_30 68 hypothetical protein
MEGJMNBEGN_70 94 hypothetical protein afdbclusters_A0A1I3V7E0 <prefix>_<database>_tophit.tsvfiles give the detailed Foldseek alignment information for each tophit found for each database.
For example:
query target bitscore fident evalue qStart qEnd qLen qCov tStart tEnd tLen tCov
MEGJMN_070 AF-A0A1I3V7E0-F1-model_v6 292 0.41 2.619e-06 1 91 93 0.97 1 95 99 0.95- The full Foldseek search outputs are not kept by default (only tophits). You can keep the full Foldseek search TSVs using
--keep-tmp-files. They will be calledfoldseek_results_<database>.tsv. baktfold_3di.fastawhich gives the 3Di tokens for each input CDSbaktfold_prostT5_3di_mean_probabilities.csvandbaktfold_prostT5_3di_all_probabilities.json, which give some score of the confidence ProstT5 has in its predictions. You can disable this output with--omit-probs- Baktfold does not have plotting functionality like Bakta (yet)
As Baktfold inherits annotations and related conceptual terms from Bakta. Hence, we kindly refer to Bakta's readme. In addition, Baktfold introduces one conceptual term:
PSTC: protein structure clusters. These comprise of structure-based annotations to any of Baktfold's databases
A manuscript describing Baktfold is in preparation
Please, be sure to cite the following core dependencies - citing all bioinformatics tools that you use helps us, so helps you get better bioinformatics tools:
- Foldseek - (https://github.com/steineggerlab/foldseek) van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), doi:10.1038/s41587-023-01773-0
- ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) doi:10.1101/2023.07.23.550085
Please also consider citing these databases where relevant:
- AFDB/SwissProt - Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D368–D375, https://doi.org/10.1093/nar/gkad1011
- CATH - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH--a hierarchic classification of protein domain structures. Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8. PMID: 9309224.
- PDB - H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank (2000) Nucleic Acids Research 28: 235-242 https://doi.org/10.1093/nar/28.1.235