Baktfold

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids using protein structural information

Baktfold is a sensitive annotation tool for bacterial genomes, MAGs & plasmids genomes using protein structural homology.

Baktfold is very similar to Phold but goes beyond phages to bacterial annotation. Baktfold takes all hypothetical proteins from Bakta's output and uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a series of databases (SwissProt, AlphaFold Database non-singleton clusters, PDB and CATH).

Additionally, instead of using ProstT5, you can specify protein structures that you have pre-computed for your hypothetical proteins.

You can also specify custom databases to search against using --custom-db

Baktfold is currently under active development. We would welcome any and all feedback (especially bugs) via Issues

Google Colab Notebook

If you don't want to install Baktfold locally, you can run it without any code using the Google Colab notebook

# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json  -o baktfold_output -f -t 8 -d baktfold_db   
# with Nvidia GPU
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json  -o baktfold_output -f -t 8 -d baktfold_db   --foldseek-gpu

Running Baktfold on protein sequences (using a dummy test example Fasta .faa file):

# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa  -o baktfold_proteins_output -f -t 8 -d baktfold_db   
# with Nvidia GPU 
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa  -o baktfold_proteins_output -f -t 8 -d baktfold_db   --foldseek-gpu

Note that this can be any .faa. It does not have to be the output of Bakta.

Usage

The two most useful commands are baktfold run and baktfold proteins

baktfold run accepts a Bakta json file as input, and by default, it will annotate all hypothetical CDS and return a variety of Bakta-like compliant output formats. All other annotations will be inherited from the Bakta output
baktfold proteins accepts a protein FASTA .faa format file as input. It will annotate all protein sequences and return a variety of bakta_proteins-like output formats
baktfold predict and baktfold compare split baktfold run into the ProstT5 and Foldseek modules, while baktfold proteins-predict and baktfold proteins-compare do the same for baktfold proteins (useful if you have non-NVIDIA GPUs)

It is recommend you run Baktfold with a GPU if you can. If you do not have a GPU, Baktfold will still run, but the ProstT5 step will be fairly slow. If you have a NVIDIA GPU, you can also use the --foldseek-gpu parameter to accelerate Foldseek further

Usage: baktfold [OPTIONS] COMMAND [ARGS]...

  Main command line interface for baktfold.

  Returns:   None

  Examples:   >>> main_cli()   None

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  autotune          Determines optimal batch size for 3Di prediction with...
  citation          Print the citation(s) for this tool
  compare           Runs Foldseek vs baktfold db
  convert-euk       (Experimental) Converts eukaryotic GenBank to Bakta...
  convert-prokka    Converts Prokka GenBank to Bakta format json
  createdb          Creates foldseek DB from AA FASTA and 3Di FASTA input...
  install           Installs ProstT5 model and baktfold database
  predict           Uses ProstT5 to predict 3Di tokens - GPU recommended
  proteins          baktfold proteins-predict then comapare all in one -...
  proteins-compare  Runs Foldseek vs baktfold db on proteins input
  proteins-predict  Runs ProstT5 on a multiFASTA input - GPU recommended
  run               baktfold predict then comapare all in one - GPU...

Usage: baktfold run [OPTIONS]

  baktfold predict then comapare all in one - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Bakta Genbank format or
                                 Bakta JSON format  [required]
  -o, --output PATH              Output directory   [default: output_baktfold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: baktfold]
  -d, --database TEXT            Specific path to installed baktfold database
  -f, --force                    Force overwrites the output directory
  --autotune                     Run autotuning to detect and automatically
                                 use best batch size for your hardware.
                                 Recommended only if you have a large dataset
                                 (e.g. thousands of proteins), or else
                                 autotuning will add rather than save runtime.
  --batch-size INTEGER           batch size for ProstT5. 1 is usually fastest.
                                 [default: 1]
  --cpu                          Use cpus only.
  --omit-probs                   Do not output per residue 3Di probabilities
                                 from ProstT5. Mean per protein 3Di
                                 probabilities will always be output.
  --save-per-residue-embeddings  Save the ProstT5 embeddings per resuide in a
                                 h5 file
  --save-per-protein-embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  --mask-threshold FLOAT         Masks 3Di residues below this value of
                                 ProstT5 confidence for Foldseek searches
                                 [default: 25]
  -e, --evalue FLOAT             Evalue threshold for Foldseek  [default:
                                 1e-3]
  -s, --sensitivity FLOAT        Sensitivity parameter for foldseek  [default:
                                 9.5]
  --keep-tmp-files               Keep temporary intermediate files,
                                 particularly the large foldseek_results.tsv
                                 of all Foldseek hits
  --max-seqs INTEGER             Maximum results per query sequence allowed to
                                 pass the prefilter. You may want to reduce
                                 this to save disk space for enormous datasets
                                 [default: 1000]
  --ultra-sensitive              Runs baktfold with maximum sensitivity by
                                 skipping Foldseek prefilter. Not recommended
                                 for large datasets.
  --extra-foldseek-params TEXT   Extra foldseek search params
  --custom-db TEXT               Path to custom database
  --foldseek-gpu                 Use this to enable compatibility with
                                 Foldseek-GPU search acceleration
  --custom-annotations PATH      Custom Foldseek DB annotations, 2 column tsv.
                                 Column 1 matches the Foldseek headers, column
                                 2 is the description.
  --euk                          Eukaryotic input genome.
  --fast                         Skips Foldseek search against AFDB Clusters.
  -a, --all-proteins             annotate all proteins (not just
                                 hypotheticals)

Output

The majority of outputs match Bakta. Specifically, all the format compliant outputs match Bakta's.

The differences are:

<prefix>.inference.tsv is different compared to Bakta.
In Baktfold, this file gives a quick overview of the different Baktfold databases for which the query protein has a hit (if any)

For example:

ID	Length	Product	Swissprot	AFDBClusters	PDB	CATH
MEGJMNBEGN_27	162	HTH-type quorum-sensing regulator RhlR	swissprot_P54292	afdbclusters_A0A9E1VSB0	pdb_5l09	cath_3sztB01
MEGJMNBEGN_30	68	hypothetical protein				
MEGJMNBEGN_70	94	hypothetical protein		afdbclusters_A0A1I3V7E0

<prefix>_<database>_tophit.tsv files give the detailed Foldseek alignment information for each tophit found for each database.

For example:

query	target	bitscore	fident	evalue	qStart	qEnd	qLen	qCov	tStart	tEnd	tLen	tCov
MEGJMN_070	AF-A0A1I3V7E0-F1-model_v6	292	0.41	2.619e-06	1	91	93	0.97	1	95	99	0.95

The full Foldseek search outputs are not kept by default (only tophits). You can keep the full Foldseek search TSVs using --keep-tmp-files. They will be called foldseek_results_<database>.tsv.
baktfold_3di.fasta which gives the 3Di tokens for each input CDS
baktfold_prostT5_3di_mean_probabilities.csv and baktfold_prostT5_3di_all_probabilities.json, which give some score of the confidence ProstT5 has in its predictions. You can disable this output with --omit-probs
Baktfold does not have plotting functionality like Bakta (yet)

Conceptual terms

As Baktfold inherits annotations and related conceptual terms from Bakta. Hence, we kindly refer to Bakta's readme. In addition, Baktfold introduces one conceptual term:

PSTC: protein structure clusters. These comprise of structure-based annotations to any of Baktfold's databases

Citations

A manuscript describing Baktfold is in preparation

Please, be sure to cite the following core dependencies - citing all bioinformatics tools that you use helps us, so helps you get better bioinformatics tools:

Foldseek - (https://github.com/steineggerlab/foldseek) van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), doi:10.1038/s41587-023-01773-0
ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) doi:10.1101/2023.07.23.550085

Please also consider citing these databases where relevant:

AFDB/SwissProt - Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D368–D375, https://doi.org/10.1093/nar/gkad1011
CATH - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH--a hierarchic classification of protein domain structures. Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8. PMID: 9309224.
PDB - H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank (2000) Nucleic Acids Research 28: 235-242 https://doi.org/10.1093/nar/28.1.235

Name		Name	Last commit message	Last commit date
Latest commit History 177 Commits
.github		.github
db-scripts		db-scripts
docs		docs
src/baktfold		src/baktfold
tests		tests
.gitignore		.gitignore
HISTORY.md		HISTORY.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
environment.yml		environment.yml
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
run_baktfold.ipynb		run_baktfold.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Baktfold

Google Colab Notebook

Table of Contents

Install

Conda (recommended)

Pip

Source

Database Installation

Example

Usage

Output

Conceptual terms

Citations

About

Uh oh!

Releases 3

Packages

Contributors 2

Uh oh!

Languages

License

gbouras13/baktfold

Folders and files

Latest commit

History

Repository files navigation

Baktfold

Google Colab Notebook

Table of Contents

Install

Conda (recommended)

Pip

Source

Database Installation

Example

Usage

Output

Conceptual terms

Citations

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Packages