TMbed - Transmembrane proteins predicted through Language Model embeddings

TMbed predicts transmembrane beta barrel and alpha helical proteins, the position and orientation of their transmembrane segments, and signal peptides. We use a Protein Language Model, ProtT5-XL-U50 [1], to generate embeddings used as input for our method.

Pre-Print: bioRxiv
Publication: BMC Bioinformatics

TMbed is also available via bio_embeddings and LambdaPP [2].
You can also try out TMbed using Google Colab.

Visit TMvisDB [3] to see precomputed predictions for AlphaFold DB [4] structures.

With the predict command you can generate predictions for a set of proteins.
The only required input is a FASTA file containing the protein sequences. TMbed will generate the needed embeddings on the fly. If you also supply a file with embeddings, those embeddings will be used and only the subset of proteins contained within both input files will be predicted.

python -m tmbed predict -f sample.fasta -p sample.pred

python -m tmbed predict -f sample.fasta -e sample.h5 -p sample.pred

Optional arguments

--out-format sets the output format for the prediction file.

--batch-size is an approximation of how many residues should be included per batch.
Each batch is constrained by N * L^1.5 ≤ BS^1.5, where N is the number of sequences in the batch, L is the length of the longest sequence in the batch, and BS is the batch size. Batches with only a single sequence can break this restriction.
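
The inequality above can be checked with a few lines of Python. This is a minimal sketch of the constraint as stated, not TMbed's own batching code; the function name `fits_in_batch` and the batch size of 4000 used in the example are illustrative assumptions.

```python
def fits_in_batch(lengths, batch_size):
    """Check N * L^1.5 <= BS^1.5 for a candidate batch.

    lengths    -- sequence lengths of the proteins in the batch
    batch_size -- the value passed to --batch-size (BS)
    """
    if len(lengths) <= 1:
        # A batch with a single sequence may break the restriction.
        return True
    n = len(lengths)        # N: number of sequences in the batch
    longest = max(lengths)  # L: length of the longest sequence
    return n * longest ** 1.5 <= batch_size ** 1.5


# Illustrative values only (hypothetical batch size of 4000).
print(fits_in_batch([1000] * 4, 4000))   # four sequences of 1000 residues
print(fits_in_batch([1000] * 10, 4000))  # ten sequences of 1000 residues
print(fits_in_batch([10000], 4000))      # one very long sequence
```

Because the longest sequence enters with an exponent of 1.5, a few long sequences fill a batch much faster than many short ones.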

--use-gpu / --no-use-gpu controls whether TMbed will try to use an available GPU to speed up computations.

--cpu-fallback / --no-cpu-fallback controls whether TMbed will try to use the CPU if it fails to compute the embeddings on GPU.
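
If you call TMbed from a Python script, the options above can be assembled into an argument list for `subprocess.run`. The helper below is a hypothetical convenience function, not part of TMbed; it only uses the flags documented in this README and leaves out any option you do not set, so TMbed's defaults apply.

```python
import sys


def build_predict_command(fasta, output, embeddings=None, out_format=None,
                          batch_size=None, use_gpu=None, cpu_fallback=None):
    """Build the argument list for `python -m tmbed predict`."""
    cmd = [sys.executable, "-m", "tmbed", "predict", "-f", fasta, "-p", output]
    if embeddings is not None:
        cmd += ["-e", embeddings]
    if out_format is not None:
        cmd.append(f"--out-format={out_format}")
    if batch_size is not None:
        cmd.append(f"--batch-size={batch_size}")
    if use_gpu is not None:
        cmd.append("--use-gpu" if use_gpu else "--no-use-gpu")
    if cpu_fallback is not None:
        cmd.append("--cpu-fallback" if cpu_fallback else "--no-cpu-fallback")
    return cmd


cmd = build_predict_command("sample.fasta", "sample.pred",
                            out_format=1, use_gpu=False)
print(" ".join(cmd[1:]))
# To run the prediction: subprocess.run(cmd, check=True)
```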

Hardware requirements

When in half-precision mode, the ProtT5-XL-U50 encoder needs about 2.5 GB of VRAM on the GPU.

Additional memory requirements to generate embeddings depend heavily on the sequence length.
We recommend a GPU with at least 12 GB of VRAM, which is enough for sequences of up to ~4200 residues.

If you run into "out of memory" issues, try reducing the batch size.

Prediction output

TMbed supports five different output formats:

0: 3-line format with directed segments.
1: 3-line format with undirected segments.
2: Tabular format with directed segments.
3: Tabular format with undirected segments.
4: 3-line format with directed segments and explicit inside/outside prediction (a mix of format 0 and 1).

Predicted residue classes are encoded by single letters.
In 3-line format, every protein is represented by three lines: header, sequence, labels.
In tabular format, every protein is represented by a table containing sequence, labels, and class probabilities.

--out-format=0 (default)

B: Transmembrane beta strand (IN-->OUT orientation)
b: Transmembrane beta strand (OUT-->IN orientation)
H: Transmembrane alpha helix (IN-->OUT orientation)
h: Transmembrane alpha helix (OUT-->IN orientation)
S: Signal peptide
.: Non-Transmembrane

>7acg_A|P18895|ALGE_PSEAE
MNSSRSVNPRPSFAPRALSLAIALLLGAPAFAANSGEAPKNFGLDVKITGESENDRDLGTAPGGTLNDIGIDLRPWAFGQWGDWSAYFMGQAVAATDTIETDTLQSDTDDGNNSRNDGREPDKSYLAAREFWVDYAGLTAYPGEHLRFGRQRLREDSGQWQDTNIEALNWSFETTLLNAHAGVAQRFSEYRTDLDELAPEDKDRTHVFGDISTQWAPHHRIGVRIHHADDSGHLRRPGEEVDNLDKTYTGQLTWLGIEATGDAYNYRSSMPLNYWASATWLTGDRDNLTTTTVDDRRIATGKQSGDVNAFGVDLGLRWNIDEQWKAGVGYARGSGGGKDGEEQFQQTGLESNRSNFTGTRSRVHRFGEAFRGELSNLQAATLFGSWQLREDYDASLVYHKFWRVDDDSDIGTSGINAALQPGEKDIGQELDLVVTKYFKQGLLPASMSQYVDEPSALIRFRGGLFKPGDAYGPGTDSTMHRAFVDFIWRF
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.........BBBBBBBBBB.................bbbbbbbbbbb.....BBBBBBBBBB...............................bbbbbbbbbb.........BBBBBB...............bbbbbbbb....BBBBBBBB....................bbbbbbbb......BBBBBBBB..........................bbbbbbbb..........BBBBBBBBBB............................bbbbbbbbbb.....BBBBBBBB..............................................bbbbbbbbb.....BBBBBBBB............................bbbbbbbbb..................BBBBBBBBBB...............bbbbbbbbb.
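
A 3-line prediction file can be read back with plain Python. The sketch below is not part of TMbed; the function names are hypothetical, and it assumes that every protein occupies exactly three non-empty lines and that the label line has one character per residue. Segment positions are reported 1-based and inclusive.

```python
import re


def read_three_line(lines):
    """Return a list of (header, sequence, labels) tuples."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected header, sequence, and labels per protein")
    records = []
    for i in range(0, len(rows), 3):
        header, sequence, labels = rows[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"expected a '>' header, got: {header[:20]}")
        if len(sequence) != len(labels):
            raise ValueError(f"sequence/label length mismatch in {header}")
        records.append((header[1:], sequence, labels))
    return records


# Runs of identical labels for the classes B, b, H, h, and S.
SEGMENT = re.compile(r"B+|b+|H+|h+|S+")


def find_segments(labels):
    """Return (label, start, end) for every predicted segment."""
    return [(m.group()[0], m.start() + 1, m.end())
            for m in SEGMENT.finditer(labels)]


# Toy record for illustration, not a real prediction.
toy = [">toy_protein", "MKTAYIAKQRQISFVKS", "SSS..BBBB...bbbb."]
for header, sequence, labels in read_three_line(toy):
    print(header, find_segments(labels))
```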

--out-format=1

B: Transmembrane beta strand
H: Transmembrane alpha helix
S: Signal peptide
i: Non-Transmembrane, inside
o: Non-Transmembrane, outside

>7acg_A|P18895|ALGE_PSEAE
MNSSRSVNPRPSFAPRALSLAIALLLGAPAFAANSGEAPKNFGLDVKITGESENDRDLGTAPGGTLNDIGIDLRPWAFGQWGDWSAYFMGQAVAATDTIETDTLQSDTDDGNNSRNDGREPDKSYLAAREFWVDYAGLTAYPGEHLRFGRQRLREDSGQWQDTNIEALNWSFETTLLNAHAGVAQRFSEYRTDLDELAPEDKDRTHVFGDISTQWAPHHRIGVRIHHADDSGHLRRPGEEVDNLDKTYTGQLTWLGIEATGDAYNYRSSMPLNYWASATWLTGDRDNLTTTTVDDRRIATGKQSGDVNAFGVDLGLRWNIDEQWKAGVGYARGSGGGKDGEEQFQQTGLESNRSNFTGTRSRVHRFGEAFRGELSNLQAATLFGSWQLREDYDASLVYHKFWRVDDDSDIGTSGINAALQPGEKDIGQELDLVVTKYFKQGLLPASMSQYVDEPSALIRFRGGLFKPGDAYGPGTDSTMHRAFVDFIWRF
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSiiiiiiiiiBBBBBBBBBBoooooooooooooooooBBBBBBBBBBBiiiiiBBBBBBBBBBoooooooooooooooooooooooooooooooBBBBBBBBBBiiiiiiiiiBBBBBBoooooooooooooooBBBBBBBBiiiiBBBBBBBBooooooooooooooooooooBBBBBBBBiiiiiiBBBBBBBBooooooooooooooooooooooooooBBBBBBBBiiiiiiiiiiBBBBBBBBBBooooooooooooooooooooooooooooBBBBBBBBBBiiiiiBBBBBBBBooooooooooooooooooooooooooooooooooooooooooooooBBBBBBBBBiiiiiBBBBBBBBooooooooooooooooooooooooooooBBBBBBBBBiiiiiiiiiiiiiiiiiiBBBBBBBBBBoooooooooooooooBBBBBBBBBi

--out-format=2 and --out-format=3

- AA: Amino acid
- PRD: Predicted class label
- P(B): Probability for class 'transmembrane beta strand'
- P(H): Probability for class 'transmembrane alpha helix'
- P(S): Probability for class 'signal peptide'
- P(i): Probability for class 'non-transmembrane, inside'
- P(o): Probability for class 'non-transmembrane, outside'

--out-format=2 uses the same class labels as --out-format=0.
--out-format=3 uses the same class labels as --out-format=1.
```
>7acg_A|P18895|ALGE_PSEAE
AA  PRD P(B)    P(H)    P(S)    P(i)    P(o)
M   S   0.00    0.00    0.94    0.05    0.00
N   S   0.00    0.00    0.98    0.02    0.00
S   S   0.00    0.00    0.99    0.01    0.00
S   S   0.00    0.00    0.99    0.01    0.00
R   S   0.00    0.00    1.00    0.00    0.00
S   S   0.00    0.00    1.00    0.00    0.00
V   S   0.00    0.00    0.99    0.00    0.00
N   S   0.00    0.00    0.99    0.01    0.00
...
```
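
The tabular output can be parsed into per-residue records without extra dependencies. This is a minimal sketch, not TMbed's own code; the function name `read_tabular` is hypothetical, and it assumes that columns are separated by whitespace and that each `>` header is followed by one line of column names.

```python
def read_tabular(lines):
    """Parse tabular output into {header: [row, ...]}.

    Each row is a dict holding the amino acid, the predicted label,
    and one float per probability column named in the table header.
    """
    records = {}
    rows = None
    columns = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New protein: the next non-empty line names the columns.
            rows = records.setdefault(line[1:], [])
            columns = None
        elif rows is None:
            raise ValueError("table row found before any '>' header")
        elif columns is None:
            columns = line.split()
        else:
            fields = line.split()
            row = {"AA": fields[0], "PRD": fields[1]}
            for name, value in zip(columns[2:], fields[2:]):
                row[name] = float(value)
            rows.append(row)
    return records


example = """\
>7acg_A|P18895|ALGE_PSEAE
AA  PRD P(B)    P(H)    P(S)    P(i)    P(o)
M   S   0.00    0.00    0.94    0.05    0.00
N   S   0.00    0.00    0.98    0.02    0.00
"""
table = read_tabular(example.splitlines())
first = table["7acg_A|P18895|ALGE_PSEAE"][0]
print(first["AA"], first["PRD"], first["P(S)"])
```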

--out-format=4

B: Transmembrane beta strand (IN-->OUT orientation)
b: Transmembrane beta strand (OUT-->IN orientation)
H: Transmembrane alpha helix (IN-->OUT orientation)
h: Transmembrane alpha helix (OUT-->IN orientation)
S: Signal peptide
i: Non-Transmembrane, inside
o: Non-Transmembrane, outside

>7acg_A|P18895|ALGE_PSEAE
MNSSRSVNPRPSFAPRALSLAIALLLGAPAFAANSGEAPKNFGLDVKITGESENDRDLGTAPGGTLNDIGIDLRPWAFGQWGDWSAYFMGQAVAATDTIETDTLQSDTDDGNNSRNDGREPDKSYLAAREFWVDYAGLTAYPGEHLRFGRQRLREDSGQWQDTNIEALNWSFETTLLNAHAGVAQRFSEYRTDLDELAPEDKDRTHVFGDISTQWAPHHRIGVRIHHADDSGHLRRPGEEVDNLDKTYTGQLTWLGIEATGDAYNYRSSMPLNYWASATWLTGDRDNLTTTTVDDRRIATGKQSGDVNAFGVDLGLRWNIDEQWKAGVGYARGSGGGKDGEEQFQQTGLESNRSNFTGTRSRVHRFGEAFRGELSNLQAATLFGSWQLREDYDASLVYHKFWRVDDDSDIGTSGINAALQPGEKDIGQELDLVVTKYFKQGLLPASMSQYVDEPSALIRFRGGLFKPGDAYGPGTDSTMHRAFVDFIWRF
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSiiiiiiiiiBBBBBBBBBBooooooooooooooooobbbbbbbbbbbiiiiiBBBBBBBBBBooooooooooooooooooooooooooooooobbbbbbbbbbiiiiiiiiiBBBBBBooooooooooooooobbbbbbbbiiiiBBBBBBBBoooooooooooooooooooobbbbbbbbiiiiiiBBBBBBBBoooooooooooooooooooooooooobbbbbbbbiiiiiiiiiiBBBBBBBBBBoooooooooooooooooooooooooooobbbbbbbbbbiiiiiBBBBBBBBoooooooooooooooooooooooooooooooooooooooooooooobbbbbbbbbiiiiiBBBBBBBBoooooooooooooooooooooooooooobbbbbbbbbiiiiiiiiiiiiiiiiiiBBBBBBBBBBooooooooooooooobbbbbbbbbi
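
Judging from the label legends above, format 4 carries the information of both format 0 and format 1, so a format 4 label string can be reduced to either of them. The sketch below is an assumption based on those legends, not a TMbed function; the names `to_format_0` and `to_format_1` are hypothetical.

```python
# Format 0 keeps segment orientation but marks all non-transmembrane
# residues with '.'.
TO_FORMAT_0 = str.maketrans({"i": ".", "o": "."})

# Format 1 keeps inside/outside labels but drops segment orientation.
TO_FORMAT_1 = str.maketrans({"b": "B", "h": "H"})


def to_format_0(labels):
    """Convert format 4 labels to format 0 (directed segments)."""
    return labels.translate(TO_FORMAT_0)


def to_format_1(labels):
    """Convert format 4 labels to format 1 (undirected segments)."""
    return labels.translate(TO_FORMAT_1)


# Toy label string for illustration, not a real prediction.
labels = "SSiiBBoobbiHHohh"
print(to_format_0(labels))
print(to_format_1(labels))
```

The reverse direction is not possible from either format alone: format 0 has no inside/outside labels, and format 1 has no segment orientation.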

Precomputed predictions

We provide precomputed predictions for the human proteome and for UniProtKB/Swiss-Prot.

Human (21-04-2022): Download
UniProtKB/Swiss-Prot (11-05-2022): Download

Roadmap

References

[1] Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Bhowmik D, Rost B (2021). ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2021.3095381.

[2] Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, Rost B (2023). LambdaPP: Fast and accessible protein-specific phenotype predictions. Protein Sci, 32, 1:e4524.

[3] Marquet C, Grekova A, Houri L, Bernhofer M, Jimenez-Soto L F, Karl T, Heinzinger M, Dallago C, Rost B (2022). TMvisDB: resource for transmembrane protein annotation and 3D visualization. bioRxiv, 2022.11.30.518551.

[4] Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Zidek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D, Velankar S (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res, 50, D1:D439-D444.
