ProtACon

ProtACon is a broad project that explores and interprets the representations generated by transformers when they are applied to proteins. Take a look at our experiment report to discover what you can do with ProtACon. 🧑‍🔬

Note

You are on the main, stable branch. Check out the "advanced" branch for an extended (yet less documented) version with more features.

The goal is to detect possible connections and similarities between the attention weights generated by the ProtBert transformer, and the physico-chemical properties of the proteins that are fed into it.

This project was inspired by "BERTology Meets Biology" by Jesse Vig and colleagues, which showed that BERT-based models can capture high-level structural properties of proteins when given only the sequence of amino acids in the protein chains.

👉 Check out the code documentation at the reference guide.

How ProtACon works

The whole pipeline is founded on two pillars: the ProtBert encoder for the extraction of the attention, and the RCSB Protein Data Bank to get the protein structural information.

Starting from a PDB entry (a 4-character alphanumeric code uniquely identifying one protein), the PDB file of the corresponding protein is downloaded. Proteins often have more than one chain, and only the first one is used. From that chain, the sequence of amino acids is stored, together with the coordinates of the α-carbon atom of each residue. The amino acid sequence is then passed to the ProtBert encoder, and the attention from each head of the model is extracted.
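The chain-selection step can be illustrated with a minimal, standard-library sketch. It is not ProtACon's own implementation (the project depends on Biopython, which would normally handle PDB parsing). The function name `first_chain_ca` and the fallback letter "X" for non-standard residues are assumptions made for this example. The sketch reads the fixed-width ATOM records of a PDB file, keeps the first chain of the first model, and returns the one-letter sequence together with the α-carbon coordinates.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def first_chain_ca(pdb_text):
    """Return (sequence, CA coordinates) for the first chain in a PDB text.

    Hypothetical helper: relies only on the fixed-column layout of ATOM
    records in the PDB file format.
    """
    sequence, coords = [], []
    first_chain = None
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only the first model is considered
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue  # keep alpha-carbon atoms of polymer residues only
        chain_id = line[21]
        if first_chain is None:
            first_chain = chain_id
        if chain_id != first_chain:
            continue  # ignore every chain after the first one
        residue_key = line[22:27]  # residue number plus insertion code
        if residue_key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(residue_key)
        # Assumption: unknown residue names are mapped to "X".
        sequence.append(THREE_TO_ONE.get(line[17:20].strip(), "X"))
        coords.append(
            (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        )
    return "".join(sequence), coords
```

The ProtBert model card feeds sequences as residue letters separated by spaces, so `" ".join(sequence)` would be the usual next step before tokenization.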

What ProtACon does

ProtACon has two possible uses: either on single peptide chains, or on a set of them.

For both uses, the main results of a run are the attention alignment with the protein contact map and the attention similarity between amino acids; see our report for their definitions. Besides these, several other quantities are computed and saved in dedicated folders.
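To make the idea of aligning attention with a contact map more concrete, here is a small sketch. The exact definitions used by ProtACon are those in the report. What follows is only one common, simplified reading: two residues are "in contact" when their α-carbons lie within a distance cutoff, and the alignment is the share of a head's attention weight that falls on contacting pairs. The 8 Å cutoff, the `min_separation` parameter, and both function names are assumptions made for this example.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Binary contact map from CA coordinates (assumed cutoff in angstroms)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            far_enough_in_sequence = abs(i - j) >= min_separation
            if far_enough_in_sequence and math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = 1
    return cmap


def attention_alignment(attention, cmap):
    """Fraction of the total attention weight that lands on contacting pairs."""
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    aligned = sum(
        weight * contact
        for att_row, map_row in zip(attention, cmap)
        for weight, contact in zip(att_row, map_row)
    )
    return aligned / total
```

Here `attention` stands for one head's square matrix of weights over the residues of a chain, with special tokens already removed.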

A run on a set of chains does not save the quantities for the single chains on your device unless you request it; instead, it computes and stores the averages of those quantities over the whole protein set. Find the guide to the output files in the wiki section.

ProtACon integrates the PDB Search API in its pipeline. Thus, when running on sets of proteins, you have two ways to choose the composition of the set:

- by passing the complete list of PDB entries;
- by providing some parameters for a search in the PDB API, such as the minimum and maximum number of residues in each chain, and the number of chains making up the set.

Look at the wiki section for more information about configuring your experiment.
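As an illustration of the second option, the sketch below builds a request body for the RCSB Search API from a residue range and a set size. It does not reproduce ProtACon's actual query. The function name `build_search_query`, the attribute `entity_poly.rcsb_sample_sequence_length`, and the choice of `polymer_entity` as return type are assumptions to be checked against the RCSB Search API documentation. No request is sent; the function only returns the payload.

```python
import json

# Public endpoint of the RCSB Search API (v2).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_search_query(min_residues, max_residues, n_chains):
    """Build a JSON-serializable query for chains within a length range."""
    if min_residues > max_residues:
        raise ValueError("min_residues must not exceed max_residues")
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute holding the length of the polymer chain.
                "attribute": "entity_poly.rcsb_sample_sequence_length",
                "operator": "range",
                "value": {
                    "from": min_residues,
                    "to": max_residues,
                    "include_lower": True,
                    "include_upper": True,
                },
            },
        },
        "return_type": "polymer_entity",
        # Limit the number of returned hits to the requested set size.
        "request_options": {"paginate": {"start": 0, "rows": n_chains}},
    }


if __name__ == "__main__":
    # Print the body that would be POSTed to SEARCH_URL.
    print(json.dumps(build_search_query(50, 300, 20), indent=2))
```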

Quickstart

Prerequisites

Two prerequisites are needed:

An environment with Python-3.10.15 🐍.
The GCC package installed—it is required for Biopython to work correctly.
- If you are in a conda environment, you can install it with:
```
conda install conda-forge::gcc
```
- Otherwise, you can run the following command (requires root privileges):
```
apt-get install gcc
```
The code was verified to work correctly with gcc-11.4.0 and gcc-14.2.0.

Installation

To install ProtACon, execute the following commands:

git clone https://github.com/sim1-99/ProtACon.git
cd ProtACon

Then, install with:

pip install .

Once it is installed, you can run the application from any path.

To install the repo in developer mode, run this command instead:

pip install -e .

This installation lets you edit the code without having to reinstall it every time for the changes to take effect.

If you also want to install the required packages to run the test suite, run:

pip install .'[test]'

or, in developer mode:

pip install -e .'[test]'

The list of installed dependencies can be found in pyproject.toml.

Running the code

You can launch different scripts by typing in the command line ProtACon followed by one of the following commands.

on_chain

If you want to analyze a single protein, for example 6NJC, run:
```
ProtACon on_chain 6NJC
```
Optional flags

-v, --verbose: print more information about the run to the terminal.

on_set

If you want to perform an analysis on a set of proteins, then run:
```
ProtACon on_set
```
Optional flags

-s, --save_every: choose one of ["none", "plot", "csv", "both"] to selectively save files relating to the single peptide chains in the set. Files relating to the whole set are always saved.
- If --save_every is omitted, it behaves as if "none" were passed, and no files relating to single chains are saved.
- With --save_every plot, the plots for every single chain in the set are saved in a dedicated folder.
- With --save_every csv, the csv files with the amino acid occurrences in every single chain in the set are saved in a dedicated folder.
- If --save_every is given with no value, "both" is used, which is equivalent to passing both "plot" and "csv".
-v, --verbose: print more information about the run to the terminal.
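The flag behaviour described above (a default when the flag is absent, a different value when it is given bare) maps onto a standard `argparse` pattern. The sketch below is not taken from the ProtACon source. The function name `build_parser` and the destination names `command` and `code` are assumptions, and only the options listed in this README are modelled.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented commands and flags."""
    parser = argparse.ArgumentParser(prog="ProtACon")
    subparsers = parser.add_subparsers(dest="command", required=True)

    on_chain = subparsers.add_parser("on_chain", help="analyze a single chain")
    on_chain.add_argument("code", help="PDB entry, e.g. 6NJC")
    on_chain.add_argument("-v", "--verbose", action="store_true")

    on_set = subparsers.add_parser("on_set", help="analyze a set of chains")
    on_set.add_argument(
        "-s", "--save_every",
        nargs="?",        # the value after the flag is optional
        const="both",     # used when the flag is given with no value
        default="none",   # used when the flag is absent
        choices=["none", "plot", "csv", "both"],
    )
    on_set.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, `on_set` alone yields `save_every="none"`, `on_set -s` yields `"both"`, and `on_set -s plot` yields `"plot"`.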

Warning

ProtACon does not overwrite existing plots. If you run the code passing the same plot folder as a previous run, no plots will be saved.

Running tests

Important

Running tests requires you to include the '[test]' extra when installing (see the Installation section).

From the main ProtACon folder, you can run all the tests with the command pytest. To run only the tests on single modules or functions, launch:

pytest -m <marker>

where <marker> is one of the markers listed in the [tool.pytest.ini_options] section of pyproject.toml.

Finally, running:

pytest .

will also run the plugin pytest-pycodestyle, checking the code against PEP8 style conventions.
