ProtACon is a broad project that aims to explore and interpret the representations generated by transformers when they are applied to proteins. Take a look at our experiment report to discover what you can do with ProtACon. 🧑‍🔬
Note
You are on the main, stable branch. Check out the "advanced" branch for an extended (yet less documented) version with more features.
The goal is to detect possible connections and similarities between the attention weights generated by the ProtBert transformer, and the physico-chemical properties of the proteins that are fed into it.
This project was inspired by the work of Jesse Vig and colleagues, "BERTology Meets Biology", which showed that BERT-based models can capture high-level structural properties of proteins when given only the sequence of amino acids in the protein chains.
👉 Check out the code documentation at the reference guide.
The whole pipeline is founded on two pillars: the ProtBert encoder for the extraction of the attention, and the RCSB Protein Data Bank to get the protein structural information.
Starting from a PDB entry (a 4-character alphanumeric code uniquely identifying one protein), the PDB file of the corresponding protein is downloaded. Proteins often have more than one chain, so only the first one is picked. From that chain, the amino acid sequence of the residues is stored, together with the coordinates of the α-carbon atom of each residue. The amino acid sequence is then passed to the ProtBert encoder, and the attention from each head of the model is extracted.
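The steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ProtACon's actual code: the function names `first_chain_ca` and `to_protbert_input` are hypothetical, the parser reads only the fixed-column `ATOM` records of a PDB file instead of using Biopython, and the space-separated formatting with rare residues mapped to `X` follows the usage shown on the ProtBert model card.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def first_chain_ca(pdb_lines):
    """Return (sequence, ca_coords) for the first chain in pdb_lines.

    Hypothetical helper: it keeps one alpha-carbon per residue of the
    first chain encountered and ignores every other record.
    """
    first_chain = None
    sequence = []
    ca_coords = []
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # keep only the first model of multi-model files
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue  # not an alpha-carbon
        if line[16] not in (" ", "A"):
            continue  # skip alternate locations other than the first
        chain_id = line[21]
        if first_chain is None:
            first_chain = chain_id
        if chain_id != first_chain:
            continue  # only the first chain is picked
        # Unknown residue names fall back to "X".
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
        ca_coords.append(
            (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        )
    return "".join(sequence), ca_coords


def to_protbert_input(sequence):
    """Map rare residues to X and separate residues with spaces."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))
```

The string returned by `to_protbert_input` is what would be handed to the tokenizer; loading the model and requesting the attention weights is left out of this sketch.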
ProtACon has two possible uses: either on single peptide chains, or on a set of them.
For both uses, the main results you get from the run are the attention alignment with the protein contact map and the attention similarity between amino acids; see our report for their definitions. Besides that, several other quantities are computed and saved in dedicated folders.
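The report holds the project's own definitions. As a rough illustration only, the sketch below computes an attention alignment in the spirit of Vig et al.: the fraction of high-attention residue pairs that are also in contact. The function names, the 8 Å cutoff, the minimum sequence separation of 6 and the 0.3 attention threshold are assumptions chosen for the example, not ProtACon's settings.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=6):
    """Boolean matrix: True where two residues are within `cutoff` angstroms.

    Pairs closer than `min_separation` positions along the sequence are
    never counted as contacts (assumed convention, for illustration).
    """
    n = len(ca_coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= min_separation:
                distance = math.dist(ca_coords[i], ca_coords[j])
                contacts[i][j] = distance <= cutoff
    return contacts


def attention_alignment(attention, contacts, threshold=0.3):
    """Fraction of attention weights above `threshold` that fall on contacts."""
    high = 0
    aligned = 0
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            if weight > threshold:
                high += 1
                aligned += contacts[i][j]
    return aligned / high if high else 0.0
```

Here `attention` stands for the square matrix of one attention head, restricted to the residue tokens so that its indices match those of the contact map.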
The run on a set of chains does not save the quantities relative to the single chains on your device, unless you request otherwise, but computes and stores the averages of those quantities over the whole protein set. Find the guide to the output files in the wiki section.
ProtACon integrates the PDB Search API in its pipeline. Thus, when running on sets of proteins, you have two ways to choose the composition of the set:
- by passing the complete list of PDB entries;
- by providing some parameters for a search in the PDB API, such as the minimum and maximum number of residues in each chain, and the number of chains making up the set.
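To give an idea of what the second option involves, here is a minimal sketch of a request body for the RCSB PDB Search API, built from the parameters listed above. It is an assumption-laden example rather than ProtACon's actual query: the function name `build_search_query` is hypothetical, and the attribute `entity_poly.rcsb_sample_sequence_length` and the overall layout of the body should be checked against the Search API documentation before use.

```python
import json


def build_search_query(min_residues, max_residues, n_results):
    """Build a Search API request body (field names are assumptions to verify)."""
    if min_residues > max_residues:
        raise ValueError("min_residues must not exceed max_residues")
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute holding the polymer sequence length.
                "attribute": "entity_poly.rcsb_sample_sequence_length",
                "operator": "range",
                "value": {
                    "from": min_residues,
                    "to": max_residues,
                    "include_lower": True,
                    "include_upper": True,
                },
            },
        },
        "return_type": "entry",
        # Ask for at most n_results identifiers, starting from the first.
        "request_options": {"paginate": {"start": 0, "rows": n_results}},
    }


if __name__ == "__main__":
    # Print the JSON body without sending any request.
    print(json.dumps(build_search_query(50, 300, 10), indent=2))
```

The sketch only builds and prints the JSON body; sending it and collecting the returned PDB entries is left out.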
Look at the wiki section for more information about configuring your experiment.
Two prerequisites are needed:
- An environment with Python-3.10.15 🐍.
- The GCC package installed, which is required for Biopython to work correctly.
  - If you are in a conda environment, you can install it with:
    conda install conda-forge::gcc
  - Otherwise, you can run the following command (requires root privileges):
    apt-get install gcc

  The code was verified to work correctly with gcc-11.4.0 and gcc-14.2.0.
To install ProtACon, execute the following commands:
git clone https://github.com/sim1-99/ProtACon.git
cd ProtACon
Then, install with:
pip install .
Once you have installed it, you can run the application from any path.
To install the repo in developer mode, run this command instead:
pip install -e .
This installation lets you edit the code without having to reinstall it every time for the changes to take effect.
If you also want to install the required packages to run the test suite, run:
pip install .'[test]'
or, in developer mode:
pip install -e .'[test]'
The list of installed dependencies can be found in pyproject.toml.
You can launch different scripts by typing ProtACon in the command line, followed by one of the following commands.
- on_chain
  If you want to analyze a single protein, for example 6NJC, run:
  ProtACon on_chain 6NJC
  Optional flags
  - -v, --verbose: print more information about the run to the terminal.
- on_set
  If you want to perform an analysis on a set of proteins, run:
  ProtACon on_set
  Optional flags
  - -s, --save_every: choose one of ["none", "plot", "csv", "both"] to selectively save files relative to the single peptide chains in the set. Files relative to the whole set are always saved.
    - If --save_every is not added at all, it is as if "none" were passed, and no files relative to single chains are saved.
    - If --save_every plot is added, the plots relative to every single chain in the set are saved in a dedicated folder.
    - If --save_every csv is added, the csv files with the amino acid occurrences in every single chain in the set are saved in a dedicated folder.
    - If --save_every is added with no options, "both" is passed, which is equivalent to passing both "plot" and "csv".
  - -v, --verbose: print more information about the run to the terminal.
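The behavior of the flags described above can be reproduced with Python's standard argparse module. The sketch below is an assumption about how such a command line could be declared, not ProtACon's actual parser; the function name `build_parser` and the positional argument name `code` are hypothetical.

```python
import argparse


def build_parser():
    """Declare a command line with the two commands and their flags."""
    parser = argparse.ArgumentParser(prog="ProtACon")
    subparsers = parser.add_subparsers(dest="command", required=True)

    on_chain = subparsers.add_parser("on_chain")
    on_chain.add_argument("code")  # hypothetical name for the PDB entry
    on_chain.add_argument("-v", "--verbose", action="store_true")

    on_set = subparsers.add_parser("on_set")
    on_set.add_argument(
        "-s", "--save_every",
        nargs="?",        # the flag may be given with or without a value
        const="both",     # value used when the flag has no option
        default="none",   # value used when the flag is absent
        choices=["none", "plot", "csv", "both"],
    )
    on_set.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["on_set", "--save_every"])
    print(args.save_every)
```

The combination of `nargs="?"`, `const` and `default` is what lets a single flag cover the three cases: absent, given alone, and given with a value.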
Warning
ProtACon does not overwrite existing plots. If you run the code passing the same plot folder as a previous run, no plots will be saved.
Important
Running the tests requires you to include the '[test]' extra when installing (see the section Installation).
When in the main ProtACon folder, you can run all the tests with the command pytest, or run tests on single modules and functions by launching:
pytest -m <marker>
where <marker> is one of the markers listed in the [tool.pytest.ini_options] section of pyproject.toml.
Finally, running:
pytest .
will also run the plugin pytest-pycodestyle, checking the code against PEP8 style conventions.