ProtFill is an inpainting model for protein sequence and structure co-design that works on antibodies as well as other proteins.
Our model uses custom GVPe message passing layers, which are a modification of GVP with edge updates.
cd protfill
conda create --name protfill python=3.10
conda activate protfill
python -m pip install .
python -m pip install torch_geometric torch_scatter

The datasets can be downloaded from proteinflow.
proteinflow download --tag 20230102_stable
proteinflow download --tag 20230626_sabdab --skip_splitting
rm -r data/proteinflow_20230626_sabdab/splits_dict/
cp -r data/splits_dict data/proteinflow_20230626_sabdab/
proteinflow split --tag 20230626_sabdab

There are four models in this repository, and each can be tested or replicated with its corresponding config file. The differences between the models are explained in the table below. The noising scheme is either replacing the masked data with samples from a Gaussian distribution (standard) or corrupting it with noise (alternative).

| Name | Dataset | Diffusion | Noising scheme |
|---|---|---|---|
| protfill_ab | antibody | no | standard |
| protfilldiff | antibody | yes | standard |
| protfill_ppi_standard_noising | diverse | no | standard |
| protfill_ppi_alternative_noising | diverse | no | alternative |
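
The two noising schemes in the table can be illustrated with a small sketch. This is not ProtFill's implementation: the function names, the flat list of scalar values, the additive Gaussian form of the "alternative" corruption, and the `sigma` value are all assumptions made for illustration.

```python
import random


def standard_noising(values, mask, rng):
    """Replace every masked entry with a sample from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) if m else v for v, m in zip(values, mask)]


def alternative_noising(values, mask, rng, sigma=0.1):
    """Keep the masked entries but perturb them with Gaussian noise.

    Assumption: the corruption is additive; sigma is a hypothetical scale.
    """
    return [v + rng.gauss(0.0, sigma) if m else v for v, m in zip(values, mask)]


rng = random.Random(0)
coords = [1.0, 2.0, 3.0, 4.0]        # toy stand-in for masked structure data
mask = [False, True, True, False]    # True marks the region to redesign

standard = standard_noising(coords, mask, rng)
alternative = alternative_noising(coords, mask, rng)
```

In both cases the unmasked entries pass through unchanged. The difference is that the standard scheme discards the original masked values, while the alternative scheme keeps them as the centre of the noise.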
In order to retrain one of the models, run this command with one of the config names.
protfill --config configs/train/NAME.yaml --dataset_path DATASET_PATH

For example:

protfill --config configs/train/protfill_ab.yaml --dataset_path data/proteinflow_20230626_sabdab

To test one of our pre-trained models on the 'easy' test subset, run the following.

protfill --config configs/test/NAME.yaml --dataset_path DATASET_PATH --easy_test

To test on the 'hard' subset, replace --easy_test with --hard_test. To test on a specific CDR, add e.g. --redesign_cdr H3. Note that the 'hard' antibody subset does not contain light chains, and the diverse dataset does not have CDRs or an 'easy' test subset.
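
When scripting several test runs, the flag combinations described above can be checked before launching anything. The helper below is a hypothetical sketch, not part of ProtFill: `build_test_command` and its `dataset_kind` parameter are invented for illustration, and the light-chain check assumes light-chain CDR names start with `L`. Only the command-line flags themselves come from this README.

```python
import shlex


def build_test_command(config_name, dataset_path, subset="easy", cdr=None,
                       dataset_kind="antibody"):
    """Assemble a protfill test command as an argument list."""
    if subset not in ("easy", "hard"):
        raise ValueError("subset must be 'easy' or 'hard'")
    if dataset_kind == "diverse":
        if subset == "easy":
            raise ValueError("the diverse dataset has no 'easy' test subset")
        if cdr is not None:
            raise ValueError("the diverse dataset has no CDRs")
    # Assumption: light-chain CDRs are named L1, L2, L3.
    if (dataset_kind == "antibody" and subset == "hard"
            and cdr is not None and cdr.startswith("L")):
        raise ValueError("the 'hard' antibody subset has no light chains")

    args = ["protfill",
            "--config", f"configs/test/{config_name}.yaml",
            "--dataset_path", dataset_path,
            f"--{subset}_test"]
    if cdr is not None:
        args += ["--redesign_cdr", cdr]
    return args


cmd = build_test_command("protfill_ab", "data/proteinflow_20230626_sabdab",
                         subset="hard", cdr="H3")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues in dataset paths.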
To redesign a part of a new file, run this. The file can have either a .pdb or a .pickle extension, with the pickle files being generated by proteinflow.
protfill --config configs/test/NAME.yaml --redesign_file 7kgk.pdb

By default, this command redesigns a random part of the protein. To redesign specific positions, use the --redesign_positions option. The argument should be in the format chain:start1-end1,start2-end2, e.g. A:5-10,20-21,30-40. The numbering is 0-indexed; the starts are included in the selected slice and the ends are not, so 5-10 selects positions 5 through 9. For PDB files, the chain name is the author chain name. For pickle files, the numbering should be based on the fasta chain. If the file was generated by proteinflow with CDR information, the command also accepts the --redesign_cdr option to redesign a specific CDR, e.g. --redesign_cdr H3.
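
To make the 0-indexed, end-exclusive convention concrete, here is a minimal sketch that expands a --redesign_positions string into explicit indices. The function `parse_redesign_positions` is hypothetical and written only to illustrate the format described above; it is not how ProtFill itself parses the argument.

```python
def parse_redesign_positions(spec):
    """Parse 'chain:start1-end1,start2-end2' into (chain, [indices]).

    Starts are inclusive and ends are exclusive, as with Python slices.
    """
    chain, _, ranges = spec.partition(":")
    if not chain or not ranges:
        raise ValueError(f"expected 'chain:start-end,...', got {spec!r}")
    indices = []
    for part in ranges.split(","):
        start_text, _, end_text = part.partition("-")
        start, end = int(start_text), int(end_text)
        if start < 0 or end <= start:
            raise ValueError(f"invalid range {part!r}")
        indices.extend(range(start, end))
    return chain, indices


chain, indices = parse_redesign_positions("A:5-10,20-21,30-40")
print(chain, len(indices), indices[:6])  # prints: A 16 [5, 6, 7, 8, 9, 20]
```

Note that 20-21 selects the single position 20, and position 10 is not part of the first range.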

