A PyTorch implementation of: EquiCPI: SE(3)-Equivariant Geometric Deep Learning for Structure-Aware Prediction of Compound-Protein Interactions
EquiCPI is a model that combines multiple e3nn (Euclidean neural network) modules, built around the SE(3) group, to predict binding affinity for compound-protein interactions (CPI). These networks apply the principles of equivariance and invariance to 3D molecular structures, so the extracted features stay consistent under transformations such as rotations, translations, and reflections. The model is trained and validated on 3D structures predicted from sequence data rather than on sequences alone.
To achieve this, we use:
- DiffDock-L for predicting the 3D structures of compounds.
- ESMFold for predicting protein 3D folds from sequences.
Traditional sequence-based CPI prediction models rely on molecular fingerprints, descriptors, or graphs, often overlooking critical 3D structural information. EquiCPI, built on Euclidean neural networks (e3nn), fully utilizes the SE(3) group to process 3D structures, providing more accurate and structure-aware CPI predictions.
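The two properties named above can be illustrated without any deep learning library. The sketch below is not EquiCPI code: it uses made-up toy coordinates and only the Python standard library to show that pairwise distances are invariant (unchanged) under rotation, translation, and reflection, while the centroid is equivariant (it moves with the input). Strictly speaking, reflections belong to the larger group E(3) rather than SE(3); distances are invariant under them as well.

```python
import math

def rotate_z(p, angle):
    """Rotate point p = (x, y, z) about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, t):
    """Shift point p by the vector t."""
    return tuple(a + b for a, b in zip(p, t))

def reflect_x(p):
    """Mirror point p through the yz-plane."""
    return (-p[0], p[1], p[2])

def pairwise_distances(points):
    """All distances between distinct pairs of points, in a fixed order."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]

def centroid(points):
    """Mean position of the points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

# Toy coordinates (hypothetical, not taken from any real molecule).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3), (-0.7, 0.4, 2.0)]
angle, shift = 0.8, (3.0, -1.0, 2.5)

moved = [translate(rotate_z(p, angle), shift) for p in atoms]
mirrored = [reflect_x(p) for p in moved]

def same(a, b):
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))

# Invariance: the distances do not change under any of the transformations.
distances_invariant = (
    same(pairwise_distances(atoms), pairwise_distances(moved))
    and same(pairwise_distances(atoms), pairwise_distances(mirrored))
)

# Equivariance: transforming the input transforms the centroid the same way.
centroid_equivariant = same(
    centroid(moved), translate(rotate_z(centroid(atoms), angle), shift)
)

print(distances_invariant, centroid_equivariant)
```

An equivariant network generalizes the second property to learned features: rotating the input rotates vector-valued features, while scalar outputs such as a predicted affinity stay fixed.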
- Python 3.9
- PyTorch 2.1.2 + CUDA 11.8
# Clone the repository and enter it
git clone https://github.com/dmis-lab/EquiCPI.git
cd EquiCPI
# Create the environment, then activate it with `conda activate` and the name defined in environment.yml
conda env create -f environment.yml
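After activating the environment, a quick check like the one below can confirm that the interpreter and PyTorch match the requirements listed above. This is a hypothetical helper, not a script shipped with the repository; it only uses `torch.__version__` and `torch.cuda.is_available()`, and it reports `None` instead of failing when PyTorch is not installed.

```python
import importlib.util
import sys

def check_environment(min_python=(3, 9)):
    """Collect interpreter and PyTorch details for a quick sanity check."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        "torch": None,
        "cuda_available": None,
    }
    # Only import torch if it is actually installed in this environment.
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report

report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```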
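To convert a .pdb file into a .pt file containing a 3D protein graph (replace each `<...>` placeholder with your own value; a literal `#` would be read by the shell as the start of a comment):
python generate_graph_for_protein.py <output_ESM> <file_protein_name.csv> <processed_dir> <name_of_file.pt>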
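To give a sense of what a 3D protein graph contains, here is a minimal sketch that is not the repository's `generate_graph_for_protein.py`. It assumes one node per residue (placed at the C-alpha atom) and an edge between residues closer than a cutoff; the 8 Å cutoff, the function names, and the dictionary layout are all assumptions for illustration. Coordinates are read from the fixed columns of PDB `ATOM` records.

```python
import math

def parse_ca_atoms(pdb_text):
    """Return [(residue_name, (x, y, z)), ...] for every C-alpha ATOM record."""
    residues = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, residue name in 18-20,
        # x/y/z in columns 31-38, 39-46, 47-54 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            residues.append((line[17:20].strip(), xyz))
    return residues

def radius_edges(coords, cutoff=8.0):
    """Undirected edges (i, j) with i < j for points within `cutoff` angstroms."""
    n = len(coords)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= cutoff
    ]

def build_protein_graph(pdb_text, cutoff=8.0):
    """Assemble node labels, node positions, and edges into a plain dict."""
    residues = parse_ca_atoms(pdb_text)
    coords = [xyz for _, xyz in residues]
    return {
        "node_labels": [name for name, _ in residues],
        "pos": coords,
        "edges": radius_edges(coords, cutoff),
    }
```

Calling `build_protein_graph(open("protein.pdb").read())` on an ESMFold output would return the node and edge lists; the actual script additionally serializes its graph to a `.pt` file, and its node features and edge definition may differ from this sketch.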
Our workflow proceeds as follows:
- SMILES strings representing compounds.
- Amino acid sequences defining proteins.
- DiffDock-L & ESMFold generating 3D structures of compounds and proteins.
- AutoDock Vina predicting binding affinities and identifying optimal docking poses.
To re-rank predicted complexes based on Vina docking scores, run (replacing each `<...>` placeholder with your own path):
python ./vina_score/vina_function_rerank_regu.py <prediction_output_diffdock> <dataset.csv>
Note: dataset.csv must contain the compound.sdf and protein.pdb files.
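The idea behind re-ranking can be sketched as follows. This is an illustration only, not the logic of `vina_function_rerank_regu.py`: the `Pose` record, its field names, and the scores are hypothetical. It relies on the Vina convention that scores are estimated binding energies in kcal/mol, so a more negative score indicates stronger predicted binding.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    complex_id: str    # hypothetical identifier for a compound-protein pair
    rank: int          # original rank from the pose generator (1 = top)
    vina_score: float  # kcal/mol; more negative = stronger predicted binding

def rerank_by_vina(poses):
    """Group poses by complex and sort each group by ascending Vina score.

    Ties are broken by the original rank, so the generator's preferred pose wins.
    """
    grouped = {}
    for pose in poses:
        grouped.setdefault(pose.complex_id, []).append(pose)
    return {
        cid: sorted(group, key=lambda p: (p.vina_score, p.rank))
        for cid, group in grouped.items()
    }

# Made-up scores for two complexes.
poses = [
    Pose("c1", 1, -6.2),
    Pose("c1", 2, -7.9),
    Pose("c1", 3, -5.0),
    Pose("c2", 1, -8.1),
    Pose("c2", 2, -8.1),
]
best = {cid: ranked[0] for cid, ranked in rerank_by_vina(poses).items()}
print(best["c1"].rank, best["c1"].vina_score)  # prints: 2 -7.9
```

In this toy input the pose originally ranked second for `c1` becomes the top pose after re-ranking, which is the kind of reordering the Vina step is meant to produce.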
python generate_pt_dataset.py <machine_learning_task> <data_name> <data_csv_file.csv>
bash run_class.sh
We utilize several datasets for training and evaluation:
EquiCPI builds upon the source code and data from the following projects:
- DiffDock-L – Deep confident steps to new pockets.
- ESMFold – Atomic-level protein structure prediction.
- AutoDock-Vina – Molecular docking software.
We sincerely appreciate all contributors and maintainers for their efforts! 🙌
This repository follows the license terms of the EquiCPI project (MIT License).
If you use this code or dataset in your research, please cite:
@misc{nguyen2025equicpise3equivariantgeometricdeep,
title={EquiCPI: SE(3)-Equivariant Geometric Deep Learning for Structure-Aware Prediction of Compound-Protein Interactions},
author={Ngoc-Quang Nguyen},
year={2025},
eprint={2504.04654},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.04654},
}