RNA molecules are essential macromolecules that perform diverse biological functions in living organisms. Accurate prediction of RNA secondary structure is instrumental in deciphering the three-dimensional architecture and function of RNA. Traditional methods for RNA structure prediction, including energy-based and learning-based approaches, often treat RNA secondary structure as static and rely on stringent a priori constraints. Inspired by the success of diffusion models, we introduce RNADiffFold, a generative approach to RNA secondary structure prediction based on multinomial diffusion. We recast contact map prediction as a task akin to pixel-wise segmentation and train a denoising model to progressively refine contact maps starting from a noisy state. We also devise a conditioning mechanism that uses features extracted from RNA sequences to steer the model toward an accurate secondary structure. These features comprise one-hot encoded sequences, probabilistic maps generated by a pre-trained scoring network, and embeddings and attention maps derived from RNA-FM. Experimental results on both within-family and cross-family datasets demonstrate that RNADiffFold performs competitively with current state-of-the-art methods. Additionally, RNADiffFold captures the dynamic aspects of RNA structures, as supported by its performance on datasets comprising multiple conformations.
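To make the idea of a "noisy" contact map concrete, here is a minimal sketch of one forward noising step for a binary contact map. It assumes the standard multinomial diffusion kernel, in which each entry keeps its class with probability 1 - beta and is otherwise resampled uniformly over the K = 2 classes. The function name `noise_contact_map` and the toy map are illustrative only and are not taken from the RNADiffFold code; the sketch also noises every entry independently and does not enforce symmetry of the map.

```python
import random


def noise_contact_map(contact_map, beta, rng):
    """Apply one forward noising step with K = 2 classes (unpaired / paired).

    Resampling uniformly with probability beta means an entry switches class
    with probability beta / 2 and keeps its class with probability 1 - beta / 2.
    """
    num_classes = 2
    flip_prob = beta / num_classes
    return [
        [1 - value if rng.random() < flip_prob else value for value in row]
        for row in contact_map
    ]


# Toy 4x4 contact map: a 1 at (i, j) marks a base pair between positions i and j.
clean = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

rng = random.Random(0)  # fixed seed so the toy example is repeatable
noisy = noise_contact_map(clean, beta=0.3, rng=rng)
for row in noisy:
    print(row)
```

A denoising model is trained to reverse such steps; the conditioning features described above guide that reversal toward the structure of a specific sequence.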
- python >= 3.8
- torch >= 2.0.1 with CUDA >= 11.8
⭐ Note:
- Before using the requirements.yml file, please update the prefix path in the last line to match your own system's path.
- Use the following command to create the environment.
conda env create -f requirements.yml
- Activate the environment.
conda activate RNADiffFold
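Once the environment is active, a quick check against the version requirements listed above can catch setup problems early. The following is a minimal sketch; `parse_version` and `check_environment` are hypothetical helpers written for this example and are not part of the repository.

```python
import importlib.util
import sys


def parse_version(text):
    """Turn a version string such as '2.0.1+cu118' into a tuple like (2, 0, 1)."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits or 0))
    return tuple(numbers)


def check_environment():
    """Report whether Python and torch meet the minimum versions listed above."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in the active environment.
        report["torch_ok"] = False
        return report
    import torch

    report["torch_ok"] = parse_version(str(torch.__version__)) >= (2, 0, 1)
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```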
Pre-trained models are available in the checkpoint. The training and evaluation data are stored in the data, with all data preprocessed for computational efficiency.
We provide the data used for training and evaluating RNADiffFold. Please download the data and place it in the ./data directory. If you wish to train the model with your own data, please preprocess it using the scripts available in the ./preprocess_data directory.
We use the pretrained weights of UFold and RNA-FM to condition the model. If you wish to train RNADiffFold from scratch, please download the conditioner pretrained weights from checkpoint and place them in the ./ckpt/cond_ckpt directory.
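Before starting a long training run, it can help to confirm that the conditioner weights are where the training script will look for them. The sketch below is a hypothetical helper, not part of the repository. The only file name it uses, ufold_train_alldata.pt, is taken from the training command in this README; the RNA-FM weight file name is not stated here, so add it to the list yourself.

```python
from pathlib import Path


def missing_checkpoints(ckpt_dir, expected_names):
    """Return the expected file names that are absent from ckpt_dir."""
    directory = Path(ckpt_dir)
    return [name for name in expected_names if not (directory / name).is_file()]


if __name__ == "__main__":
    # Extend this list with the RNA-FM weight file you downloaded.
    expected = ["ufold_train_alldata.pt"]
    missing = missing_checkpoints("./ckpt/cond_ckpt", expected)
    if missing:
        print("Missing conditioner weights:", ", ".join(missing))
    else:
        print("All expected conditioner weights found.")
```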
Then, run the following command to train the model:
python train.py --device cuda:0 \
    --diffusion_dim 8 \
    --diffusion_steps 20 \
    --cond_dim 8 \
    --dataset all \
    --batch_size 1 \
    --dp_rate 0.1 \
    --lr 0.0001 \
    --warmup 5 \
    --seed 2023 \
    --log_wandb True \
    --epochs 400 \
    --eval_every 20 \
    -u_conditioner_ckpt ufold_train_alldata.pt
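If you prefer to launch training from Python, for example to vary a hyperparameter across runs, the same command can be assembled from a dictionary. This is a minimal sketch: the flags and values are copied verbatim from the command above, while `build_train_command` is a hypothetical helper that is not part of the repository.

```python
import shlex
import sys


def build_train_command(options, script="train.py"):
    """Build an argument list for the training script from a flag-to-value mapping."""
    argv = [sys.executable, script]
    for flag, value in options.items():
        argv.extend([flag, str(value)])
    return argv


# Flags and values copied from the training command in this README.
TRAIN_OPTIONS = {
    "--device": "cuda:0",
    "--diffusion_dim": "8",
    "--diffusion_steps": "20",
    "--cond_dim": "8",
    "--dataset": "all",
    "--batch_size": "1",
    "--dp_rate": "0.1",
    "--lr": "0.0001",
    "--warmup": "5",
    "--seed": "2023",
    "--log_wandb": "True",
    "--epochs": "400",
    "--eval_every": "20",
    "-u_conditioner_ckpt": "ufold_train_alldata.pt",
}

command = build_train_command(TRAIN_OPTIONS)
print(shlex.join(command))
# To actually start training: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and overriding a single entry (for instance `{**TRAIN_OPTIONS, "--seed": "1"}`) gives a new run configuration.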
We provide a test script for users to evaluate the prediction results. Run it with the following command:
python evaluation/eval.py
The prediction results for each sequence will be stored in the ./evaluation/results directory.
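For readers who want to score a prediction by hand, the sketch below computes base-pair precision, recall and F1, which are commonly reported for secondary structure prediction. It is independent of evaluation/eval.py, whose exact metrics and output format are not described here; `pair_metrics` and the toy pairs are illustrative assumptions.

```python
def pair_metrics(predicted_pairs, true_pairs):
    """Compute precision, recall and F1 over sets of base pairs (i, j)."""
    # Sort each pair so that (i, j) and (j, i) count as the same pair.
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    reference = {tuple(sorted(pair)) for pair in true_pairs}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two of the three predicted pairs match the reference.
predicted = [(1, 10), (2, 9), (3, 7)]
reference = [(1, 10), (2, 9), (3, 8)]
precision, recall, f1 = pair_metrics(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```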
We provide a prediction script for users to predict the secondary structure of RNA sequences. Place the RNA sequence data in ./prediction/predict_data in FASTA format, then run the following command to predict the secondary structure:
python prediction/predict.py
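Before running the command above, the input file can be prepared with a few lines of Python. This is a minimal sketch: `format_fasta` and `save_fasta` are hypothetical helpers, and the accepted alphabet (A, C, G, U, T, N) is an assumption, since this README does not state which characters predict.py accepts.

```python
from pathlib import Path

# Assumed alphabet; adjust it to whatever the prediction script accepts.
VALID_BASES = set("ACGUTN")


def format_fasta(records):
    """Format (name, sequence) pairs as FASTA text after a basic alphabet check."""
    lines = []
    for name, sequence in records:
        sequence = sequence.upper()
        unexpected = set(sequence) - VALID_BASES
        if unexpected:
            raise ValueError(f"{name}: unexpected characters {sorted(unexpected)}")
        lines.append(f">{name}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


def save_fasta(records, path):
    """Write the formatted records to path, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(format_fasta(records))


print(format_fasta([("toy_hairpin", "gggaaaccc")]), end="")
```

For example, `save_fasta([("toy_hairpin", "GGGAAACCC")], "./prediction/predict_data/example.fasta")` would write one record; the file name example.fasta is a placeholder.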
The prediction results for each sequence will be stored in the ./prediction/predict_results/ct_files directory.
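The resulting files can be converted to dot-bracket notation for quick inspection. The sketch below assumes the output follows the standard CT column layout (index, base, previous index, next index, pairing partner, original numbering), which is not confirmed here; `ct_to_dot_bracket` is a hypothetical helper. Every pair is written with round brackets, so crossing pairs (pseudoknots) are not distinguished.

```python
def ct_to_dot_bracket(ct_text):
    """Convert CT-format text to a (sequence, dot-bracket structure) pair."""
    lines = [line for line in ct_text.splitlines() if line.strip()]
    length = int(lines[0].split()[0])  # the header starts with the sequence length
    sequence = []
    structure = ["."] * length
    for line in lines[1:length + 1]:
        fields = line.split()
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        # A partner of 0 means unpaired; handle each pair once, from its 5' side.
        if partner > index:
            structure[index - 1] = "("
            structure[partner - 1] = ")"
    return "".join(sequence), "".join(structure)


# Toy hairpin in CT format, used only for illustration.
EXAMPLE_CT = """9 toy_hairpin
1 G 0 2 9 1
2 G 1 3 8 2
3 G 2 4 7 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 C 6 8 3 7
8 C 7 9 2 8
9 C 8 0 1 9
"""

sequence, structure = ct_to_dot_bracket(EXAMPLE_CT)
print(sequence)
print(structure)
```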
This project draws inspiration from multinomial diffusion. We extend our gratitude to the authors for their outstanding research and code, and we hope that readers will find their contributions equally valuable.