RNADiffFold: Generative RNA Secondary Structure Prediction using Discrete Diffusion Models

Abstract

RNA molecules are essential macromolecules that perform diverse biological functions in living organisms. Accurate prediction of RNA secondary structure is instrumental in deciphering their complex three-dimensional architecture and function. Traditional approaches to RNA structure prediction, both energy-based and learning-based, often treat secondary structure as static and rely on stringent a priori constraints. Inspired by the success of diffusion models, we introduce RNADiffFold, a generative approach to RNA secondary structure prediction based on multinomial diffusion. We treat contact map prediction as a pixel-wise segmentation task and train a denoising model to progressively refine contact maps starting from a noise-infused state. We also devise a conditioning mechanism that uses features extracted from RNA sequences to steer the model toward an accurate secondary structure. These features comprise one-hot encoded sequences, probability maps generated by a pre-trained scoring network, and embeddings and attention maps derived from RNA-FM. Experimental results on both within-family and cross-family datasets show that RNADiffFold is competitive with current state-of-the-art methods. RNADiffFold also captures dynamic aspects of RNA structure, as supported by its performance on datasets containing multiple conformations.
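
To make the contact-map view in the abstract concrete, here is a minimal pure-Python sketch. It is not the code in this repository. It builds a symmetric contact map from a list of base pairs and applies the forward noising step of multinomial diffusion with two classes (paired or unpaired), where each entry keeps its value with probability 1 - beta and is otherwise resampled uniformly. The function names, the toy hairpin, the value beta = 0.1, and the choice to mirror noise across the diagonal are assumptions made for illustration.

```python
import random


def contact_map(length, pairs):
    """Return a symmetric length x length 0/1 matrix for 0-based base pairs."""
    grid = [[0] * length for _ in range(length)]
    for i, j in pairs:
        grid[i][j] = 1
        grid[j][i] = 1
    return grid


def noise_step(grid, beta, rng):
    """One forward multinomial-diffusion step with K = 2 classes.

    Each upper-triangle entry keeps its value with probability 1 - beta and is
    otherwise resampled uniformly from {0, 1}. The result is mirrored so the
    map stays symmetric (an assumption of this sketch).
    """
    n = len(grid)
    noisy = [row[:] for row in grid]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < beta:
                noisy[i][j] = noisy[j][i] = rng.randrange(2)
    return noisy


# Toy hairpin (hypothetical): positions 0-7 and 1-6 are paired.
clean = contact_map(8, [(0, 7), (1, 6)])
rng = random.Random(0)
noisy = clean
for _ in range(20):  # 20 noising steps; beta = 0.1 is an illustrative value
    noisy = noise_step(noisy, beta=0.1, rng=rng)
```

A denoising model is trained to reverse steps like these, conditioned on sequence features.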

Prerequisites

python >= 3.8
torch >= 2.0.1 with CUDA >= 11.8
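
The version requirements above can be checked before installing anything else. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it checks the Python and torch versions only, not the GPU toolkit.

```python
import sys
from importlib import metadata


def version_tuple(text):
    """Turn '2.0.1+cu118' into (2, 0, 1), ignoring local or pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 8)
torch_version = installed_version("torch")
torch_ok = torch_version is not None and meets_minimum(torch_version, "2.0.1")
print(f"python >= 3.8: {python_ok}")
print(f"torch >= 2.0.1: {torch_ok} (found {torch_version})")
```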

⭐ Note:

Before using the requirements.yml file, please update the prefix path in the last line to match your own system's path.
Use the following command to create the environment.

conda env create -f requirements.yml

Activate the environment.

conda activate RNADiffFold
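
The prefix update mentioned in the note can also be scripted. Below is a minimal sketch, assuming the environment file ends with a top-level `prefix:` line as conda exports usually do. The function name, the stand-in file contents, and both paths are hypothetical, and the demonstration runs against a temporary copy rather than the real requirements.yml.

```python
import tempfile
from pathlib import Path


def set_conda_prefix(yml_path, new_prefix):
    """Rewrite the top-level 'prefix:' line of a conda environment file in place.

    Returns True if a prefix line was found and replaced, False otherwise.
    """
    path = Path(yml_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if line.startswith("prefix:"):
            lines[index] = f"prefix: {new_prefix}"
            replaced = True
    if replaced:
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return replaced


# Demonstration on a stand-in file; the contents below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "requirements.yml"
    demo.write_text(
        "name: RNADiffFold\n"
        "dependencies:\n"
        "  - python=3.8\n"
        "prefix: /home/someone/miniconda3/envs/RNADiffFold\n",
        encoding="utf-8",
    )
    changed = set_conda_prefix(demo, "/opt/conda/envs/RNADiffFold")
    result = demo.read_text(encoding="utf-8")
    print(result)
```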

Pre-trained Models and Data

Pre-trained models are available at the checkpoint link. The training and evaluation data are available at the data link; all data have been preprocessed for computational efficiency.

Usage

Training

We provide the data used for training and evaluating RNADiffFold. Please download the data and place it in the ./data directory. If you wish to train the model with your own data, please preprocess it using the scripts available in the ./preprocess_data directory.

We use the pretrained weights of UFold and RNA-FM to condition the model. To train RNADiffFold from scratch, download the pretrained conditioner weights from the checkpoint link and place them in the ./ckpt/cond_ckpt directory.
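
Before starting a long training run, it can help to confirm that the conditioner weights are where the paragraph above says they should be. This is a minimal sketch with a hypothetical helper name. The only file name it checks is `ufold_train_alldata.pt`, which is the one passed to the training command in this README. The RNA-FM weight file name is not stated here, so it is left for you to add.

```python
from pathlib import Path


def missing_checkpoints(directory, expected_names):
    """Return the expected file names that are not present in directory."""
    folder = Path(directory)
    return [name for name in expected_names if not (folder / name).is_file()]


# Extend this list with the RNA-FM weight file name once you know it.
expected = ["ufold_train_alldata.pt"]
missing = missing_checkpoints("./ckpt/cond_ckpt", expected)
if missing:
    print("Missing conditioner weights:", ", ".join(missing))
else:
    print("All listed conditioner weights found.")
```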

Then, run the following command to train the model:

python train.py --device cuda:0 \
                --diffusion_dim 8 \
                --diffusion_steps 20 \
                --cond_dim 8 \
                --dataset all \
                --batch_size 1 \
                --dp_rate 0.1 \
                --lr 0.0001 \
                --warmup 5 \
                --seed 2023 \
                --log_wandb True \
                --epochs 400 \
                --eval_every 20 \
                --u_conditioner_ckpt ufold_train_alldata.pt
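
When launching several runs with different settings, the command above can be assembled programmatically instead of edited by hand. The sketch below uses the flag names and values from this README. The helper name is hypothetical, and it assumes every flag, including the conditioner checkpoint flag, takes the double-dash form.

```python
import shlex
import sys

# Default values copied from the training command in this README.
DEFAULTS = {
    "device": "cuda:0",
    "diffusion_dim": 8,
    "diffusion_steps": 20,
    "cond_dim": 8,
    "dataset": "all",
    "batch_size": 1,
    "dp_rate": 0.1,
    "lr": 0.0001,
    "warmup": 5,
    "seed": 2023,
    "log_wandb": True,
    "epochs": 400,
    "eval_every": 20,
    "u_conditioner_ckpt": "ufold_train_alldata.pt",
}


def build_train_command(overrides=None, script="train.py"):
    """Return the training command as an argument list, applying any overrides."""
    options = dict(DEFAULTS)
    options.update(overrides or {})
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


command = build_train_command({"device": "cuda:1", "seed": 7})
print(shlex.join(command))
# To launch the run: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting and line-continuation issues.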

Evaluating

We provide a test script for evaluating the prediction results. Run it with the following command:

python evaluation/eval.py

The prediction results for each sequence are stored in the ./evaluation/results directory.
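
A common way to score a predicted secondary structure against a reference is precision, recall, and F1 over base pairs. Whether evaluation/eval.py reports exactly these quantities is an assumption. The sketch below only illustrates the calculation, and its function name and toy pairs are hypothetical.

```python
def pair_metrics(predicted, reference):
    """Return (precision, recall, f1) for two collections of base pairs.

    Pairs are treated as unordered, so (3, 8) and (8, 3) count as the same pair.
    """
    pred = {tuple(sorted(pair)) for pair in predicted}
    ref = {tuple(sorted(pair)) for pair in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: two of three predicted pairs match the reference.
precision, recall, f1 = pair_metrics(
    [(0, 7), (1, 6), (2, 4)],
    [(0, 7), (1, 6), (2, 5)],
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```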

Predicting

We provide a prediction script for predicting the secondary structure of RNA sequences. Place the RNA sequence data in ./prediction/predict_data in FASTA format, then run the following command:

python prediction/predict.py
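
Input for the command above can be prepared with a few lines of Python. This is a minimal sketch of writing sequences in FASTA format. The helper names, the record name, the example sequence, and the output file name and extension are hypothetical, since this README does not state which file names predict.py expects.

```python
from pathlib import Path


def format_fasta(records):
    """Format (name, sequence) pairs as FASTA text, one line per sequence."""
    chunks = []
    for name, sequence in records:
        sequence = "".join(sequence.split()).upper()
        if not name or not sequence.isalpha():
            raise ValueError(f"invalid record: {name!r}")
        chunks.append(f">{name}\n{sequence}\n")
    return "".join(chunks)


def write_fasta(records, path):
    """Write the records to path, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(format_fasta(records), encoding="utf-8")
    return target


text = format_fasta([("example_hairpin", "GGGAAAUCCC")])
print(text, end="")
# To save it where the prediction script reads input (file name is hypothetical):
# write_fasta([("example_hairpin", "GGGAAAUCCC")], "./prediction/predict_data/example.fasta")
```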

The prediction results for each sequence are stored in the ./prediction/predict_results/ct_files directory.
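
The directory name suggests the output uses the connectivity table (CT) format. The sketch below assumes the conventional layout: a header line beginning with the sequence length, then one line per nucleotide whose fifth column is the 1-based index of its pairing partner, or 0 if unpaired. It has not been checked against files produced by this repository, and the function names and sample text are hypothetical.

```python
def parse_ct(text):
    """Return (sequence, pairs) from CT-format text; pairs are 1-based (i, j) with i < j."""
    lines = [line for line in text.splitlines() if line.strip()]
    length = int(lines[0].split()[0])
    sequence = []
    pairs = []
    for line in lines[1:length + 1]:
        fields = line.split()
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner > index:  # record each pair once, from its 5' side
            pairs.append((index, partner))
    return "".join(sequence), pairs


def to_dot_bracket(length, pairs):
    """Render nested pairs as dot-bracket; pseudoknots are not representable here."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i - 1] = "("
        symbols[j - 1] = ")"
    return "".join(symbols)


SAMPLE_CT = """10 example_hairpin
1 G 0 2 10 1
2 G 1 3 9 2
3 G 2 4 8 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 U 6 8 0 7
8 C 7 9 3 8
9 C 8 10 2 9
10 C 9 0 1 10
"""

sequence, pairs = parse_ct(SAMPLE_CT)
structure = to_dot_bracket(len(sequence), pairs)
print(sequence)
print(structure)
```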

Acknowledgements

This project draws inspiration from multinomial diffusion. We extend our gratitude to the authors for their outstanding research and code, and we hope that readers will find their contributions equally valuable.

