Aptamers are single-stranded nucleic acid ligands renowned for their high affinity and specificity to target molecules. Traditionally, they are identified from large DNA/RNA libraries using in vitro methods such as Systematic Evolution of Ligands by Exponential Enrichment (SELEX). However, these libraries capture only a small fraction of the theoretical sequence space, and the pool of aptamer candidates is further limited by the sequencing capacity of the experiment. To address this, we propose AptaDiff, the first in silico aptamer design and optimization method based on the diffusion model. AptaDiff can generate aptamers beyond the constraints of high-throughput sequencing data by leveraging motif-dependent latent embeddings from a variational autoencoder (VAE), and it can optimize aptamers through affinity-guided generation driven by Bayesian optimization (BO). Comparative evaluations showed that AptaDiff outperforms existing aptamer generation methods in quality and fidelity across four high-throughput screening datasets targeting distinct proteins. Moreover, our de novo designed aptamers displayed higher binding affinity than the top SELEX-screened experimental candidates for two target proteins. These results demonstrate that AptaDiff can significantly expedite the discovery of superior aptamers.
- Ubuntu == 20.04
- python == 3.8
- pytorch == 1.9.1
- cuda == 11.1
The first stage trains a VAE to learn a low-dimensional, motif-dependent aptamer representation:
python vae/scripts/real.py data/raw_data/datasetA_IGFBP3_P6.csv \
0.001
The trained VAE model is saved in vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl
To embed sequences, use encode.py, which takes the sequences and a trained model as input and outputs each sequence's representation vector. The VAE encodes each sequence into the latent space as a distribution, so the output representation vector is the center of that distribution.
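As a small illustration of what "the center of the distribution" means, the sketch below assumes the encoder returns a mean and a log-variance per latent dimension, which is a common VAE parameterization but is not confirmed here for encode.py. The names `mu`, `logvar`, `embed`, and `sample` are hypothetical, and the numbers are placeholders.

```python
import math
import random

def embed(mu, logvar):
    """Deterministic representation: the center (mean) of the latent Gaussian."""
    return list(mu)

def sample(mu, logvar, rng):
    """Stochastic draw z = mu + sigma * eps, with sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

# Hypothetical encoder output for one sequence in a 2-D latent space.
mu = [0.56, 0.38]
logvar = [-2.0, -2.0]

center = embed(mu, logvar)                    # what an embedding file would store
draw = sample(mu, logvar, random.Random(0))   # a noisy point around the center
```

Using the mean rather than a random draw makes the embedding of a given sequence reproducible from run to run.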
Run:
python vae/scripts/encode.py data/raw_data/datasetA_IGFBP3_P6.csv \
vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl
This outputs each sequence's representation vector in the following format:
index,seq,dim1,dim2
0,CGACATGGGCCGCCCAAGGA,0.56,0.38
1,GCGTACCGTAAATCTGTCGG,0.18,0.34
...
The default output path is vae/out/encode/embed_datasetA.csv.
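For downstream inspection, the embedding file can be read with the Python standard library. The sketch below relies only on the column layout shown above (index, seq, then one column per latent dimension); the helper name `load_embeddings` is hypothetical and not part of the repository.

```python
import csv
import io

def load_embeddings(handle):
    """Read rows of index,seq,dim1,dim2,... into (sequence, [coordinates]) pairs."""
    reader = csv.DictReader(handle)
    rows = []
    for row in reader:
        # Every column other than index and seq is treated as a latent coordinate.
        coords = [float(v) for k, v in row.items() if k not in ("index", "seq")]
        rows.append((row["seq"], coords))
    return rows

# Inline sample mirroring the format shown above; in practice, pass
# open("vae/out/encode/embed_datasetA.csv", newline="") instead.
sample_file = io.StringIO(
    "index,seq,dim1,dim2\n"
    "0,CGACATGGGCCGCCCAAGGA,0.56,0.38\n"
    "1,GCGTACCGTAAATCTGTCGG,0.18,0.34\n"
)
embeddings = load_embeddings(sample_file)
```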
Next, convert the file vae/out/encode/embed_datasetA.csv into the input format of the diffusion model. The default path of the converted file is data/diffusion_data/datasetA_IGFBP3_x_z.csv.
Run:
python diffusion/train.py --data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
--dataset datasetA \
--batch_size 32 \
--update_freq 1 \
--lr 0.0001 \
--epochs 1000 \
--eval_every 2 \
--check_every 20 \
--diffusion_steps 1000 \
--transformer_dim 512 \
--transformer_heads 16 \
--transformer_depth 12 \
--transformer_blocks 1 \
--transformer_local_heads 8 \
--transformer_local_size 1 \
--gamma 0.99 \
--log_wandb True
The AptaDiff model is saved in: diffusion/out/datasetA/aptadiff_z/.../check/checkpoint.pt
python vae/scripts/gmm.py data/raw_data/datasetA_IGFBP3_P6.csv \
vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl \
8
The output file is saved in: vae/out/gmm
python diffusion/eval_sample.py --check_path diffusion/out/datasetA/aptadiff_z/.../ \
--data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
--eval_path vae/data/sampling_data/datasetA_IGFBP3/gmm_seq.csv \
--samples 8 \
--length 36
The sequences generated by sampling are saved in results/datasetA_IGFBP3/samples/datasetA_gmm.txt
python vae/scripts/bo.py data/raw_data/datasetA_IGFBP3_P6.csv \
vae/out/trained_vae/datasetA_IGFBP3_P6_vae.mdl \
data/spr_data/datasetA_IGFBP3_gmm_RU \
8
The output file is saved in: vae/out/bo
Then run diffusion/eval_sample.py to get the BO-optimized sequences:
python diffusion/eval_sample.py --check_path results/datasetA_IGFBP3 \
--data_path data/diffusion_data/datasetA_IGFBP3_x_z.csv \
--eval_path vae/data/sampling_data/datasetA_IGFBP3/bo_seq.csv \
--samples 8 \
--length 36
The sequences generated by sampling are saved in results/datasetA_IGFBP3/samples/datasetA_bo.txt
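Before ordering or analyzing generated sequences, a quick sanity check can catch truncated or malformed lines. The sketch below assumes the sample files hold one sequence per line, use a DNA alphabet of A, C, G, and T, and have the length passed via `--length 36`; none of these assumptions has been verified against the actual output, and `check_sequences` is a hypothetical helper.

```python
def check_sequences(lines, length=36, alphabet="ACGT"):
    """Split lines into (valid, rejected) by expected length and alphabet."""
    valid, rejected = [], []
    for line in lines:
        seq = line.strip().upper()
        if not seq:
            continue  # skip blank lines
        if len(seq) == length and set(seq) <= set(alphabet):
            valid.append(seq)
        else:
            rejected.append(seq)
    return valid, rejected

# Made-up lines for illustration only; these are not real model output.
demo = ["ACGT" * 9, "ACGTN" + "A" * 31, "ACGT", ""]
valid, rejected = check_sequences(demo)
```

To check a real file, pass an open file handle, for example `check_sequences(open("results/datasetA_IGFBP3/samples/datasetA_bo.txt"))`.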