NovGMDeep is a deep learning model for genomic selection that predicts phenotypes from novel genomic markers: structural variants (SVs) and transposable elements (TEs). It addresses the high dimensionality of genomic marker data with a one-dimensional deep convolutional neural network whose convolutional, pooling, and dropout layers mitigate overfitting and reduce the complexity introduced by a large number of markers. The model was trained and evaluated on Arabidopsis thaliana and Oryza sativa samples using K-Fold cross-validation. Prediction accuracy is measured with Pearson’s correlation coefficient (PCC), mean absolute error (MAE), and the standard deviation of MAE. Predicted phenotypes correlated more strongly with observed values when the model was trained on SVs and TEs than when it was trained on single-nucleotide polymorphisms (SNPs).
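The evaluation metrics named above can be illustrated with a small, dependency-free sketch. This is not the project's evaluation code: the function names are hypothetical, and it assumes that "standard deviation of MAE" means the spread of per-fold MAE values across the cross-validation folds (population form).

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson's correlation coefficient (PCC) of two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)


def mean_absolute_error(observed, predicted):
    """Mean absolute error (MAE) between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def summarize_folds(fold_results):
    """Summarize a list of (observed, predicted) pairs, one pair per fold.

    Assumption: the standard deviation is taken over per-fold MAE values.
    """
    pccs = [pearson_correlation(o, p) for o, p in fold_results]
    maes = [mean_absolute_error(o, p) for o, p in fold_results]
    mean_mae = sum(maes) / len(maes)
    sd_mae = math.sqrt(sum((m - mean_mae) ** 2 for m in maes) / len(maes))
    return {
        "mean_pcc": sum(pccs) / len(pccs),
        "mean_mae": mean_mae,
        "sd_mae": sd_mae,
    }
```

Calling `summarize_folds` with one `(observed, predicted)` pair per fold returns the mean PCC, the mean MAE, and the spread of MAE across folds.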
Ensure you have Python 3.9 installed. Install required packages using:
pip install -r requirements.txt
- Access the full VCF variant files containing structural variant data for A. thaliana samples from the European Variation Archive (PRJEB38975).
- Download the zipped folder containing CSV files with structural variant data: 'Deletions.csv', 'Duplications.csv', and 'Inversions.csv'.
- Phenotype data for flowering time of A. thaliana samples can be found in the file "FT10_arabi.csv".
- TE genotype file for O. sativa. The three values -1, 0, and 1 correspond to the genotypes '1/1', '0/1', and '0/0', respectively.
- SNP genotype file for O. sativa
- Associated phenotypic values for O. sativa
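The TE genotype encoding described in the list above can be made concrete with a short sketch. The mapping is taken directly from that description; the helper name `decode_genotypes` and the choice to accept string codes (as they might arrive from a CSV reader) are assumptions for illustration only.

```python
# Mapping from numeric TE genotype codes to VCF-style genotype strings,
# as stated in the dataset description.
GENOTYPE_BY_CODE = {-1: "1/1", 0: "0/1", 1: "0/0"}


def decode_genotypes(codes):
    """Translate numeric genotype codes into genotype strings.

    Hypothetical helper: accepts ints or their string form, for example
    values read from a CSV file, and rejects any other code.
    """
    decoded = []
    for raw in codes:
        code = int(raw) if isinstance(raw, str) else raw
        if code not in GENOTYPE_BY_CODE:
            raise ValueError(f"unexpected genotype code: {raw!r}")
        decoded.append(GENOTYPE_BY_CODE[code])
    return decoded
```

For example, `decode_genotypes([-1, 0, 1])` yields the three genotype strings in the order given in the description.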
- Data Preprocessing
Select high-quality genotypes: Refer to quality_based_selection.ipynb.
Prepare data for model input: Refer to data_processing.ipynb.
- Data Split
Split training and testing datasets: Execute sv_data_split.py.
- Train the Model
Train the model: Execute sv_model_train.py.
- Test the Model
Test the trained model: Execute sv_model_train.py.
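To illustrate the idea behind the data-split step and K-Fold cross-validation, here is a minimal, standard-library sketch of how sample indices might be partitioned into folds. It is not the logic of sv_data_split.py; the function name `k_fold_indices`, the default `k=5`, and the seeded shuffle are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples.

    Hypothetical sketch: indices are shuffled once with a fixed seed, then
    cut into k contiguous test folds whose sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds
```

Each sample appears in exactly one test fold, so every genotype and phenotype pair is used for evaluation once and for training in the remaining folds.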
If you use this work in your research, please cite:
@article{sehrawat2023predicting,
title={Predicting phenotypes from novel genomic markers using deep learning},
author={Sehrawat, Shivani and Najafian, Keyhan and Jin, Lingling},
journal={Bioinformatics Advances},
volume={3},
number={1},
pages={vbad028},
year={2023},
publisher={Oxford University Press}
}