soybean-ai

A Framework for Soybean Phenotype Prediction and Salient Loci Mining via Machine Learning and Interpretability Analysis

🌐 Project Website: https://soybean.starhelix.cn/
The trained prediction models are deployed on this website for direct use.

Soybean Genomic Analysis and Prediction Project

This project predicts soybean oil content, protein content, and water-soluble protein content from Single Nucleotide Polymorphism (SNP) data using a range of machine learning (ML) and deep learning (DL) models. It also applies SHAP (SHapley Additive exPlanations) for model interpretability and runs a GWAS (Genome-Wide Association Study) on the SNP data.

Project Structure

The repository contains the following top-level folders and files:
- code
- Data_Train_Test.rar
- PCFigure1.png
- README.md

📌 Main Features

1. ML Folder

Contains machine learning models such as Support Vector Machines, Random Forests, XGBoost, etc.
Models predict soybean oil content, protein content, and water-soluble protein content.
Processes SNP data and trains/validates predictive models.
Supports three SNP encoding strategies:

🔹 Base-type Encoding
- Based on .ped files.
- Uses a 4-dimensional vector per SNP, with positions ordered A, T, C, G.
- Homozygotes: 2 at the position of the respective base; heterozygotes: 1 at each of the two bases; missing genotypes: all zeros.
- Example: AA → (2, 0, 0, 0), CG → (0, 0, 1, 1)
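
The rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's own code: the function names `base_type_encode` and `encode_ped_line` are hypothetical, and it assumes a standard PLINK .ped layout (six leading metadata columns, then space-separated allele pairs, with `0` marking a missing allele).

```python
BASES = ("A", "T", "C", "G")  # position order taken from the examples above


def base_type_encode(allele1, allele2):
    """Encode one diploid genotype as per-base counts (hypothetical helper)."""
    # Any allele outside A/T/C/G (e.g. PLINK's "0") is treated as missing.
    if allele1 not in BASES or allele2 not in BASES:
        return (0, 0, 0, 0)
    vec = [0, 0, 0, 0]
    vec[BASES.index(allele1)] += 1
    vec[BASES.index(allele2)] += 1
    return tuple(vec)


def encode_ped_line(line):
    """Encode every SNP on one .ped line into a flat feature list."""
    alleles = line.split()[6:]  # skip FID, IID, PID, MID, sex, phenotype
    features = []
    for i in range(0, len(alleles) - 1, 2):
        features.extend(base_type_encode(alleles[i], alleles[i + 1]))
    return features


print(base_type_encode("A", "A"))  # homozygote
print(base_type_encode("C", "G"))  # heterozygote
print(encode_ped_line("F1 S1 0 0 0 -9 A A C G 0 0"))
```

Each SNP contributes four features, so a sample with n SNPs becomes a vector of length 4n.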
🔹 Haplo-type Encoding
- Based on .ped files.
- Homozygotes are one-hot encoded.
- Heterozygotes and missing values are encoded as all zeros.
- Example: AA → (1, 0, 0, 0), CG → (0, 0, 0, 0)
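
A minimal sketch of this rule, assuming the same A, T, C, G position order as the base-type examples. The function name `haplo_type_encode` is hypothetical and not taken from the repository.

```python
BASES = ("A", "T", "C", "G")  # assumed position order, as in the examples above


def haplo_type_encode(allele1, allele2):
    """One-hot encode homozygotes; everything else becomes all zeros."""
    vec = [0, 0, 0, 0]
    # Only a homozygote of a known base sets a position to 1.
    if allele1 == allele2 and allele1 in BASES:
        vec[BASES.index(allele1)] = 1
    return tuple(vec)


print(haplo_type_encode("A", "A"))  # homozygote
print(haplo_type_encode("C", "G"))  # heterozygote
print(haplo_type_encode("0", "0"))  # missing genotype
```

Under this encoding a heterozygote and a missing genotype are indistinguishable to the model, which is the main difference from the base-type encoding.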
🔹 Gene-type Encoding
- Based on .raw files.
- Codes each genotype as 0 (major homozygote), 1 (heterozygote), or 2 (minor homozygote).
- These codes are then one-hot encoded before being passed to the model.
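
The one-hot step can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, the .raw row is assumed to follow the usual PLINK additive-recode layout (six metadata columns, then one 0/1/2 value per SNP), and mapping a missing value (`NA`) to all zeros is an assumption, since the handling of missing values is not specified here.

```python
def gene_type_one_hot(code):
    """Map a genotype code "0", "1" or "2" to a 3-dimensional one-hot vector."""
    vec = [0, 0, 0]
    if code in ("0", "1", "2"):
        vec[int(code)] = 1
    # Assumption: anything else (e.g. "NA") stays all zeros.
    return tuple(vec)


def encode_raw_row(row):
    """Encode every SNP column of one .raw data row into a flat feature list."""
    features = []
    for code in row.split()[6:]:  # skip FID, IID, PAT, MAT, SEX, PHENOTYPE
        features.extend(gene_type_one_hot(code))
    return features


print(gene_type_one_hot("0"))  # major homozygote
print(gene_type_one_hot("1"))  # heterozygote
print(gene_type_one_hot("2"))  # minor homozygote
print(encode_raw_row("F1 S1 0 0 0 -9 0 1 2 NA"))
```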

2. DL Folder

Contains deep learning models (e.g., neural networks) for phenotype prediction.
Models are trained on SNP data in the same way as the ML models.
PyTorch version used: 1.12.1

3. SHAP Folder

Uses SHAP to analyze the best-performing Support Vector Regression (SVR) model.
Helps explain feature importance and the contribution of each SNP to predictions.

4. data_processing Folder

Runs a GWAS (Genome-Wide Association Study) on the SNP data.
Includes preprocessing, statistical analysis, and result generation.
Adjust the tool paths in the scripts to match your local setup.
R scripts rely on the CMplot package for visualization.

🧪 Requirements

Python ≥ 3.8
PyTorch == 1.12.1
scikit-learn
xgboost
pandas, numpy
matplotlib, seaborn
shap
R + CMplot package (for GWAS visualization)

📬 Contact

For questions or collaboration, feel free to reach out!

⭐ If you find this project helpful, consider giving it a star!
