- developed by: Elliot Chan
- contact: e224chan@uwaterloo.ca
A sophisticated, ML-enhanced, end-to-end gene variant pathogenicity prediction and repair system, aiming to accelerate gene variant classification and gene therapy target identification via intelligent, guided mutations.
GEM (Gene Edit Machine) is a machine learning-powered platform designed to classify and optimize gene variants for reduced pathogenicity. What makes this a GEM?
- Integrates ClinVar and GRCh38 human genome assembly data to build detailed, context-inclusive sequences
- Extracts 500+ features using my original DNA sequence fingerprinting modules: CompositeDNA, CompositeProt, DNAMatrix, ProtMatrix
- Uses biochemical data and research-backed feature extraction at both the DNA and protein level
- Detects and extracts regulatory-region disruptions in various domains at both the DNA and protein level
- Tracks optimized state-path and domain-composition changes across multiple genomic scales
- Trains optimized XGBoost models to classify benign and pathogenic mutation variants
- Suggests guided edits to reduce variant pathogenicity
It includes:
- LookingGlass: Variant classifier based on custom FASTA inputs
- ReGen: Performs a combination of guided edits to increase the likelihood of benign classification, with intelligently placed stochastic mutations to escape performance plateaus
- Go to the Releases tab on the right-hand side of this GitHub page
- Download the latest release as a .zip file
- Unzip the file on your computer
- Open the folder in your preferred Python IDE (PyCharm, VS Code, or others)
- In your Python terminal, run `pip install` followed by the packages listed in the Prerequisites and Dependencies section of this README, e.g. `pip install pandas numpy scikit-learn xgboost biopython tqdm optuna pyfaidx`
Now you're ready to set up the project's database.
First, download ClinVar's variant data from the link below.
- Go to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ then click on "variant_summary.txt.gz"
Then download the GRCh38 (GCF) human genome assembly file along with its assembly report.
- Go to https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/ then download the following:
- GCF_000001405.40_GRCh38.p14_genomic.fna.gz
- GCF_000001405.40_GRCh38.p14_assembly_report.txt
Move all of those files into the 'database' folder. Now you are ready for your first run.
All of this is in 00_main.py.

```python
from src.gem.a01_KeyStone import KeyStone
from src.gem.a03_LookingGlass import LookingGlass
from src.gem.a04_ReGen import ReGen

model_name = "TestRun_Model"

# all functions with a name guard need to be called alone

# [1] sourced data extraction and processing
# - ensure the required files mentioned at the start are downloaded and in the database/datacore folder
def keystone_dataframe_processing(ml_modelname=model_name):
    test = KeyStone(ml_modelname)
    test.naive_dataframe()
    test.decompress_genome()
    test.context_dataframe()

# [2] feature extraction / engineering
# note: call each of these sequentially; they each use a multiprocessing
# if __name__ == "__main__" guard like so:
def keystone_extract_proteins(model=model_name):
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.protein_extraction()
            if success:
                print("[Protein Sequence Extraction Completed]")
    except Exception as e:
        print(f"Mutation Fingerprint Generation Failed: {e}")
        raise

def ks_dna_profile(model=model_name):
    # ~14 minutes
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.generate_dna_profile()
            if success:
                print("[DNA Profile Completed]")
    except Exception as e:
        print(f"DNA Fingerprint Generation Failed: {e}")
        raise

def ks_prot_profile(model=model_name):
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.generate_prot_profile()
            if success:
                print("[Protein Profile Completed]")
    except Exception as e:
        print(f"Protein Fingerprint Generation Failed: {e}")
        raise

def ks_dnamotif_profile(model=model_name):
    # ~1.5 hours
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.generate_dnapwm_profile()
            if success:
                print("[PWM Profile Completed]")
    except Exception as e:
        print(f"PWM Fingerprint Generation Failed: {e}")
        raise

def ks_domain_profile(model=model_name):
    # ~1 hour 50 minutes
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.generate_hmm_profile()
            if success:
                print("[HMM-domain Profile Completed]")
    except Exception as e:
        print(f"HMM-domain Fingerprint Generation Failed: {e}")
        raise

def ks_aamotif_profile(model=model_name):
    # ~30 minutes
    test = KeyStone(model)
    try:
        if __name__ == "__main__":
            success = test.generate_aapwm_profile()
            if success:
                print("[PWM Profile Completed]")
    except Exception as e:
        print(f"PWM Fingerprint Generation Failed: {e}")
        raise

# [3] Feature dataframe merging
def keystone_merge(ml_modelname=model_name):
    test = KeyStone(ml_modelname)
    try:
        if __name__ == "__main__":
            success = test.get_final_dataframe()
            if success:
                print("[Fingerprint DataFrame Generation Completed]")
    except Exception as e:
        print(f"Mutation Fingerprint Generation Failed: {e}")
        raise

# [4] Model training and optimization
def keystone_model_training(ml_modelname=model_name):
    # a full run takes around 6 hours - the latest optimized hyperparameters
    # are saved in the model_name_stats.txt files
    test = KeyStone(ml_modelname)
    test.train_models()

def LookingGlass_Demo(fasta_filename='test.fasta', ml_model=model_name, output_filename='Screen_test_1'):
    test_module = LookingGlass(fasta_filename, ml_model)
    if __name__ == "__main__":
        test_module.predict_file(output_filename)

def Repair_Gene(pathogenic_gene_file='benchmark_fasta', ml_model=model_name, outfile_name='benchmark_repair_test'):
    if __name__ == "__main__":
        module = ReGen(pathogenic_gene_file, ml_model, outfile_name)
        module.repair()
```

After conducting several hyperparameter runs, I noticed that two configurations gave performance peaks for different use cases. Both models use identical feature sets; the difference is in the XGBoost hyperparameters, optimized via Optuna (settings in the stats.txt files).
[[ClinicalModel]] - this model has the lowest test-set false negatives while retaining high overall performance metrics.
Its greater sensitivity to pathogenic variants makes it the more relevant choice for clinical tasks.

```
Cross Validation Results: Mean ROC AUC: 0.8980, Mean PR AUC: 0.8932
Mean FNs: 5499.40, Mean FPs: 5386.00
ROC AUC: 0.8999
Precision-Recall AUC: 0.8960
Pathogenic F1-Score: 0.8050
Optimal threshold for pathogenic detection: 0.507

Performance with optimal threshold:
              precision    recall  f1-score   support

           0       0.84      0.83      0.83     41774
           1       0.80      0.81      0.81     34814

    accuracy                           0.82     76588
   macro avg       0.82      0.82      0.82     76588
weighted avg       0.82      0.82      0.82     76588

Confusion Matrix:
[[34572  7202]
 [ 6491 28323]]
```
[[DiscriminatorModel]] - highest discrimination power; false negatives and false positives are almost perfectly balanced, while AUC and the other metrics are the highest.

```
Cross Validation Results: Mean ROC AUC: 0.8985, Mean PR AUC: 0.8938
Mean FNs: 5411.20, Mean FPs: 5466.60
ROC AUC: 0.9003
Precision-Recall AUC: 0.8967
Pathogenic F1-Score: 0.8059
Optimal threshold for pathogenic detection: 0.496

Performance with optimal threshold:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84     41774
           1       0.81      0.81      0.81     34814

    accuracy                           0.82     76588
   macro avg       0.82      0.82      0.82     76588
weighted avg       0.82      0.82      0.82     76588

Confusion Matrix:
[[35023  6751]
 [ 6747 28067]]
```
Users must provide FASTA sequences to gene_databank in this format:

```
>chr'X'| ref / alt / flank1 / flank2 | gene_name
```

Remember that flanking context sequences (ideally 500 bp) must be provided.
If the flanking sequences are under 500 bp, nothing will break, but the lack of context will most likely bottleneck the model's prediction power.
BenchmarkGene1 is the first pathogenic gene variant in our dataset (ID NC_000007.14).
I have trimmed it down for this README.md; the full sequence is in the file.

```
>chr9|ref|BenchmarkGene1
GCTGCTGGACCTGCC
>chr9|alt|BenchmarkGene1
G
>chr9|flank1|BenchmarkGene1
ACACCTGTAATCCCAGCACCTCGGGAGGCCAAGGCAGGAGGATCTTGAGGCCAGGAGTTCAAGACCAGCC
>chr9|flank2|BenchmarkGene1
CTGCTTGACGGCGGTGCTGGACCTGCAGCTCAGGTGGGCCCCTCACCCTCTGCCAGCGCTGCGTCT
```
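A record in this format can be parsed in a few lines of Python; the `parse_gem_fasta` helper below is an illustrative sketch, not part of GEM:

```python
def parse_gem_fasta(text):
    """Parse GEM's custom FASTA headers (>chr|part|gene) into {gene: {part: seq}}.
    Illustrative helper only: part is one of ref/alt/flank1/flank2."""
    records, name, part = {}, None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            _chrom, part, name = line[1:].split("|")
            records.setdefault(name, {})[part] = ""
        else:
            records[name][part] += line
    return records

example = """>chr9|ref|BenchmarkGene1
GCTGCTGGACCTGCC
>chr9|alt|BenchmarkGene1
G"""
rec = parse_gem_fasta(example)
print(rec["BenchmarkGene1"]["ref"])  # GCTGCTGGACCTGCC
```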
LookingGlass output from the test gene file:

```
Name,Predicted_Class,Prob_Benign,Prob_Pathogenic
benchmarkgene1,1,0.024691403,0.9753086
benchmarkgene2,1,0.027637959,0.97236204
```
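Because the output is plain CSV, downstream filtering is straightforward; the snippet below uses Python's standard `csv` module (not a GEM API) to pull out the pathogenic calls:

```python
import csv
import io

# Sample rows in the CSV format LookingGlass writes
csv_text = """Name,Predicted_Class,Prob_Benign,Prob_Pathogenic
benchmarkgene1,1,0.024691403,0.9753086
benchmarkgene2,1,0.027637959,0.97236204"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
pathogenic = [r["Name"] for r in rows if r["Predicted_Class"] == "1"]
print(pathogenic)  # ['benchmarkgene1', 'benchmarkgene2']
```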
Users need to place a FASTA file of the same custom format in the ReGen_input folder.
IMPORTANT NOTE
Due to substantial feature-engineering improvements between versions, ReGen optimization runs have changed significantly.
The example below does not include benign-threshold variants, as ReGen's current iteration was unable to find any, unlike previous versions.
Regulatory disruption detection, domain boundary shifts, and repeat instability patterns that previous models missed are now captured. This means that "easy" sequence changes once thought to be benign are now correctly identified as potentially disruptive, so the optimization algorithm must navigate a more biologically realistic, and more challenging, mutation landscape to reduce pathogenicity. Although functional, ReGen is still a work in progress and more of a proof of concept than a proper gene-repair algorithm.
```
================================================================================
ReGen Analysis Results: ClinicalModel | benchmark_fasta | benchmarkgene1
================================================================================

ORIGINAL VARIANT STATS:
Ref Sequence: GCTGCTGGACCTGCC
Alt Sequence: G
Benign % chance: 2.469140

ANALYSIS SUMMARY:
|- Starting Score: 0.024691
|- Original Length: 15 bp
|- Final Variants: 1
|- Benign Threshold Variants: 0
|- ReGen config: 20 iterations, 1 copies

MAX BENIGN VARIANTS PER ITERATION:
--------------------------------------------------
Score: 11.542880535125732 | Length: 3 bp
Benign % increase: 9.073740243911743
Sequence:
GTT

Score: 22.84235954284668 | Length: 6 bp
Benign % increase: 20.37321925163269
Sequence:
GTTATC

FINAL VARIANTS:
--------------------------------------------------
Score: 22.84235954284668 | Length: 6 bp
Benign % increase: 20.37321925163269
Sequence:
GTTATC
```

Note: only the top-performing variant from each iteration is listed above; I'm skipping the iterations where there was no improvement.
Multiple copies can be run simultaneously; I chose a single copy for ease of debugging, since it was easier to keep track of one. More copies and iterations will generally yield better results.
All parameters are configurable via 'config.yaml'; this is a demo section:

```yaml
# ========================================
# Mutation Profile Fingerprinting Settings
# ========================================

# [DNA similarity matrix scores]
match: 1
mismatch: -1

# [K-mer size limits]
k_min: 2 # size limits on possible k-mers
k_max: 7

# [Ambiguous string generation number]
variant_num: 3 # how many DNA variants we are willing to consider for ambiguous sequences

# [Functional Consequence Settings]
splice_site_window: 8
motif_window: 4

# [DNA repeat settings]
min_repeat_length: 2 # minimum size of repeat subsequences
max_repeat_length: 10
min_repeats: 5 # minimum times a subsequence is repeated for its region to be considered a microsatellite

# [Protein settings]
protein_substitution_matrix: "blosum62.mat"
minimum_protein_length: 10
protein_gap_penalty: -8
```

- Python 3.7+
- Required packages:
- pandas
- numpy
- scikit-learn
- xgboost
- biopython
- tqdm
- optuna
- pyfaidx
- Chromosome densities
- Nearest-neighbour thermodynamic parameters
- Kozak motif info
- AA molecular weights
- Isoelectric points
- Hydrophobicities
- Half lives
A lengthy technical breakdown of the major modules follows.
KeyStone: Integrating Database Querying, Bioinformatics Analyses and ML pipelines
Extracts genomic context, generates both DNA and Protein-based biochemical + structural features, and trains optimized models for clinical variant interpretation
- Data processing
- Quality Filtering: removes uncertain classifications and focuses on high-confidence Benign/Pathogenic variants
- Context Extraction: configurable flanking sequence windows around variants
- Feature Engineering is done through classes CompositeDNA, CompositeProt, DNAMatrix and ProtMatrix, elaborated on below
- Machine Learning Optimization and Evaluation
- XGB classifiers undergo Optuna hyperparameter tuning (pre-computed optimal settings provided in pipeline)
- Stratified k-fold validation, weighted sampling, threshold optimization
- ROC-AUC, PR-AUC, F1-scores, Confusion Matrices, and Reports on trained model provided in model_folder
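The threshold optimization step mentioned above can be sketched as a simple scan over probability cutoffs; `optimal_threshold` below is illustrative (a dependency-free toy, not GEM's actual routine):

```python
def f1(y_true, y_pred):
    """Pathogenic (positive-class) F1 without external dependencies."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def optimal_threshold(y_true, y_prob):
    """Scan cutoffs 0.01..0.99 and keep the first F1 maximizer."""
    best_t, best_f1 = 0.5, -1.0
    for k in range(1, 100):
        t = k / 100
        score = f1(y_true, [int(p >= t) for p in y_prob])
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

# Toy probabilities from a hypothetical classifier
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.45, 0.55, 0.80, 0.90]
print(optimal_threshold(y_true, y_prob))  # 0.46
```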
CompositeDNA: High-performance comprehensive DNA Analysis
Conducts thorough DNA analyses to build a numerical profile of the following features:
- Sequence Alignment and Comparison
- Needleman-Wunsch global alignment, Smith-Waterman local alignment, Sequence identity and Coverage metrics
- Structural Analysis
- Chromosome density
- Length changes
- Thermodynamic stability analysis using nearest-neighbour parameters
- GC content and CpG island disruption analysis
- Shannon entropy for sequence complexity
- Hairpin structure prediction
- Intron detections
- Mutation Detection
- SNV (Single Nucleotide Variant) identification
- Insertion / Deletion analysis with frameshift detection
- Tandem duplication discovery
- Regulatory Element Analysis
- Transcription factor motif disruptions
- DNA instability motif analysis
- Start/stop codon changes
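Two of the features above, Shannon entropy and GC content, reduce to short computations over the raw sequence; a minimal illustration (not GEM's implementation):

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Shannon entropy in bits per base; low values flag low-complexity sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gc_content(seq):
    """Fraction of G/C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

print(shannon_entropy("ACGTACGT"))  # 2.0 (maximal for a 4-letter alphabet)
print(gc_content("GGCC"))           # 1.0
```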
Most notable additions to CompositeDNA:
HMM Genomic Domain Analysis
Predicts alterations to the genomic regulatory landscape through probabilistic state modeling
The HMM module translates DNA sequences into their optimized state paths using the Viterbi algorithm. From this optimized state path, it tracks:
- Domain Composition Shifts: How variants change the proportion of coding, noncoding, regulatory, transcribable domains
- State Transition Disruptions: Detection of domain boundary changes, Gaussian decay-weighted by proximity to mutation sites
- Multi-Scale Monitoring: Three analysis windows capture local and broad domain effects
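The Viterbi decoding step can be sketched as follows; the two states and all probabilities here are invented for illustration and are not GEM's trained HMM parameters:

```python
import numpy as np

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space)."""
    n, k = len(obs), len(states)
    V = np.full((n, k), -np.inf)        # best log-probability per (step, state)
    back = np.zeros((n, k), dtype=int)  # backpointers for path recovery
    for s in range(k):
        V[0, s] = np.log(start_p[s]) + np.log(emit_p[s][obs[0]])
    for t in range(1, n):
        for s in range(k):
            scores = V[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            V[t, s] = scores[back[t, s]] + np.log(emit_p[s][obs[t]])
    path = [int(np.argmax(V[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

# Toy two-state model: GC-rich "regulatory" vs AT-rich "noncoding" (numbers invented)
states = ["regulatory", "noncoding"]
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit = [{"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # regulatory
        {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}]   # noncoding
print(viterbi("GGCGATATTA", states, start, trans, emit))
# the GC-rich prefix decodes as regulatory, the AT-rich tail as noncoding
```

Domain composition shifts then fall out of comparing the reference and variant state paths.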
Repeat Instability Analysis - (Enhanced Microsatellite Detection)
Goes beyond repeat counting to capture biological mechanisms like expansions, contractions, and slippage-mediated mutagenesis
- Loci Tracking: detects complete loss and gain of repeat regions
- Expansion / Contraction: quantifies bp changes within existing repeat loci
- Weighted Scoring: larger repeat units weighted more heavily
- Composite Scoring: Integrates multiple disruption types to capture synergistic effects
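The basic tandem-repeat detection this module builds on can be sketched with a regex scan; the function below mirrors the config defaults (unit length 2-10 bp, at least 5 repeats) but is illustrative only:

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=10, min_repeats=5):
    """Return (start, repeat_unit, copy_count) for each tandem repeat found."""
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # capture a unit, then require at least (min_repeats - 1) more copies
        pattern = r"(.{%d})\1{%d,}" % (unit_len, min_repeats - 1)
        for m in re.finditer(pattern, seq):
            hits.append((m.start(), m.group(1), len(m.group(0)) // unit_len))
    return hits

seq = "AATTT" + "CAG" * 7 + "GGC"
print(find_microsatellites(seq))  # [(5, 'CAG', 7)]
```

Running this on both the reference and variant sequences, and diffing the loci, gives expansion/contraction counts of the kind described above.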
CompositeProt: High-performance comprehensive Amino-Acid Chain / Protein Analysis
Performs smart ORF detection to uncover the highest-probability protein sequence upon which the following features are extracted:
- Translation and ORF Detection
- Intelligent Open Reading Frame identification
- Kozak motif sequence and ORF length-based scoring for translation efficiency
- Bidirectional sequence analysis (accounts for reverse complement)
- Ambiguous nucleotide handling
- Physicochemical Properties
- Molecular weight calculations
- Net charge at physiological pH
- Isoelectric point determination
- Hydrophobicity changes
- Aliphatic index
- Protein half-life predictions
- Structural Analysis
- Secondary structure propensity analysis (BioPython-aided)
- Proline content assessment
- Global and local protein sequence alignment with the BLOSUM62 matrix
- Feature Interaction Engineering includes:
- Composite scores combining multiple properties
- Interaction terms (e.g. charge-hydrophobicity)
- Relative metrics normalized by protein length
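As an example of one physicochemical feature, a hydrophobicity change can be computed from the published Kyte-Doolittle scale; this is a sketch, not GEM's code:

```python
# Kyte-Doolittle hydropathy scale (standard published values)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in seq) / len(seq)

def hydrophobicity_change(ref_prot, alt_prot):
    """Delta GRAVY between reference and variant protein, as a variant-level feature."""
    return gravy(alt_prot) - gravy(ref_prot)

print(round(hydrophobicity_change("MKLV", "MKLD"), 3))  # -1.925 (V -> D is less hydrophobic)
```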
DNAMatrix - regulatory motif analysis
Captures how variants disrupt DNA-level regulatory architecture through multi-dimensional scoring of transcriptional and post-transcriptional control.
Scans a variety of motifs found in 3 regulatory domains:
- initiation signals: initiator, TATA box, Kozak Sequence
- transcription factors: CTCF, CAAT, SP1, NF-kB, AP1, CREB
- post-transcriptional regulatory elements: 5’/3’ splice sites, branch points, polyadenylation signals
Each motif is scored using position-weight matrices with a Gaussian-weighted distance decay function that assigns higher impact to disruptions near the variant site. The module calculates three complementary metrics across individual motif, domain, and total DNA levels:
- raw motif count changes
- position-weighted score shifts
- cluster composite scores to detect synergistic effects when multiple binding sites are disrupted within close proximity. The distance threshold is configurable (30bp default).
By aggregating individual motif disruptions into domain-level cluster scores, the system identifies coordinated regulatory breakdowns that simple count-based methods would miss.
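PWM scoring with distance decay can be sketched as follows; the TATA-box-like weights and the `sigma` value below are invented for illustration and are not GEM's matrices:

```python
import math

def pwm_score(seq, pwm):
    """Best additive PWM score over all windows; returns (score, position)."""
    w = len(pwm)
    best, pos = float("-inf"), 0
    for i in range(len(seq) - w + 1):
        s = sum(pwm[j].get(seq[i + j], 0.0) for j in range(w))
        if s > best:
            best, pos = s, i
    return best, pos

def gaussian_decay(motif_pos, variant_pos, sigma=15.0):
    """Down-weight motif disruptions far from the variant site."""
    return math.exp(-((motif_pos - variant_pos) ** 2) / (2 * sigma ** 2))

# Toy TATA-box-like PWM: one weighted base per column
pwm = [{"T": 2.0}, {"A": 2.0}, {"T": 2.0}, {"A": 2.0}]
ref = "GGGTATAGGG"
alt = "GGGTACAGGG"  # substitution at index 5 disrupts the motif
ref_score, motif_pos = pwm_score(ref, pwm)
alt_score, _ = pwm_score(alt, pwm)
impact = (ref_score - alt_score) * gaussian_decay(motif_pos, 5)
print(round(impact, 3))  # 1.982
```

A cluster composite score would then sum such impacts for motifs falling within the configurable distance threshold of one another.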
ProtMatrix - protein motif analysis
Extends variant impact assessment to the protein level, analyzing amino acid changes and disruptions in post-translational modification sites and protein-protein interaction domains. Profiles four major regulatory systems:
- phosphorylation sites: CDK, cAMP-PKA, CK2, Tyrosine Kinase
- glycosylation sites: N-glycosylation consensus sequences
- ubiquitination signals: D-box, KEN-box degrons
- interaction domains: SH2 family, 14-3-3 binding sites, pdz domains, nuclear localization and export signals
Since protein sequences are extrapolated without genomic coordinates but from translation likelihood calculations, the module employs PWM scanning combined with regex pattern matching for canonical motifs. Similar cluster composite scoring is implemented, capturing the disruption of spatially clustered PTM sites or binding domains.
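For instance, the N-glycosylation consensus (N, then any residue except P, then S or T) is a canonical regex-scannable motif; the scan below is illustrative, not GEM's module:

```python
import re

# Canonical N-glycosylation sequon: N-X-[S/T] where X is any residue except P
NGLYC = re.compile(r"N[^P][ST]")

def count_sequons(protein):
    """Count N-glycosylation consensus sites in a protein sequence."""
    return len(NGLYC.findall(protein))

# The S -> P change destroys the first sequon while leaving the second intact
print(count_sequons("MKNASGLLNVT"), count_sequons("MKNAPGLLNVT"))  # 2 1
```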
ReGen: Therapeutic Optimization
Utilizes a guided evolutionary algorithm to optimize variants towards benign classification
- Guided Evolution: intelligent sequence optimization using both guided and stochastic mutations
- Multi-copy evolution: configurable parallel optimization of multiple sequence variants for balance between exploration / exploitation
- Adaptive strategy: switches between guided and random mutations based on optimization progress
- Comprehensive tracking: monitors the best variants per iteration and threshold-crossing variants
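The adaptive guided/stochastic loop can be sketched as follows; the scoring function and mutation operators below are simplified stand-ins for GEM's model calls, not ReGen's actual implementation:

```python
import random

def evolve(seq, score_fn, iterations=20, copies=3, seed=0):
    """Hill-climb toward a higher score, falling back to random mutations
    when guided single-base substitutions stop helping (plateau escape)."""
    rng = random.Random(seed)
    population = [seq] * copies
    best = seq
    for _ in range(iterations):
        next_pop = []
        for s in population:
            # guided move: try every single-base substitution, keep the best
            candidates = [s[:i] + b + s[i + 1:]
                          for i in range(len(s)) for b in "ACGT" if b != s[i]]
            guided = max(candidates, key=score_fn)
            if score_fn(guided) > score_fn(s):
                s = guided
            else:
                # stochastic escape: random substitution to leave the plateau
                i = rng.randrange(len(s))
                s = s[:i] + rng.choice("ACGT") + s[i + 1:]
            if score_fn(s) > score_fn(best):
                best = s
            next_pop.append(s)
        population = next_pop
    return best

def gc_fraction(s):
    # toy objective standing in for "probability benign"
    return (s.count("G") + s.count("C")) / len(s)

print(evolve("ATATAT", gc_fraction))  # CCCCCC
```

With more copies, each population member explores independently once the stochastic phase kicks in, which is the exploration/exploitation balance described above.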
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
