Training UniversalEPI

Aayush Grover edited this page Mar 24, 2025 · 4 revisions

Note: Two sets of trained UniversalEPI models can be found on Zenodo. One set is trained on GM12878 and K562 cell lines whereas the other is trained on IMR90 and HepG2 cell lines. By design, UniversalEPI can make accurate predictions for unseen cell types. To make predictions using a pre-trained model, please see here.

Step 1: Data Preprocessing

ATAC-seq Data Processing

We need to do cross-cell-type normalization using the GM12878 cell line as the reference.

  1. Ensure that GM12878 ATAC-seq bigwig is present in data/atac/raw

  2. Enter the directory with preprocessing scripts

cd preprocessing/atac
  3. Run the normalization script

    • If you have bam files as input
    python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam 
    
    • If you have bigwig and peak files as input
    python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed
    

    where <cell_line1> and <cell_line2> are the names of your cell lines/conditions.

This will create the normalized bigwig files data/atac/raw/<cell_line1>_normalized.bw, data/atac/raw/<cell_line2>_normalized.bw and deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed, data/atac/raw/<cell_line2>_dedup.bed.

While the above example shows how to run the script when you have two cell lines, the script can be run for any number of cell lines.
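Because the script accepts any number of cell lines, the command can also be assembled programmatically. The following is a minimal sketch, assuming it is run from preprocessing/atac and that the input files follow the <cell_line>.bam, <cell_line>.bigWig, and <cell_line>.bed naming shown above. The helper name build_normalize_cmd is hypothetical and not part of UniversalEPI; only the flags shown above are used.

```python
import shlex


def build_normalize_cmd(cell_lines, raw_dir="../../data/atac/raw", use_bam=True):
    """Assemble the normalize_atac.py command for any number of cell lines.

    Hypothetical helper (not part of UniversalEPI). It only uses the flags
    documented above: -p, --input_bam, --input_bw, and --input_bed.
    """
    raw_dir = raw_dir.rstrip("/")
    cmd = ["python", "normalize_atac.py", "-p", raw_dir + "/"]
    if use_bam:
        # One BAM file per cell line.
        cmd.append("--input_bam")
        cmd += [f"{raw_dir}/{name}.bam" for name in cell_lines]
    else:
        # One bigWig file and one peak (BED) file per cell line, in the same order.
        cmd.append("--input_bw")
        cmd += [f"{raw_dir}/{name}.bigWig" for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [f"{raw_dir}/{name}.bed" for name in cell_lines]
    return cmd


if __name__ == "__main__":
    # Example cell-line names; replace them with your own.
    print(shlex.join(build_normalize_cmd(["K562", "HepG2", "IMR90"])))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the script, or printed and pasted into a shell.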


Hi-C Data Processing

  1. Enter the directory with preprocessing scripts
cd preprocessing/hic
  2. (Optional) Convert .hic or .cool files to pairwise interactions

    • To obtain the pairwise interactions from ../../data/hic/K562.hic (as an example) at 5Kb resolution with ICE normalization:
    ./hic2sparse.sh ../../data/hic/K562.hic ../../data/hic/k562 5000 --ice
    

    Remove --ice if the input file is already ICE-normalized.

    • Similarly, to obtain the pairwise interactions from ../../data/hic/K562.cool (as an example) with ICE normalization:
    ./cool2sparse.sh ../../data/hic/K562.cool ../../data/hic/k562 --ice
    

The output pairwise interaction files (one per chromosome) will be stored in ../../data/hic/k562/raw_iced/. Each file is tab-separated with three columns: pos1, pos2, and hic_score.

  3. Cross-cell-type normalization
    • Ensure that the GM12878 raw_iced files are placed in ../../data/hic/gm12878/raw_iced. These will be used as the reference for normalization.
    • Ensure that for all the cell lines of interest, pairwise interaction files for each chromosome are placed in ../../data/hic/<cell_line>/raw_iced/chr<chrom_number>_raw.bed
    • Apply normalization
    python normalize_hic.py --cell_lines <cell_line> --data_dir ../../data/hic
    
    By default, this script normalizes all autosomes, assumes a Hi-C resolution of 5Kb, and uses gm12878 as the reference cell line. These can be modified using appropriate flags.
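Before running the normalization, it can help to confirm that every expected per-chromosome file is in place and parses as three tab-separated columns. The following is a minimal sketch under stated assumptions: the helper names missing_raw_files and read_interactions are hypothetical, the files are assumed to have no header row, and positions are assumed to be integers. Adjust the parsing if your files differ.

```python
import csv
from pathlib import Path


def missing_raw_files(data_dir, cell_line, chroms=range(1, 23)):
    """Return the expected chr<N>_raw.bed files that are absent for a cell line.

    Hypothetical helper; the layout follows the paths described above:
    <data_dir>/<cell_line>/raw_iced/chr<chrom_number>_raw.bed
    """
    raw_dir = Path(data_dir) / cell_line / "raw_iced"
    expected = [raw_dir / f"chr{chrom}_raw.bed" for chrom in chroms]
    return [path for path in expected if not path.is_file()]


def read_interactions(path):
    """Yield (pos1, pos2, hic_score) tuples from a pairwise interaction file.

    Assumption: no header row, integer positions, numeric score.
    """
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != 3:
                raise ValueError(f"{path}:{line_no}: expected 3 columns, got {len(row)}")
            yield int(row[0]), int(row[1]), float(row[2])


if __name__ == "__main__":
    # Check the reference cell line and one cell line of interest.
    for name in ["gm12878", "k562"]:
        for path in missing_raw_files("../../data/hic", name):
            print(f"missing: {path}")
```

Checking the reference cell line (gm12878) alongside the cell lines of interest catches a missing chromosome file before normalize_hic.py is launched.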

Dataset Creation

Combine the ATAC-seq and Hi-C data to extract the targets corresponding to ATAC peaks for each training cell line:

python ./preprocessing/prepare_target_data.py --cell_line <cell_line> --atac_bed_path ./data/atac/raw/<cell_line>_dedup.bed --hic_data_dir ./data/hic/

This also saves the updated ATAC-seq peaks, with 10% pseudopeaks added, at data/atac/raw/<cell_line>_dedup_neg.bed. By default, the script runs for all autosomes (chr1-22), assumes a Hi-C resolution of 5Kb, and uses the hg38 genome assembly. These defaults can be changed using the appropriate flags.
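Since the target extraction is run once per training cell line, the per-cell-line commands can be generated in a loop. The following is a minimal sketch, assuming it is run from the repository root and that the same <cell_line> name is used for the ATAC-seq peak file and the Hi-C directory. The helper name build_target_cmd and the example cell-line names are hypothetical; only the flags shown above are used.

```python
import shlex


def build_target_cmd(cell_line):
    """Assemble the prepare_target_data.py command for one training cell line.

    Hypothetical helper (not part of UniversalEPI); the paths mirror the
    command shown above.
    """
    return [
        "python", "./preprocessing/prepare_target_data.py",
        "--cell_line", cell_line,
        "--atac_bed_path", f"./data/atac/raw/{cell_line}_dedup.bed",
        "--hic_data_dir", "./data/hic/",
    ]


if __name__ == "__main__":
    # Example training cell lines; replace them with your own.
    for name in ["gm12878", "k562"]:
        print(shlex.join(build_target_cmd(name)))
```

Each printed line can be run in a shell as-is, or the list can be passed to subprocess.run(cmd, check=True).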


Step 2: Train Stage 1

  1. Create a new config file for each training cell line or condition in Stage1/. See ./Stage1/ for more details.
  2. Set the training cell lines in ./Stage1/configs/datamodule/validation/cross_cell.yaml, then train the Stage 1 model with
python ./Stage1/train.py

Step 3: Extract Genomic Features from Stage 1

For each training cell line <cell_line>,

  1. Ensure that <cell_line> is the prediction cell line in ./Stage1/configs/datamodule/validation/cross_cell.yaml
  2. Store the genomic inputs for <cell_line>
python ./Stage1/store_inputs.py --cell_line <cell_line>

This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml.


Step 4: Train Stage 2