Training UniversalEPI

Note: Two sets of trained UniversalEPI models are available on Zenodo. One set is trained on the GM12878 and K562 cell lines, and the other on the IMR90 and HepG2 cell lines. By design, UniversalEPI can make accurate predictions for unseen cell types. To make predictions using a pre-trained model, please see here.

Step 1: Data Preprocessing

ATAC-seq Data Processing

ATAC-seq signals must be normalized across cell types, using the GM12878 cell line as the reference.

Ensure that the GM12878 ATAC-seq bigwig file is present in data/atac/raw
Enter the directory with preprocessing scripts

cd preprocessing/atac

Run the normalization script

If you have bam files as input

python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam

If you have bigwig and peak files as input

python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed

where <cell_line1> and <cell_line2> are the names of your cell lines/conditions.

This will create the normalized bigwig files data/atac/raw/<cell_line1>_normalized.bw, data/atac/raw/<cell_line2>_normalized.bw and deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed, data/atac/raw/<cell_line2>_dedup.bed.

While the above example shows how to run the script when you have two cell lines, the script can be run for any number of cell lines.
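Because the script accepts any number of cell lines, the argument list can be assembled programmatically. The following is a minimal sketch and is not part of the repository: the helper name `build_normalize_cmd` and the example cell-line names are hypothetical, and only the flags shown above (`-p`, `--input_bam`, `--input_bw`, `--input_bed`) are used. It assumes the inputs are named `<cell_line>.bam`, or `<cell_line>.bigWig` plus `<cell_line>.bed`, as in the commands above.

```python
from pathlib import PurePosixPath


def build_normalize_cmd(cell_lines, raw_dir="../../data/atac/raw", use_bam=True):
    """Return the argv list for normalize_atac.py (hypothetical helper).

    Assumes inputs are named <cell_line>.bam, or <cell_line>.bigWig plus
    <cell_line>.bed, inside raw_dir, as in the commands above.
    """
    raw = PurePosixPath(raw_dir)
    cmd = ["python", "normalize_atac.py", "-p", f"{raw}/"]
    if use_bam:
        cmd.append("--input_bam")
        cmd += [str(raw / f"{name}.bam") for name in cell_lines]
    else:
        cmd.append("--input_bw")
        cmd += [str(raw / f"{name}.bigWig") for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [str(raw / f"{name}.bed") for name in cell_lines]
    return cmd


if __name__ == "__main__":
    # Example with three placeholder cell-line names.
    print(" ".join(build_normalize_cmd(["K562", "IMR90", "HepG2"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from within preprocessing/atac to launch the script.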

Hi-C Data Processing

Enter the directory with preprocessing scripts

cd preprocessing/hic

(Optional) Convert .hic or .cool files to pairwise interactions
- To obtain the pairwise interactions from ../../data/hic/K562.hic (as an example) at 5Kb resolution with ICE normalization, run:
```
./hic2sparse.sh ../../data/hic/K562.hic ../../data/hic/k562 5000 --ice
```
Remove --ice if the input file is already ICE-normalized.
- Similarly, to obtain the pairwise interactions from ../../data/hic/K562.cool (as an example) with ICE normalization, run:
```
./cool2sparse.sh ../../data/hic/K562.cool ../../data/hic/k562 --ice
```

The output pairwise interaction files (one per chromosome) will be stored in ../../data/hic/k562/raw_iced/. Each file will be tab-separated and have three columns: pos1, pos2, and hic_score.
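To make the three-column layout concrete, here is a minimal sketch of a reader for one such file. It is not part of the repository. The function name `read_pairwise_interactions` is hypothetical, and the code assumes that positions are integers, that scores are floats, and that an optional header row may be present. The values in the example are made up.

```python
import csv
import io


def read_pairwise_interactions(handle):
    """Yield (pos1, pos2, hic_score) tuples from a tab-separated file.

    Assumption: positions are integers and scores are floats. A first row
    whose first field is not numeric is treated as a header and skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row_number, row in enumerate(reader):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        if row_number == 0 and not row[0].lstrip("-").isdigit():
            continue  # header line such as pos1, pos2, hic_score
        yield int(row[0]), int(row[1]), float(row[2])


if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = "5000\t10000\t12.5\n5000\t15000\t3.25\n"
    for record in read_pairwise_interactions(io.StringIO(example)):
        print(record)
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to the function.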

Cross-cell-type normalization
- Ensure that the GM12878 raw_iced files are placed in ../../data/hic/gm12878/raw_iced. These will be used as the reference for normalization.
- Ensure that for all the cell lines of interest, pairwise interaction files for each chromosome are placed in ../../data/hic/<cell_line>/raw_iced/chr<chrom_number>_raw.bed
- Apply normalization
```
python normalize_hic.py --cell_lines <cell_line> --data_dir ../../data/hic
```
By default, this script normalizes all autosomes, assumes a Hi-C resolution of 5Kb, and uses gm12878 as the reference cell line. These can be modified using appropriate flags.
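Before running the normalization, it can help to confirm that the expected per-chromosome files are in place. The sketch below is only an illustration: the helper name `missing_raw_iced_files` is hypothetical, and it assumes that the GM12878 reference files follow the same chr<chrom_number>_raw.bed naming as the other cell lines. It checks autosomes 1 to 22, matching the script's default.

```python
from pathlib import Path


def missing_raw_iced_files(data_dir, cell_line, chromosomes=range(1, 23)):
    """Return expected chr<N>_raw.bed paths that do not exist."""
    raw_dir = Path(data_dir) / cell_line / "raw_iced"
    expected = [raw_dir / f"chr{chrom}_raw.bed" for chrom in chromosomes]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Check the reference and one placeholder cell line.
    for name in ["gm12878", "k562"]:
        missing = missing_raw_iced_files("../../data/hic", name)
        status = "ok" if not missing else f"{len(missing)} file(s) missing"
        print(f"{name}: {status}")
```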

Dataset Creation

Combine the ATAC-seq and Hi-C data to extract the targets corresponding to ATAC peaks for each training cell line:

python ./preprocessing/prepare_target_data.py --cell_line <cell_line> --atac_bed_path ./data/atac/raw/<cell_line>_dedup.bed --hic_data_dir ./data/hic/

This also saves the updated ATAC-seq peaks, with 10% pseudopeaks added, at data/atac/raw/<cell_line>_dedup_neg.bed. By default, the script runs on all autosomes (chr1-22), assumes a Hi-C resolution of 5Kb, and uses the hg38 genome assembly. These defaults can be modified using the appropriate flags.
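Since the command has to be repeated for every training cell line, a small loop can generate the invocations. This is a sketch under stated assumptions, not a repository script: `prepare_target_cmd` is a hypothetical helper, the cell-line names are placeholders, and only the flags shown in the command above are used.

```python
import shlex


def prepare_target_cmd(cell_line):
    """Return the prepare_target_data.py argv for one cell line (hypothetical helper)."""
    return [
        "python", "./preprocessing/prepare_target_data.py",
        "--cell_line", cell_line,
        "--atac_bed_path", f"./data/atac/raw/{cell_line}_dedup.bed",
        "--hic_data_dir", "./data/hic/",
    ]


if __name__ == "__main__":
    training_cell_lines = ["gm12878", "k562"]  # placeholder names
    for name in training_cell_lines:
        # Print the command; pass the list to subprocess.run(..., check=True)
        # from the repository root to actually execute it.
        print(shlex.join(prepare_target_cmd(name)))
```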

Step 2: Train Stage 1

Create a new config file for each training cell line or condition in Stage1/. See ./Stage1/ for more details.
After setting the training cell lines in ./Stage1/configs/datamodule/validation/cross_cell.yaml, train the Stage 1 model with

python ./Stage1/train.py

Step 3: Extract Genomic Features from Stage 1

For each training cell line <cell_line>,

Ensure that <cell_line> is set as the prediction cell line in ./Stage1/configs/datamodule/validation/cross_cell.yaml
Store the genomic inputs for each cell line <cell_line>

python ./Stage1/store_inputs.py --cell_line <cell_line>

This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml
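A quick way to confirm that the extraction produced output is to list the stored files for each cell line. The following sketch is illustrative only: `list_stage1_outputs` is a hypothetical helper, the cell-line names are placeholders, and the code assumes the files sit directly in predict_<cell_line>/ with a `.parquet` suffix.

```python
from pathlib import Path


def list_stage1_outputs(cell_line, root="data/stage1_outputs"):
    """Return sorted parquet files under predict_<cell_line>/.

    Assumption: the stored files carry a .parquet suffix.
    """
    out_dir = Path(root) / f"predict_{cell_line}"
    return sorted(out_dir.glob("*.parquet"))


if __name__ == "__main__":
    for name in ["gm12878", "k562"]:  # placeholder names
        files = list_stage1_outputs(name)
        print(f"{name}: {len(files)} parquet file(s)")
```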

Step 4: Train Stage 2

Ensure that the genomic data paths (./data/stage1_outputs/predict_<cell_line>) and the Hi-C path (./data/hic/) in ./Stage2/configs/configs.yaml are correct. Then run
```
python ./Stage2/main.py --config_dir ./Stage2/configs/configs.yaml --mode train
```
If npz files are already generated using create_dataset.py and merge_dataset.py, the data paths can be specified in ./Stage2/configs/configs.yaml.
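The directories that Stage 2 reads from can be verified before training starts. This is a minimal sketch that only checks the default locations named above. It does not parse configs.yaml, `check_stage2_inputs` is a hypothetical helper, and the cell-line names are placeholders.

```python
from pathlib import Path


def check_stage2_inputs(cell_lines, repo_root="."):
    """Return a list of required directories that are missing."""
    root = Path(repo_root)
    required = [root / "data" / "hic"]
    required += [
        root / "data" / "stage1_outputs" / f"predict_{name}" for name in cell_lines
    ]
    return [path for path in required if not path.is_dir()]


if __name__ == "__main__":
    missing = check_stage2_inputs(["gm12878", "k562"])  # placeholder names
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all Stage 2 input directories found")
```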
