Training UniversalEPI
Note: Two sets of trained UniversalEPI models can be found on Zenodo. One set is trained on GM12878 and K562 cell lines whereas the other is trained on IMR90 and HepG2 cell lines. By design, UniversalEPI can make accurate predictions for unseen cell types. To make predictions using a pre-trained model, please see here.
We need to do cross-cell-type normalization using the GM12878 cell line as the reference.
- Ensure that the GM12878 ATAC-seq bigwig is present in data/atac/raw
- Enter the directory with preprocessing scripts
cd preprocessing/atac
- Run the normalization script
  - If you have bam files as input
python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam
  - If you have bigwig and peak files as input
python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed
where <cell_line1> and <cell_line2> are the names of your cell lines/conditions.
This will create the normalized bigwig files data/atac/raw/<cell_line1>_normalized.bw, data/atac/raw/<cell_line2>_normalized.bw and deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed, data/atac/raw/<cell_line2>_dedup.bed.
While the above example shows how to run the script when you have two cell lines, the script can be run for any number of cell lines.
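Because the script accepts any number of cell lines, the command can be assembled programmatically. The following is a minimal sketch, assuming only the flags shown above (`-p`, `--input_bam`, `--input_bw`, `--input_bed`) and the file naming `<cell_line>.bam`, `<cell_line>.bigWig`, and `<cell_line>.bed`. The helper name `build_normalize_cmd` is hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath


def build_normalize_cmd(cell_lines, raw_dir="../../data/atac/raw", use_bam=False):
    """Return the argv list for normalize_atac.py (hypothetical helper)."""
    if not cell_lines:
        raise ValueError("at least one cell line is required")
    raw = PurePosixPath(raw_dir)
    cmd = ["python", "normalize_atac.py", "-p", f"{raw}/"]
    if use_bam:
        # One bam file per cell line.
        cmd.append("--input_bam")
        cmd += [str(raw / f"{name}.bam") for name in cell_lines]
    else:
        # One bigwig and one peak file per cell line, listed in the same order.
        cmd.append("--input_bw")
        cmd += [str(raw / f"{name}.bigWig") for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [str(raw / f"{name}.bed") for name in cell_lines]
    return cmd


if __name__ == "__main__":
    # Print the command for three cell lines (names chosen for illustration).
    print(" ".join(build_normalize_cmd(["K562", "IMR90", "HepG2"])))
```

The returned list can be passed to `subprocess.run` from inside preprocessing/atac, or printed and pasted into a shell.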
- Enter the directory with preprocessing scripts
cd preprocessing/hic
- (Optional) Convert .hic or .cool files to pairwise interactions
  - To obtain the pairwise interactions from ../../data/hic/K562.hic (as an example) at 5Kb resolution with ICE normalization:
./hic2sparse.sh ../../data/hic/K562.hic ../../data/hic/k562 5000 --ice
Remove --ice if the input file is already ICE-normalized.
  - Similarly, to obtain the pairwise interactions from ../../data/hic/K562.cool (as an example) with ICE normalization:
./cool2sparse.sh ../../data/hic/K562.cool ../../data/hic/k562 --ice
The output pairwise interaction files (one per chromosome) will be stored in ../../data/hic/k562/raw_iced/. Each file is tab-separated and has three columns: pos1, pos2, and hic_score.
- Cross-cell-type normalization
  - Ensure that the GM12878 raw_iced files are placed in ../../data/hic/gm12878/raw_iced. These will be used as the reference for normalization.
  - Ensure that for all the cell lines of interest, pairwise interaction files for each chromosome are placed in ../../data/hic/<cell_line>/raw_iced/chr<chrom_number>_raw.bed
  - Apply normalization
python normalize_hic.py --cell_lines <cell_line> --data_dir ../../data/hic
By default, this script normalizes all autosomes, assumes a Hi-C resolution of 5Kb, and uses gm12878 as the reference cell line. These can be modified using appropriate flags.
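Before running the normalization, it can help to confirm that the per-chromosome files are in place and have the three-column layout described above. The following is a minimal sketch with several assumptions: the helper name `check_raw_iced` is hypothetical, positions are assumed to be integers, a header line is treated as optional, and the GM12878 reference files are assumed to follow the same chr<chrom_number>_raw.bed naming as the other cell lines.

```python
import csv
from pathlib import Path


def check_raw_iced(hic_dir, cell_line, chromosomes=range(1, 23), max_rows=100):
    """Return a list of problems found in <hic_dir>/<cell_line>/raw_iced."""
    problems = []
    raw_iced = Path(hic_dir) / cell_line / "raw_iced"
    for chrom in chromosomes:
        path = raw_iced / f"chr{chrom}_raw.bed"
        if not path.is_file():
            problems.append(f"missing: {path}")
            continue
        with path.open(newline="") as handle:
            for i, row in enumerate(csv.reader(handle, delimiter="\t")):
                if i >= max_rows:
                    break  # only sample the top of each file
                if not row:
                    continue  # ignore blank lines
                if i == 0 and not row[0].isdigit():
                    continue  # assumed optional header line
                if len(row) != 3:
                    problems.append(f"{path}: line {i + 1} has {len(row)} columns")
                    break
                try:
                    # Expect pos1, pos2 as integers and hic_score as a float.
                    int(row[0]), int(row[1]), float(row[2])
                except ValueError:
                    problems.append(f"{path}: line {i + 1} is not numeric")
                    break
    return problems


if __name__ == "__main__":
    # gm12878 is the default reference; k562 stands in for a cell line of interest.
    for name in ["gm12878", "k562"]:
        for problem in check_raw_iced("../../data/hic", name):
            print(problem)
```

The default chromosome range covers the autosomes (chr1-22), matching the default of the normalization script. An empty result means every expected file exists and its sampled rows parse.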
Combine ATAC-seq and Hi-C to extract targets corresponding to ATAC peaks for each training cell line
python ./preprocessing/prepare_target_data.py --cell_line <cell_line> --atac_bed_path ./data/atac/raw/<cell_line>_dedup.bed --hic_data_dir ./data/hic/
This also saves the updated ATAC-seq peaks at data/atac/raw/<cell_line>_dedup_neg.bed with 10% pseudopeaks added.
The above script will run for all autosomes (chr1-22) by default. The Hi-C resolution is assumed to be 5Kb. The hg38 genome version is considered by default. These can be modified using appropriate flags.
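Since this step is repeated for each training cell line, the commands can be generated in a loop. The following is a minimal sketch that uses only the flags and paths shown above; the helper name `target_data_commands` is hypothetical, and the paths assume the commands are run from the repository root.

```python
import shlex


def target_data_commands(cell_lines):
    """Return (command, expected_peak_file) pairs, one per training cell line."""
    jobs = []
    for cell_line in cell_lines:
        argv = [
            "python", "./preprocessing/prepare_target_data.py",
            "--cell_line", cell_line,
            "--atac_bed_path", f"./data/atac/raw/{cell_line}_dedup.bed",
            "--hic_data_dir", "./data/hic/",
        ]
        # The script also writes the peak file with pseudopeaks added.
        expected = f"data/atac/raw/{cell_line}_dedup_neg.bed"
        jobs.append((shlex.join(argv), expected))
    return jobs


if __name__ == "__main__":
    # Cell line names chosen for illustration.
    for command, expected in target_data_commands(["gm12878", "k562"]):
        print(command)
        print(f"  expected output: {expected}")
```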
- Create a new config file for each training cell line or condition in Stage1/. See ./Stage1/ for more details.
- After setting the training cell lines in ./Stage1/configs/datamodule/validation/cross_cell.yaml, train the Stage 1 model with
python ./Stage1/train.py
For each training cell line <cell_line>:
- Ensure that <cell_line> is the prediction cell line in Stage1/configs/datamodule/validation/cross_cell.yaml
- Store the genomic inputs for <cell_line>
python ./Stage1/store_inputs.py --cell_line <cell_line>
This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml
- Ensure that the genomic data (./data/stage1_outputs/predict_{cell_line}) and Hi-C paths (./data/hic/) in ./Stage2/configs/configs.yaml are correct. Then run
python ./Stage2/main.py --config_dir ./Stage2/configs/configs.yaml --mode train
- If npz files are already generated using create_dataset.py and merge_dataset.py, the data paths can be specified in ./Stage2/configs/configs.yaml.