minimod is a simple tool for handling base modifications. It takes an aligned BAM file with modifications tags and the reference FASTA as inputs, and outputs base modifications (TSV) or base modification frequencies (TSV or bedmethyl).
Minimod reads base modification information encoded under MM:Z
and ML:B:C
SAM tags specified in SAMtags specification.
IMPORTANT: minimod is currently in early development stage. So note that the interface, thresholds and defaults may change. Watch out for bugs and open an issue if you find one. This notice will be removed when things are stable and undergo more rigorous testing
- Installation
- Usage
- Examples
- minimod view
- minimod mod-freq
- Modification codes and contexts
- Modification threshold
sudo apt-get install zlib1g-dev # install zlib development libraries
VERSION=v0.2.0
wget https://github.com/warp9seq/minimod/releases/download/$VERSION/minimod-$VERSION-release.tar.gz
tar xvf minimod-$VERSION-release.tar.gz
cd minimod-$VERSION/
scripts/install-hts.sh # download and compile the htslib
make
Usage information can be printed using minimod -h
command.
Usage: minimod <command> [options]
command:
view view base modifications
mod-freq output base modifications frequencies
# view all 5mC methylations at CG context in tsv format (default mod code: m, context:CG)
minimod view ref.fa reads.bam > mods.tsv
# 5mC methylation frequencies at CG context in tsv format (default mod code: m, threshold: 0.8, context:CG)
minimod mod-freq ref.fa reads.bam > modfreqs.tsv
# 5mC methylation frequencies at CG context in bedmethyl format (default mod code: m, threshold: 0.8, context:CG)
minimod mod-freq -b ref.fa reads.bam > modfreqs.bedmethyl
# modification frequencies of multiple types ( m (5-methylcytosine) and h (5-hydroxymethylcytosine) in CG context with thresholds 0.8 and 0.7 respectively )
minimod mod-freq -c m[CG],h[CG] -m 0.8,0.7 ref.fa reads.bam > mods.tsv
minimod view ref.fa reads.bam > mods.tsv
This writes all base modifications (default modification code "m") to a file (mods.tsv) in tsv format. Sample output is given below.
Usage: minimod view ref.fa reads.bam
basic options:
-c STR modification code(s) (eg. m, h) [m]
-t INT number of processing threads [8]
-K INT batch size (max number of reads loaded at once) [512]
-B FLOAT[K/M/G] max number of bases loaded at once [20.0M]
-h help
-p INT print progress every INT seconds (0: per batch) [0]
-o FILE output file [stdout]
--verbose INT verbosity level [4]
--version print version
Sample mods.tsv output
ref_contig ref_pos strand read_id read_pos mod_code mod_prob
chr22 19979864 + m84088_230609_030819_s1/55512555/ccs 14 m 0.709804
chr22 19979882 + m84088_230609_030819_s1/55512555/ccs 32 m 0.949020
chr22 19979885 + m84088_230609_030819_s1/55512555/ccs 35 m 0.980392
chr22 19979888 + m84088_230609_030819_s1/55512555/ccs 38 m 0.780392
chr22 19979900 + m84088_230609_030819_s1/55512555/ccs 50 m 0.623529
chr22 19979902 + m84088_230609_030819_s1/55512555/ccs 52 m 0.992157
chr22 19979929 + m84088_230609_030819_s1/55512555/ccs 79 m 0.941176
chr22 19979939 + m84088_230609_030819_s1/55512555/ccs 89 m 0.141176
chr22 19979948 + m84088_230609_030819_s1/55512555/ccs 98 m 0.623529
Field | Type | Definition |
---|---|---|
1. ref_contig | str | chromosome |
2. ref_pos | int | position (0-based) of the base in reference |
3. strand | char | strand (+/-) of the read |
4. read_id | str | name of the read |
5. read_pos | int | position (0-based) of the base in read |
6. mod_code | char | base modification code as in SAMtags: 1.7 Base modifications |
7. mod_prob | float | probability (0.0-1.0) of base modification |
minimod mod-freq ref.fa reads.bam > modfreqs.tsv
This writes base modification frequencies (default modification code "m" in CG context with modification threshold 0.8) to a file (modfreqs.tsv) file in tsv format.
Usage: minimod mod-freq ref.fa reads.bam
basic options:
-b output in bedMethyl format [not set]
-c STR modification codes (eg. m, h) [m]
-m FLOAT min modification threshold(s). Comma separated values for each modification code given in -c [0.8]
-t INT number of processing threads [8]
-K INT batch size (max number of reads loaded at once) [512]
-B FLOAT[K/M/G] max number of bases loaded at once [20.0M]
-h help
-p INT print progress every INT seconds (0: per batch) [0]
-o FILE output file [stdout]
--verbose INT verbosity level [4]
--version print version
Sample modfreqs.tsv output
contig start end strand n_called n_mod freq mod_code
chr22 20016337 20016337 + 5 0 0.000000 m
chr22 20016594 20016594 + 2 0 0.000000 m
chr22 20017045 20017045 + 1 0 0.000000 m
chr22 19970705 19970705 + 1 0 0.000000 m
chr22 19981716 19981716 + 1 1 1.000000 m
chr22 20020909 20020909 + 3 0 0.000000 m
chr22 19995719 19995719 + 4 2 0.500000 m
chr22 20017060 20017060 + 1 0 0.000000 m
chr22 19971259 19971259 + 1 1 1.000000 m
Field | Type | Definition |
---|---|---|
1. contig | str | chromosome |
2. start | int | position (0-based, inclusive) of the base |
3. end | int | position (0-based, inclusive) of the base |
4. strand | char | strand (+/-) of the read |
5. n_called | int | number of reads called for base modification |
6. n_mod | int | number of reads with base modification |
7. freq | float | n_mod/n_called ratio |
8. mod_code | char | base modification code as in SAMtags: 1.7 Base modifications |
Sample modfreqs.bedmethyl output
chr22 20016387 20016388 m 4 - 20016387 20016388 255,0,0 4 0.000000
chr22 20016820 20016821 m 1 + 20016820 20016821 255,0,0 1 0.000000
chr22 19999255 19999256 m 7 + 19999255 19999256 255,0,0 7 0.000000
chr22 20016426 20016427 m 1 + 20016426 20016427 255,0,0 1 100.000000
chr22 19988365 19988366 m 1 - 19988365 19988366 255,0,0 1 100.000000
chr22 19988168 19988169 m 1 - 19988168 19988169 255,0,0 1 100.000000
chr22 20016904 20016905 m 1 + 20016904 20016905 255,0,0 1 0.000000
chr22 20011898 20011899 m 8 - 20011898 20011899 255,0,0 8 25.000000
chr22 19990123 19990124 m 3 + 19990123 19990124 255,0,0 3 0.000000
chr22 19982787 19982788 m 1 + 19982787 19982788 255,0,0 1 0.000000
Field | Type | Definition |
---|---|---|
1. contig | str | chromosome |
2. start | int | position (0-based, inclusive) of the base |
3. end | int | position (0-based, not inclusive) of the base |
4. mod_code | char | base modification code as in SAMtags: 1.7 Base modifications |
5. n_mod | int | number of reads with base modification |
6. strand | char | strand (+/-) of the read |
7. start | int | = field 2 |
8. end | int | = field 3 |
9. n_mod | int | = field 5 |
10. freq | float | n_mod/n_called ratio |
Base modification codes and contexts can be set for both view and mod-freq tool using -c option to take only specific base modifications found in a given contexts. The context should match in the reference and bases in unmatching contexts are ignored.
Here is an example command to explain all possible context formats.
minimod view -c a[A],h[CG],m,a[*] ref.fa reads.bam
minimod mod-freq -c a[A],h[CG],m,a[*] ref.fa reads.bam
- a[A] : type a modifications of all A bases
- h[CG]: type h modifications in CG context (CpG sites)
- m: type m modifications in default CG context
- a[*]: type a modifications in all contexts
If the context is not specified in square brackets along with modification code, minimod will consider following default contexts.
All possible modification codes supported by minimod along with default contexts if not specified (SAMtags: 1.7 Base modifications)
Unmodified base | Code | Abbreviation | Name | Default context |
---|---|---|---|---|
C | m | 5mC | 5-Methylcytosine | CG |
C | h | 5hmC | 5-Hydroxymethylcytosine | CG |
A | a | 6mA | 6-Methyladenine | A |
Note that we have done a lot of testing on 5mc and some limited testing on 6mA and 5hmC. The others are not yet tested.
Base modification threshold can be set for mod-freq tool using -m option.
- 5mC modification(default context :CG) frequencies with threshold 0.8
minimod mod-freq -c m -m 0.8 ref.fa reads.bam
If p(5mC) >= 0.8 (threshold), called(5mC) and modified(5mC) If p(5mC) <= 0.2 (1-threshold), called(5mC) else, ignored as ambiguous mod_freq(5mC) = total_modified(5mC)/total_called(5mC)
- 5mC and 5hmC base modification(default context :CG) frequencies with thresholds 0.8, 0.7 respectively
minimod mod-freq -c m,h -m 0.8,0.7 ref.fa reads.bam
If p(5mC) >= 0.8 (threshold), called(5mC) and modified(5mC) If p(5mC) <= 0.2 (1-threshold), called(5mC) else, ambiguous mod_freq(5mC) = total_modified(5mC)/total_called(5mC)
If p(5hmC) >= 0.7 (threshold), called(5hmC) and modified(5hmC) If p(5hmC) <= 0.3 (1-threshold), called(5hmC) else, ignored as ambiguous mod_freq(5hmC) = total_modified(5hmC)/total_called(5hmC)
Make sure that you handle the modification tags correctly in each step in base modification calling pipeline (e.g., providing both -y
and -Y
to minimap2). See the example pipeline that we use below.
-
Use a basecalling model trained to identify modified bases.
Example: Base-calling a slow5 file using buttery-eel
buttery-eel --call_mods --config dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac.cfg -i reads.slow5 -o reads.sam -g path/to/guppy/bin samtools fastq -TMM,ML reads.sam > reads.fastq
-
Avoid unmapped reads
-
Avoid secondary alignments
-
Use soft clipping for supplementary alignments
Corresponding minimap2 flags are as follows.
Minimap2 Flag Description --sam-hit-only Avoid unmapped reads -Y Use soft clipping for supplementary alignments -y Copy input FASTA/Q comments to output --secondary=no Avoid secondary alignments Example: aligning ONT reads using minimap2
minimap2 -ax map-ont --sam-hit-only -Y -y --secondary=no ref.idx reads.fastq
-
If more than 90% of the reads in the BAM file are skipped due to various reasons (unmapped, 0 length, or missing MM/ML tags), minimod prints a warning message. However, if all of them are skipped minimod errors out.
-
If hard clipped non-primary alignments are found, minimod errors out. (to filter out non-primary alignments:
samtools view -h -F 2308 reads.bam -o primary_reads.bam
or use minimap2 with -Y to use soft clipping).
- Does not support alignment BAM files with modification codes as numeric ChEBI codes in MM tag
- Status of skipped bases (encoded as . or ? in MM tag) are ignored