Skip to content

Commit

Permalink
Merge pull request #8 from cmmr/patch_v0.4
Browse files Browse the repository at this point in the history
update schematic and readme
  • Loading branch information
mtisza1 authored Sep 13, 2023
2 parents a553a91 + 93e61ec commit f3ff94b
Show file tree
Hide file tree
Showing 3 changed files with 6,268 additions and 31 deletions.
80 changes: 49 additions & 31 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

Trans-Kingdom Marker Gene Pipeline for Taxonomic Profiling of Human Metagenomes

If you want to detect and quantify phages, bacteria, archaea, and/or microeukaryotes in whole genome shotgun reads from human-derived samples, `Marker-MAGu` is the tool for you.
If you want to detect and quantify phages, bacteria, and archaea in whole genome shotgun reads from human-derived samples, `Marker-MAGu` is the tool for you.

Basically, this tool uses the strategy (marker gene detection) and database (marker gene sequences) from [Metaphlan4](https://github.com/biobakery/MetaPhlAn), then:

Expand All @@ -13,6 +13,8 @@ Basically, this tool uses the strategy (marker gene detection) and database (mar

Also, as in `Metaphlan4` **SGBs**, or **S**pecies-level **G**enome **B**ins are genomically-distinct species.

<img src="schematic/marker_magu1.png" width="150"/>

## Schematic

![Marker-MAGu](schematic/schematic1.png)
Expand All @@ -39,33 +41,32 @@ Note: if you can't or won't use `Conda` for environment management, you can chec

`conda activate Marker-MAGu`

5) Make a command line entry point
5) Make a command line entry point

`pip install .`

5) Download the database in the `Marker-MAGu` directory (\~9.6 GB when decompressed).
5) Download the database (\~9.6 GB when decompressed).

`cd Marker-MAGu` *or `cd` to where you want the database to reside*

`mkdir DBs && cd DBs`

`wget https://zenodo.org/record/7975097/files/Marker-MAGu_markerDB_v1.0.tar.gz`
`wget https://zenodo.org/record/8342581/files/Marker-MAGu_markerDB_v1.1.tar.gz`

`md5sum`

should return `2e3a44c790765c55dcdc5e5d4521ac3a`
should return `e0947cb1d4a3df09829e98627021e0dd`

`tar -xvf Marker-MAGu_markerDB_v1.0.tar.gz`
`tar -xvf Marker-MAGu_markerDB_v1.1.tar.gz`

With default instructions, you should now have the database at: `Marker-MAGu/DBs/v1.0/Marker-MAGu_markerDB.fna`
With default instructions, you should now have the database at: `/path/to/DBs/v1.1/Marker-MAGu_markerDB.fna`

`rm Marker-MAGu_markerDB_v1.0.tar.gz`
`rm Marker-MAGu_markerDB_v1.1.tar.gz`

6) Set the `Marker-MAGu` database directory variable **MARKERMAGU_DB**, e.g.:
6) Set the `Marker-MAGu` database directory variable **MARKERMAGU_DB**, e.g.:

`conda env config vars set MARKERMAGU_DB=/path/to/DBs/v1.0`


## (OPTIONAL) Database for filtering out host reads and spike-ins

You could filter unwanted sequences out upstream of this tool, but this will allow you to do it within `Marker-MAGu` using `minimap2`. The pipeline script will look for a file at `filter_seqs/filter_seqs.fna` which could be any fasta-formatted sequence file you want to use to remove matching reads (e.g. from host or spike-in).
Expand Down Expand Up @@ -122,25 +123,31 @@ Individual samples can be run with the python script. E.g.:
**Basic run with 1 .fastq file:**

```
markermagu -r /path/to/reads/myreads.fastq -s sample_ABC -t 16 -o myproject_MM1
markermagu -r /path/to/reads/myreads.fastq -s sample_ABC -o myproject_MM1
```

**Using multiple input .fastq files (`Marker-MAGu` doesn't used paired-end info)**

```
markermagu -r /path/to/reads/myreads1.fastq /path/to/reads/myreads2.fastq -s sample_ABC -t 16 -o myproject_MM1
markermagu -r /path/to/reads/myreads1.fastq /path/to/reads/myreads2.fastq -s sample_ABC -o myproject_MM1
```

**Using multiple input .fastq files with wildcard**

```
markermagu -r /path/to/reads/*.fastq -s sample_ABC -t 16 -o myproject_MM1
markermagu -r /path/to/reads/*.fastq -s sample_ABC -o myproject_MM1
```

**With pre-filtering steps:**

```
markermagu -r /path/to/reads/myreads.fastq -s sample_ABC -t 16 -o myproject_MM1 -q True -f True
markermagu -r /path/to/reads/myreads.fastq -s sample_ABC -o myproject_MM1 -q True -f True
```

**With relaxed detection**

```
markermagu -r /path/to/reads/myreads.fastq -s sample_ABC -o myproject_MM1 --detection relaxed
```

**Help menu**
Expand Down Expand Up @@ -196,11 +203,23 @@ wide_dt <- long_dt %>%
values_from = rel_abundance, values_fill = 0)
```

### Source for virus taxonomy

| Level | Source |
|---------|-----------------------------|
| Kingdom | 'Virus' for all |
| Phylum | Genomad |
| Class | Genomad |
| Order | Genomad |
| Family | Genomad |
| Genus | Vcontact2 (arbitrary names) |
| Species | ANIclust (arbitrary names) |

### Other notes

Relative abundance values from each sample will add up to 1, so there is no "unknown fraction estimate"

Currently, `Marker-MAGu` only profiles phages with 4 or more unique marker genes, so small, ssDNA phages such as microviruses and inoviruses are unlikely to be detectable. I'm working on it.
Currently, `Marker-MAGu` only profiles phages with 4 or more unique marker genes, so small, ssDNA phages such as microviruses and inoviruses may be undetectable. I'm working on it.

Should any issues arise, please leave an issue in this GitHub Repo.

Expand Down Expand Up @@ -247,6 +266,7 @@ blast v2.9.0
**1) call genes**

make sure variable is set for your file `MYSEQS="my_viruses1"`

```
prodigal -a ${MYSEQS}.prot.faa -d ${MYSEQS}.genes.fna -i ${MYSEQS}.fna -p meta -q
```
Expand Down Expand Up @@ -289,28 +309,28 @@ seqkit grep -j 16 -f ${MYSEQS}.hallmark_gene_list.txt ${MYSEQS}.prot.faa > ${MYS

This command will only keep genomes with 4 or more hallmark genes as is recommended for `Marker-MAGu`. See awk expression `if ($1>=4)`.

```
```
sed 's/[^_]*$//' ${MYSEQS}.hallmark_gene_list.txt | sort | uniq -c | awk '{if ($1>=4) {print $2}}' > ${MYSEQS}.marker_genomes_list.txt
```

**2) Make fastas of genome-specific marker genes**

```
```
mkdir indiv_genomes
cat ${MYSEQS}.marker_genomes_list.txt | xargs -n 1 -I {} -P 16 seqkit grep -j 1 --quiet -r -p "{}" ${MYSEQS}.hallmark_genes.nucl.fna -o indiv_genomes/{}_hmg.fna
```

**3) Concatenate hallmark genes for each genome**

```
```
HMGS=$( find indiv_genomes/ -type f -name "*__hmg.fna" | sed 's/\.fna//g' )
if [ -n "$HMGS" ] ; then
for HM in $HMGS ; do
bioawk -v genq="${HM#indiv_genomes/}" -c fastx '{if (NR == 1) {print ">"genq ; print $seq} else {print $seq }}' ${HM}.fna > ${HM}.concat.fna ;
done
for HM in $HMGS ; do
bioawk -v genq="${HM#indiv_genomes/}" -c fastx '{if (NR == 1) {print ">"genq ; print $seq} else {print $seq }}' ${HM}.fna > ${HM}.concat.fna ;
done
else
echo "hallmark gene files for each genome NOT found"
echo "hallmark gene files for each genome NOT found"
fi
cat indiv_genomes/*_hmg.concat.fna > ${MYSEQS}.hallmark_genes.nucl.concat.fna
Expand All @@ -320,10 +340,9 @@ cat indiv_genomes/*_hmg.concat.fna > ${MYSEQS}.hallmark_genes.nucl.concat.fna

NOTE: If all-vs-all BLASTN only returns self-alignments, the `anicalc.py` script will return an error. In this case, just use the `${MYSEQS}.marker_genomes_list.txt` file instead of the `${MYSEQS}.hallmark_genes.nucl.concat.exemplars1.txt` for step 3).


**1) BLASTN**

```
```
mkdir blast_DBs
makeblastdb -in ${MYSEQS}.hallmark_genes.nucl.concat.fna -dbtype nucl -out blast_DBs/${MYSEQS}.hallmark_genes.nucl.concat
Expand All @@ -333,7 +352,7 @@ blastn -query ${MYSEQS}.hallmark_genes.nucl.concat.fna -db blast_DBs/${MYSEQS}.h

**2) anicalc/aniclust to find redundant sequences**

```
```
python /path/to/Marker-MAGu/src/utils/anicalc.py -i ${MYSEQS}.hallmark_genes.nucl.concat.blastn.tsv -o ${MYSEQS}.hallmark_genes.nucl.concat.anicalc.tsv
python /path/to/Marker-MAGu/src/utils/aniclust.py --fna ${MYSEQS}.hallmark_genes.nucl.concat.fna --ani ${MYSEQS}.hallmark_genes.nucl.concat.anicalc.tsv --out ${MYSEQS}.hallmark_genes.nucl.concat.aniclust.ANI95_TCOV85.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
Expand All @@ -343,27 +362,27 @@ cut -f1 ${MYSEQS}.hallmark_genes.nucl.concat.aniclust.ANI95_TCOV85.tsv > ${MYSEQ

**3) Retreive all the INDIVIDUAL hallmark genes for exemplars**

```
```
bioawk -c fastx '{print $name, $seq}' ${MYSEQS}.hallmark_genes.nucl.fna | grep -F -f ${MYSEQS}.hallmark_genes.nucl.concat.exemplars1.txt | awk '{OFS=FS="\t"}{print ">" $1 ; print $2}' > ${MYSEQS}.hallmark_genes.nucl.exemplars1.fna
```

### Add marker genes to Marker-MAGu

**1) Format marker genes for `Marker-MAGu`**

This is kind of an unrefined piece of code as it does not assign meaningful taxonomical assignments at any level except species ("s__"), which it merely assigns the sequence name. However, it does fulfill the requirements for `Marker-MAGu`, which is to format the header as:
This is kind of an unrefined piece of code as it does not assign meaningful taxonomical assignments at any level except species ("s\_\_"), which it merely assigns the sequence name. However, it does fulfill the requirements for `Marker-MAGu`, which is to format the header as:

*gene_name;hierarchical_taxonomy;genome_of_origin_name*

```
```
sed 's/[^_]*$/@_&/ ; s/^@_//g' ${MYSEQS}.hallmark_genes.nucl.exemplars1.fna | bioawk -c fastx '{ split($name, a, "_@") ; print ">"a[1]a[2]";k__Viruses|p__Unclassified_viruses|c__Unclassified_viruses|o__Unclassified_viruses|f__Unclassified_viruses|g__Unclassified_viruses|s__vOTU_"a[1]";"a[1] ; print $seq}' > ${MYSEQS}.hallmark_genes.nucl.exemplars1.fmt.fna
```

If you want to assign more complex taxonomy to your sequences based on some other tool, please make an Issue and I can upload some additional scripts to the repo to facilitate this.

**2) Concatenate your formatted marker genes to the existing `Marker-MAGu` database**
**2) Concatenate your formatted marker genes to the existing `Marker-MAGu` database**

```
```
cd Marker-MAGu/DBs
mkdir v_${MYSEQS}
Expand All @@ -373,7 +392,6 @@ cat v1.0/Marker-MAGu_markerDB.fna /path/to/${MYSEQS}.hallmark_genes.nucl.exempla

That's it. Remember to specify the updated database when running `Marker-MAGu`. E.g. `--db v_my_viruses1`


## Citation

(insert)
Loading

0 comments on commit f3ff94b

Please sign in to comment.