Metagenomic binning, quantification, and analysis with metaWRAP
Notes:
- Bin 26 is E. coli, Bin 52 is M. aeruginosa, Bin 18 should be Runella slithyformis
- Bin 43 is a new thing: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=1925548&lvl=3&lin=f&keep=1&srchmode=1&unlock
- Bin 43 contigs are 100% classified as Fonsibacter
TODO:
- Pull reads that map to the 16S gene from libraries, classify them, and assign them to bins based on a combined objective of taxonomic class & abundance vector matches
- Make a master script to string together accessory shell scripts from MARCC
- Add shell scripts from MARCC to scripts
- Use PhyloSift to add taxonomy to poorly classified bins
- Check degree of homology between & within bins
- Determine the degree of E. coli sequence dispersion
- Compare environmental gene abundances within each bin to overall bin abundance
- Make map of gene:process and plot process by sample in violins
- Determine which CAZy genes are responsive to DOM & check their internal consistency
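The first TODO item (assigning 16S sequences to bins by a combined objective) could be sketched roughly as below. This is only an illustration, not the project's method: the scoring function, the `tax_weight` parameter, the bin ids, and the toy abundance vectors are all assumptions made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def assign_to_bin(seq_taxon, seq_abund, bins, tax_weight=0.5):
    """Return the id of the bin with the best combined score.

    bins maps bin id -> (taxonomic class, abundance vector across samples).
    The score is a weighted sum of a 0/1 taxonomy match and the Pearson
    correlation of the abundance vectors (a hypothetical objective).
    """
    def score(item):
        taxon, abund = item[1]
        tax_match = 1.0 if taxon == seq_taxon else 0.0
        return tax_weight * tax_match + (1 - tax_weight) * pearson(seq_abund, abund)

    return max(bins.items(), key=score)[0]


# Toy data, invented for illustration only
bins = {
    "bin.26": ("Gammaproteobacteria", [10.0, 2.0, 1.0]),
    "bin.18": ("Cytophagia", [1.0, 5.0, 9.0]),
}
best = assign_to_bin("Cytophagia", [0.5, 2.0, 4.0], bins)
print(best)
```

A weighted sum keeps the two criteria easy to rebalance; a stricter variant might require a taxonomy match first and use the abundance correlation only to break ties.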
Contents of Data/ include:
Bin_Taxonomy.xlsx
: Consensus bin taxonomy & scores
BinningMethodsText.docx
: Description of methods required by GSC
MAG_Min_Standards_Description.xlsx
: GSC reference document for req'd MAG info
Mystic_MAG_Quality_Stats.xlsx
: Individual MAG stats incl. # unique tRNAs, # rRNA seqs, completeness, contamination, GC, # contigs/genes/scaffolds, coding density, GC pct, GC std, genome size, contig/scaffold length, N50, and more
Bin_Abundances/
bin_abundance.tab
: matrix of average bin coverage per sample
bin_counts_normed.tsv
: unit normalized within sample & then normalized by library size
genome_abundance_heatmap.png
: clustered heatmap of bin_abundance.tab with assoc. dendrograms
sample_read_count.tab
: vector of read counts per sample for normalizing bin_abundance.tab
Blob_Plots/
: deeply unhelpful plots of GC% v. coverage of contigs colored in various ways
KEGG_Annotations/
AllProteinsRenamed.faa.bz2
: FASTA of ALL protein sequences with bin name & gene id in header (uploaded to KEGG)
Aggregated_Annotations.tsv
: Combined & quality-filtered KEGG annotations by the GhostKOALA & BlastKOALA programs
Select_Annotations.tsv
: Same format as Aggregated_Annotations.tsv but restricted to KO numbers of interest
Select_Ks_By_Bin.tsv
: a matrix of select KO hit counts by bin
All_HMM_Hits_raw.tsv
: a matrix of dbCAN & metabolic-hmm annotations with gene id & bin info
HMM_counts_by_bin.tsv
: a matrix of hmm hit counts by bin
gene_abundances_raw.tsv
: a matrix of select annotation coverages by sample (sample normalized)
duplicate_clusters.tsv
: sequences flagged by Salmon as duplicates while calculating coverage (*by_bin.tsv files need correction)
Krona_Plots/
mysticLibs_kronagram.html
: assumed taxonomic abundances using QC'd reads as input
final_assembly_kronagram.html
: assumed taxonomic abundances using assembled contigs as input
QUAST_CoAssembly_Stats/
report.html
: contains co-assembly descriptive statistics
16S_Info/
16s_annotations.gff
: contains info on bin #, contig id, gene id, locus, & length
RibosomalRNA.RDP_classes.txt
: contains taxa classes for assembled 16S rRNA
RibosomalRNA.fa
: contains assembled 16S rRNA sequence with bin number of origin & gene id in header
intermediate_files/
: probably unnecessary stuff
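The two-step normalization described for bin_counts_normed.tsv (unit normalize within sample, then normalize by library size) can be read in more than one way. The sketch below shows one literal reading only; the function name, the in-memory dict layout, the per-million-reads scaling, and the toy numbers are assumptions, not the code that produced the file.

```python
def normalize_bins(coverage, read_counts):
    """coverage: {bin id: {sample: average coverage}}; read_counts: {sample: reads}.

    Step 1: scale each sample so its bin coverages sum to 1.
    Step 2: divide by the sample's library size in millions of reads
            (the per-million scaling is an assumption).
    """
    samples = list(read_counts)
    # Total coverage per sample across all bins (column sums)
    totals = {s: sum(row[s] for row in coverage.values()) for s in samples}
    normed = {}
    for bin_id, row in coverage.items():
        normed[bin_id] = {
            s: (row[s] / totals[s]) / (read_counts[s] / 1e6) if totals[s] else 0.0
            for s in samples
        }
    return normed


# Toy inputs, invented for illustration only
coverage = {
    "bin.1": {"S1": 30.0, "S2": 5.0},
    "bin.2": {"S1": 10.0, "S2": 15.0},
}
read_counts = {"S1": 2_000_000, "S2": 1_000_000}
normed = normalize_bins(coverage, read_counts)
print(normed)
```

In practice the inputs would be parsed from bin_abundance.tab and sample_read_count.tab; their exact column layout is not stated here, so the parsing step is left out.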