This is the repository associated with the manuscript »Host-level biodiversity shapes the dynamics and networks within the coral reef microbiome«. It contains all code for analyses and figures presented in the manuscript.
To reproduce the analysis, you need a local installation of git (clone the repository with: git clone [email protected]:SushiLab/reef-microbiomics-paper.git
) and R. Download the data from https://zenodo.org/records/14131612 into a directory named data
at the same level as code
. As computational resources might be limited, we provide intermediary analyses files on Zenodo, too.
00 --> generate colour palette
01-15 --> curate data, including decontamination, merging, and rarefying
16-18 --> compute diversity metrics
19-21 --> community-based analyses: compare communities
22-23 --> community-based analyses: framework microbial exchange
24-27 --> community-based analyses: interaction networks
28-31 --> member-based analyses: interaction networks
32-35 --> member-based analyses: differential abundance analysis
36-40 --> functional prediction analyses
- Define colour palette (
colour_palette.RDS
)
- Remove reads (i) unclassified at the domain-level, (ii) belonging to the domain Eukaryota, (iii) belonging to the family Mitochondria, (iv) belonging to the order Chloroplast from the host-associated data
- Merge the two ASV tables using the
mergeSequenceTables()
andcollapseNoMismatch()
functions by dada2 - Rename ASVs and sample names
- Generate the ASV data table (
asv_dat_original_host-associated.tsv
) and ASV abundance table (asv_abtab_original_host-associated.tsv
)
- Compute and plot sequencing depth
- Subsample to 1000 reads
- Compare microbial communities (square root transformed Bray-Curtis distances)
- Analyse sample-to-blank distances
- Compare mucus vs control communities
- Run
decontam()
with thresholds 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5 and generate decontaminated ASV tables and stats files - Explore decontam scores
- Compute and plot percentage of original reads/ASVs kept after
decontam()
using various thresholds (=> default threshold of 0.1 works well)
- Subsample to 1000 reads
- Analyse sample-to-blank distances for both the original and the decontaminated (at 0.1) ASV abundance tables
- Compare microbial communities (square root transformed Bray-Curtis distances) after decontamination
- Run PERMANOVA and PERMDISP (=>
species
significant) - Compare microbial communities (square root transformed Bray-Curtis distances) by taxonomic group
- Remove all samples that were lost during the experiment as well as all positive and negative controls
- Write cleaned metadata (
metadat_cleaned_host-associated.tsv
), ASV data table (asv_dat_cleaned_host-associated.tsv
), and ASV abundance table (asv_abtab_cleaned_host-associated.tsv
; threshold = 0.1) - Compute and plot sequencing depth
- Remove reads (i) unclassified at the domain-level, (ii) belonging to the domain Eukaryota, (iii) belonging to the family Mitochondria, (iv) belonging to the order Chloroplast from the free-living and exuded data
- Merge the two ASV tables using the
mergeSequenceTables()
andcollapseNoMismatch()
functions by dada2 - Rename ASVs and sample names
- Generate the ASV data table (
asv_dat_original_free-living.tsv
) and ASV abundance table (asv_abtab_original_free-living.tsv
)
- Compute and plot sequencing depth
- Subsample to 1000 reads
- Compare microbial communities (square root transformed Bray-Curtis distances)
- Run
decontam()
with thresholds 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5 and generate decontaminated ASV tables and stats files - Explore decontam scores
- Compute and plot percentage of original reads/ASVs kept after
decontam()
using various thresholds (=> more stringent threshold of 0.2 works well)
- Subsample to 1000 reads
- Analyse sample-to-blank distances for both the original and the decontaminated (at 0.2) ASV abundance tables
- Compare microbial communities (square root transformed Bray-Curtis distances) after decontamination
- Run PERMANOVA and PERMDISP (=>
species
significant) - Compare microbial communities (square root transformed Bray-Curtis distances) by taxonomic group
- Remove all samples that were lost during the experiment as well as all positive and negative controls
- Write cleaned metadata (
metadat_cleaned_free-living.tsv
), ASV data table (asv_dat_cleaned_free-living.tsv
), and ASV abundance table (asv_abtab_cleaned_free-living.tsv
; threshold = 0.2) - Compute and plot sequencing depth
- Merge the decontaminated (host-associated and free-living) ASV tables using the
mergeSequenceTables()
andcollapseNoMismatch()
functions by dada2 - Remove singleton ASVs (present in a single sample at an abundance of 1)
- Rename ASVs and sample names
- Generate the ASV data table (
asv_dat_original.tsv
) and ASV abundance table (asv_abtab_original.tsv
)
- Merge the two metadata tables (
metadat_cleaned.tsv
) - Compute and plot sequencing depth
- Rarefy 50 times to 2500 reads per sample (
asv_abtab_rarefied_{1,50}.tsv
) - Write metadata excluding dropped samples (
metadat_rarefied.tsv
)
- Load all 50 rarefied ASV abundance tables
- Extract the sequence information
- Write fasta files (
asv_abtab_rarefied_{1,50}.fasta
)
- Assign taxonomic info to ASVs using
silva_nr99_v138.1_train_set.fa
- Write taxonomy file (
asv_dat_taxinfo.tsv
)
- Load all 50 rarefied ASV abundance tables and the associated metadata
- Compute the square-root transformed Bray-Curtis distances for each of the 50 ASV tables and save as a list
- Set the diagonal to NA
- Reduce the list to a single distance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the square-root transformed Bray-Curtis distance table (
asv_bctab.tsv
)
- Load all 50 rarefied ASV abundance tables and the associated metadata
- Compute the hill number diversity indices of each sample for each of the 50 ASV tables and save as a list
- Reduce the list to a single distance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the richness table (
asv_richtab.tsv
)
- Load all 50 rarefied ASV abundance tables and the associated metadata
- Compute the number of shared ASVs for each of the 50 ASV tables and save as a list
- Set the diagonal to NA
- Reduce the list to a single distance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the number of shared ASVs table (
asv_shared.tsv
)
Compare the host-associated and exuded microbial communities by species/sample type
- Load the square-root transformed Bray-Curtis distance table
- Perform a PCoA on the distances
- Run a PERMANOVA to test for differences explained by species and PERMDISP (=>
species
significant)
Correlate alpha-diversity changes of exuded vs host-associated microbial communities
- Load the richness table
- Plot exudate vs host Hill numbers by species and get correlation coefficient
Compare the host-associated and exuded microbial communities of hosts living in biodiverse vs degraded reef communities
- Select samples from phase 2
- For each source and species, run a PERMANOVA to test for differences explained by the ambient community complexity
Explore host-specificity and exudation
- Transform the distance table into a 3-column dataframe (keeping all data from all three phases)
- Prepare the host-to-tank distance dataframe
- Prepare the exudates-to-incubation control dataframe
- Plot exudate to control (x-axis) vs host to tank (y-axis)
- Draw an ellipse at 0.5 (half of the points lie within the ellipse)
- Use data to define cutoffs for groups (mean)
- Test whether the groups are significantly different (=>
species
significant) - Add standard deviation to each data point (as one point is the mean of three distances)
Explore host-specificity and exudation as a function of the complexity of the reef community
- Select samples from phase 2
- Transform the distance table into a 3-column dataframe
- Prepare the host-to-tank distance dataframe
- Prepare the exudates-to-incubation control dataframe
- Compare both distances of hosts living in a biodiverse vs degraded reef community by species with
p.adjust.method="holm"
- Test whether the mean of the groups is significantly different as a response to treatment
- Select species samples from phase 2
- Extract distances from
asv_bctab.tsv
across treatment and source - Group by host and average distance across samples
- Convert dissimilarity to similarity
- Write the edge files
- Compute stats on the full data comparing treatments and write the stats files
- Select species samples from phase 2
- Extract richness from
asv_richtab.tsv
across treatment and source - Group by host and average richness across samples
- Write the node files
- Compute stats on the full data comparing treatments and write the stats files
- Load edge and node files
- Compute similarity ratio between biodiverse and degraded by calculating
$0.5-\frac{similarity_{degraded}}{similarity_{biodiverse}+similarity_{degraded}}$ (=> negative values mean the similarity between hosts is higher in a degraded habitat, positive values in a biodiverse habitat; the higher the values the more the similarities between the two treatments differ) - Compute richness ratio between biodiverse and degraded by calculating
$0.5-\frac{richness_{degraded}}{richness_{biodiverse}+richness_{degraded}}$ (=> negative values mean the richness of a host is higher in a degraded habitat, positive values in a biodiverse habitat; the higher the values the more the richness between the two treatments differ) - Initiate the network plot using the packages
igraph
andggraph
- Plot the network with colours denoting in which treatment the values are higher and line width/circle size indicating the magnitude of the difference
- Load edge and node files
- Initiate the network plot using the packages
igraph
andggraph
- Plot the network with line width and colours denoting the magnitude of the compositional interactions and circle size indicating the richness
- Select species samples from phase 2
- Extract distances from
asv_shared.tsv
across treatment and source - Group by host and average distance across samples
- Write the edge files
- Compute stats on the full data comparing treatments and write the stats files
- Load all 50 rarefied ASV abundance tables and the associated metadata
- Select species samples from phase 2
- Group by host and compute number of unique ASVs across treatment and source using the function
upset_data()
- Reduce the list to a single distance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the node files
- Load edge and node files
- Compute shared ratio between biodiverse and degraded by calculating
$0.5-\frac{shared_{degraded}}{shared_{biodiverse}+shared_{degraded}}$ (=> negative values mean the number of shared ASVs between hosts is higher in a degraded habitat, positive values in a biodiverse habitat; the higher the values the more the number of shared ASVs between the two treatments differ) - Compute unique ratio between biodiverse and degraded by calculating
$0.5-\frac{unique_{degraded}}{unique_{biodiverse}+unique_{degraded}}$ (=> negative values mean the number of unique ASVs of a host is higher in a degraded habitat, positive values in a biodiverse habitat; the higher the values the more the number of unique ASVs between the two treatments differ) - Initiate the network plot using the packages
igraph
andggraph
- Plot the network with colours denoting in which treatment the values are higher and line width/circle size indicating the magnitude of the difference
- Load edge and node files
- Initiate the network plot using the packages
igraph
andggraph
- Plot the network with line width and colours denoting the level of shared ASVs and circle size indicating the unique ASVs
Analyse the differentially abundant ASVs between source and treatment in two separate analyses
- Select samples from phase 2
- Load the modified ancombc R files
- For each source separately, subset and transform the data by species
- Construct TreeSummarizedExperiment object
- Use
ancombc2()
with fixed effect$treatment_binary$ andp_adj_method="holm"
- Write ANCOMBC2 output files (
diffab_species.tsv
)
Plot the DA output
- Plot log-fold changes (based on the natural log) with significant differences in red (after correcting for multiple comparisons using the Holm method)
Prepare the compilated log-fold change file for iTOL
- Merge the DA output
- Pull the significant log-fold changes between biodiverse and degraded communities (
asv_diffab_lfc.tsv
)
Examine the phylogeny of the differentially abundant ASVs (across host)
- Align sequences through multiple sequence aligner using Muscle
- Test maximum-likelihood models
- Select the two superior ones and compute trees
- Bootstrap 100 times
- Download the Pathway list, the KO list, and the KO to Pathway link files
- Reformat and write files (
pathway_list.tsv
,ko_list.tsv
,ko_pathway_link.tsv
)
#! /bin/bash
#SBATCH --job-name picrust # how the job will be named
#SBATCH --output picrust.log # filename for the output messages
#SBATCH --error picrust.log # filename for the error messages
#SBATCH -n 20 # number of CPUs
#SBATCH --mem-per-cpu=5G # memory per CPU
#SBATCH --time=14-23:00:00 # maximum time to complete (after which it is killed)
eval "$(conda shell.bash hook)"
conda activate picrust2_env
mkdir /[INSERT PATH]/data/picrust_output
for i in /[INSERT PATH]/data/picrust_input/*.fasta ; \
do j=$( echo $i | sed 's/.fasta//' | sed 's/.*picrust_input[/]//' ) ; \
picrust2_pipeline.py -s "$i" -i /[INSERT PATH]/data/picrust_input/"$j".tsv \
-o /[INSERT PATH]/data/picrust_output/"$j" -p 20 ; done
- Load all 50 generated predicted unstratified metagenome files (
pred_metagenome_unstrat.tsv.gz
) - Get a list of the functions (K numbers) that are present across all 50 rarefaction files
- Extract these functions from each abundance file and save as a list
- Reduce the list to a single abundance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the abundance table (
ko_abtab_sample.tsv
)
#! /bin/bash
#SBATCH --job-name picrust # how the job will be named
#SBATCH --output picrust.log # filename for the output messages
#SBATCH --error picrust.log # filename for the error messages
#SBATCH -n 20 # number of CPUs
#SBATCH --mem-per-cpu=5G # memory per CPU
#SBATCH --time=14-23:00:00 # maximum time to complete (after which it is killed)
eval "$(conda shell.bash hook)"
conda activate picrust2_env
pathway_pipeline.py -i /[INSERT PATH]/data/ko_abtab_sample.tsv \
-o /[INSERT PATH]/data/picrust_output \
--no_regroup --map /[INSERT PATH]/data/KEGG_pathway_to_ko.tsv -p 20
- Load the KEGG pathway list (
pathway_list_cat.csv
) with hand annotated categories - Load the KEGG pathway converted abundance table (
path_abun_unstrat.tsv.gz
) and extract pathways associated with metabolism, but remove global and overview maps - Compute the square-root transformed Bray-Curtis distances
- Perform a PCoA on the distances
- Run a PERMANOVA to test for differences explained by species and PERMDISP (=>
species
significant) and test pairwise usingpairwise.adonis2()
- Load all 50 generated predicted KO abundance files (
KO_predicted.tsv.gz
) - Get a list of the KOs that are present across all 50 rarefaction files
- Extract these KOs from each abundance file and save as a list
- Reduce the list to a single abundance table by taking the sum of the corresponding elements and calculating the mean by dividing by the length of the dataframe
- Write the abundance table (
ko_abtab_asv.tsv
)
- Load the KEGG pathway list (
pathway_list_cat.csv
) with hand annotated categories - Load the ko-to-pathway link file (
ko_pathway_link.tsv
) and extract pathways associated with metabolism, but remove global and overview maps - Load the KO abundance table (
ko_abtab_asv.tsv
) - Select the ASVs of interest (two closely related members within the family Flavobacteriaceae that are differentially abundant as a response to treatment in Sinularia sp.)
- Summarise the KOs to pathways
- Remove pathways that are absent
- Convert pathway abundance table to presence-absence and annotate
- Generate upset plot