Workflow stages and data for Morando, Magasin et al. 2024

Workflow for processing nifH amplicon data sets

This repository contains all post-pipeline software stages and data deliverables described in Morando, Magasin et al. 2024. The workflow was used to process nearly all published nifH amplicon MiSeq data sets that existed at the time of publication, as well as two new data sets produced by the Zehr Lab at UC Santa Cruz. The samples are shown in a map that links to an interactive Google map with study names, sample IDs, and collection information for each sample. [Map of studies used in Morando, Magasin et al. 2024]

Workflow overview

The following figure shows the workflow. [Overview of the DADA2 nifH workflow] ASVs were created by our DADA2 nifH pipeline (green). Post-pipeline stages (lavender), each executed by a Makefile or Snakefile, were used to gather the ASVs from all studies, filter the ASVs for quality, and annotate them, as well as to download sample-colocated environmental data from the Simons Collaborative Marine Atlas Project (CMAP). The nifH ASV database generated by the workflow will support future research into N2-fixing marine microbes. The published database and any updated versions are available within the WorkspaceStartup directory, both as nifH_ASV_database.tgz and as the R image workspace.RData. The published database is also available at https://doi.org/10.6084/m9.figshare.23795943.v1.

Running the workflow

The workflow requires the DADA2 nifH pipeline as well as its ancillary tools; please see the Installation directory in the pipeline repository. Additionally, you will need to install GNU make to run the post-pipeline stages, each of which uses a Makefile. Please see the Installation instructions.
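Before starting, it can be worth confirming that the required tools are on your PATH. The snippet below is only a convenience sketch, not part of the repository; it checks for conda (which provides the workflow's environment) and GNU make (which drives each stage).

```shell
#!/usr/bin/env bash
# Sketch: verify that the tools the post-pipeline stages rely on are
# installed.  'conda' provides the workflow environment and 'make' runs
# each stage's Makefile.
for tool in conda make; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING -- see the Installation instructions"
    fi
done
```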

The DADA2 nifH pipeline outputs for all studies are provided in the Data directory. You do not need to run the pipeline. However, if you wish to run the pipeline, the parameters files used for each study are included in Data. You are free to modify them.

Each of the post-pipeline stages 1 through 6 can be run -- in order -- by entering the associated directory and running "make" from your shell's command line. For example, to run the GatherAsvs stage I would do the following:

(base) [jmagasin@thalassa]$ conda activate nifH_ASV_workflow
(nifH_ASV_workflow) [jmagasin@thalassa]$ cd GatherAsvs
(nifH_ASV_workflow) [jmagasin@thalassa]$ make &> log.18July2023.txt &

Here I am using a Bash shell (recommended). First I activate the nifH_ASV_workflow environment, a critical step that ensures that all tools and packages needed by the workflow are available. Note how activation changes the prompt to begin with "(nifH_ASV_workflow)" on line two. On the third line, I make the GatherAsvs stage and save the Makefile messages to a log file. Most stages take at least a few minutes to complete, so I run them in the background (the trailing &).
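If you want to run several stages back to back, a small wrapper script can save repetition. The following is only a sketch: every directory name except GatherAsvs is a placeholder (substitute the actual stage 1 through 6 directories from the repository), and it assumes the nifH_ASV_workflow environment has already been activated.

```shell
#!/usr/bin/env bash
# Sketch: run the post-pipeline stages in order, logging each stage to its
# own file and stopping at the first failure.  All names except GatherAsvs
# are placeholders -- replace them with the actual stage directories.
set -e

STAGES="GatherAsvs Stage2 Stage3 Stage4 Stage5 Stage6"

for stage in $STAGES; do
    if [ ! -d "$stage" ]; then
        echo "Skipping $stage (directory not found)"
        continue
    fi
    echo "=== $stage ==="
    # Run make inside the stage directory; the subshell keeps us in the
    # top-level directory, and 'set -e' aborts if make fails.
    ( cd "$stage" && make > "log.$(date +%d%b%Y).txt" 2>&1 )
done
```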

Please see documentation at the top of each Makefile for an overview of the stage.


Copyright (C) 2023 Michael B. Morando and Jonathan D. Magasin