A workflow for quantifying bacterial flagellins in human gut microbiome sequencing data, annotating their human TLR5 interaction phenotypes, and conducting statistical analysis and visualization.
git clone https://github.com/leylabmpi/FlaPro.git
cd ./FlaPro/snakemake/bin/
git submodule add https://github.com/leylabmpi/ll_pipeline_utils.git
git submodule update --remote --init --recursive
cd ../
If conda is used:
conda env create -f snakemake8_min.yaml
Otherwise, Singularity images are provided and can be pulled:
mkdir images
cd images
singularity pull library://aaaabogdanova/flapro/primary_env:latest
singularity pull library://aaaabogdanova/flapro/secondary_env:latest
cd ../
If needed (see the aligner options below), USEARCH has to be downloaded into the following directory:
cd bin/scripts
mkdir usearch
cd usearch
wget {usearch_download_link} # the download link can be found at https://www.drive5.com/usearch/download.html
gzip -d {usearch.gzip}
chmod +x usearch* # make the decompressed binary executable
An example of running FlaPro is provided as a bash script in the snakemake pipeline directory:
#conda based
./runLLHFP.sh #Important: user should modify the script in order to initialize conda correctly (see the instructions in the script)
#container based
#first, load the snakemake module if a module system is used, or enable snakemake in any other way
#then,
./runAppt.sh
User-provided data
- Metagenomic or Metatranscriptomic samples - reads in FASTQ or FASTA format (optionally compressed)
Reference data
- Taxonomic annotation of flagellins
- Functional annotation of flagellins
- Marker sequences from human gut microbiome-derived flagellins
The config.yaml file is organized into several main sections:
- Input data
- Output directory
- Workflow control settings
- Parameters
config.yaml - the default configuration file. Edit this file if you use the conda-based environment.
config_apptainer.yaml - configuration file with the container enabled. Edit this file if conda is unavailable.
Specify the path to a sample file containing your metagenomic or meta-transcriptomic sample-to-read-file mappings:
# format example file:
samples_file: datatest/input_MTG4_nano.txt
The sample sheet file should include:
- Sample ID
- Relative or absolute path to the forward reads (R1)
- Relative or absolute path to the reverse reads (R2), in case of paired-end sequencing
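As an illustration, a tab-delimited sample sheet for paired-end data could look like the sketch below (the sample and file names are hypothetical; check the bundled datatest/input_MTG4_nano.txt for the exact expected format and header):

```
sampleA	reads/sampleA_R1.fastq.gz	reads/sampleA_R2.fastq.gz
sampleB	reads/sampleB_R1.fastq.gz	reads/sampleB_R2.fastq.gz
```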
Define the root folder corresponding to the relative paths above:
read_file_path: None # when the paths are absolute
# read_file_path: /path/to/your/reads/ # when the paths are relative
Provide the destination directory for the primary analysis output, for example:
output_dir: out/test_ibd_MTG4_nano_test/
Specify the location for temporary files (ensure there is enough space in case of large datasets):
tmp_dir: tmp/ # Adjust based on your system's temp directory
Enable/disable major pipeline components:
run_pipeline_steps:
alpha_div: True #or False # Enable alpha diversity calculations
Configure the Snakemake workflow execution:
pipeline:
snakemake_folder: ./ # Path to Snakemake files
export_conda: True # Export conda environment
name: LLHFP # Pipeline name identifier
#just_read1: True #used when there is only R1 reads
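For single-end data, the commented just_read1 flag shown above can be enabled; the pipeline section would then look like this (values as in the defaults above):

```
pipeline:
  snakemake_folder: ./   # Path to Snakemake files
  export_conda: True     # Export conda environment
  name: LLHFP            # Pipeline name identifier
  just_read1: True       # only R1 read files are listed in the sample sheet
```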
params:
shortbred_quantify:
aligner: diamond # Options: diamond, usearch
# usearch_path: bin/scripts/usearch/usearch11.0.667_i86linux32 # uncomment, if using USEARCH
markers: ref/Curated_fla_markers_4_04-12-24.fasta # Flagellin marker database
pct_length: 0.3 # Minimum alignment length (30%)
Aligner options:
- `diamond`: faster, fewer false positives; recommended for large datasets
- `usearch`: more sensitive; the freely available version might not work with large datasets
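For example, to switch the quantification step to USEARCH, the relevant part of config.yaml might look as follows (the binary file name is taken from the commented example above and depends on the version you downloaded):

```
params:
  shortbred_quantify:
    aligner: usearch
    usearch_path: bin/scripts/usearch/usearch11.0.667_i86linux32
    markers: ref/Curated_fla_markers_4_04-12-24.fasta
    pct_length: 0.3
```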
merge_realcounts:
merge_script: snakemake/bin/scripts/merge_realcounts.R
See config.yaml and config_apptainer.yaml.
Example:
./real_counts/
  `SRR5935740.txt` - per-sample output with Family (Cluster), Hits, …
  `merged_realcounts.txt` - merged output for all the samples by real counts
  `psq.RData` - psq object with taxonomy and abundance table
./diversity/
  `alpha_div.txt` - calculated alpha diversity tables
After the primary analysis has finished successfully and yielded the annotated flagellin relative abundance tables, you can add your sample metadata and perform exploratory analysis using the secondary analysis code, provided in the form of R Jupyter notebooks (.ipynb files).
To set up the environment for the secondary analysis, you will need:
- Conda (https://docs.conda.io/en/latest/)
- Visual Studio Code (or an alternative integrated development environment that supports running R notebooks via a defined Conda environment)
Create a specific Conda environment using the YAML file provided in the envs/ folder:
conda env create -f r_433_nb.yaml
conda activate r_433_nb
Then start R and install the following packages, which are not available via Conda:
R
devtools::install_github("tpq/balance")
devtools::install_github("malucalle/selbal")
devtools::install_bitbucket("knomics/nearestbalance")
devtools::install_github("leylabmpi/LeyLabRMisc")
(These steps can also be run using the provided envs/...postBuild.sh script.)
Open the notebook in VS Code, select the R Jupyter kernel of the installed environment and run the notebook. Further information on how to generate your own notebooks easily synchronizable across multiple projects is provided in a separate readme file.
Note: while the main input files for the secondary analysis are generated during the primary analysis, you have to prepare additional files with the number of reads per sample (sample coverage), for example using the scripts/count_reads.sh script.
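The idea behind per-sample read counting can be sketched as follows (a minimal illustration only; the bundled scripts/count_reads.sh may work differently): a FASTQ record spans exactly four lines, so the read count is the line count divided by four.

```shell
# Create a toy FASTQ file with two reads (4 lines per record)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > sample_R1.fastq

# reads = lines / 4
echo "$(( $(wc -l < sample_R1.fastq) / 4 ))"   # prints: 2
```

For gzip-compressed reads, decompress on the fly first, e.g. `zcat sample_R1.fastq.gz | wc -l`.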