This repository contains the scripts for the analysis of the ISD Crete 2016. ISD Crete 2016 is a soil microbiome sampling from 72 distinct sites around Crete island, Greece.
During the 18th Genome Standards Consortium workshop the participants along with members from Hellenic Centre for Marine Research organised the Island Sampling Day Crete 2016.
The goal of this sampling was to put metagenomic standards into play.
For more info regarding the methodology of the sampling visit the website.
- Scripts
- Data retrieval
- Sequences
- Inference and Taxonomy
- Metadata
- Spatial data
- Analysis
- Software
- Hardware
- Citation
- Licence
The scripts of the analysis are in the scripts
folder and cover the following tasks:
- dada2 hpc job script
- dada2 analysis script
- pema hpc job script
output:
- spatial data script
- biodiversity script, UMAP ordination script and FAPROTAX command
./collapse_table.py -i ../results/faprotax_community_matrix.tsv -o ../results/faprotax_functional_table.tsv -g FAPROTAX.txt -d taxonomy --omit_columns 0 -r ../results/faprotax_report.txt -n columns_after_collapsing -v --collapse_by_metadata taxonomy --out_sub_tables_dir ../results/function_tables -v
- numerical ecology script
- figures script
The sequences of the ISD Crete 2016 are available in ENA project PRJEB21776
and were downloaded using the ENA API. The script is available here for the
raw data and here for the metadata.
The metadata are in xml
format and using this script we transform them to
tabular format.
From 4 samples DNA quality was low and there aren't any sequences nor metadata.
To find which samples don't appear in metadata run the following command:
gawk -F"\t" '($2 ~ /source material identifiers/){split($3,sample,"_"); a[sample[2]][sample[4]]++}END{for (i in a){for (j in a[i]){print i "\t" j "\t" a[i][j]}}}' ena_isd_2016_attributes.tsv
The samples DNA wasn't sequenced are:
- isd_7_site-3_loc_2
- isd_10_site_1_loc_1
- isd_10_site_1_loc_1
- isd_10_site_2_loc_2
In this script there are some oneliners to provide some basic information regarding the reads.
Total reads (forward and reverse) = 121232490 in 140 samples
Primers used are FWD: 5'-ACTCCTACGGGAGGCAGCAG-3' REV: 5'-GGACTACHVGGGTWTCTAAT-3'
Numbers of Ns in reads : there are many reads with Ns and the dada
function
doesn't accept so after the filtering they are all removed.
We used PEMA and DADA2 for the clustering of OTUs and ASVs, respectively.
PEMA incorporates state of the art tools for each step of the analysis.
The filtering step is slower than DADA2 because of Trimmomatic.
DADA2 for our dataset required a total of 1121 minutes (18 hours, 41 minutes) to run on a FAT node of the ZORBAS HPC - IMBBC - HCMR.
FAT node Specs : Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, 40 CPUs, 503gb RAM OS = Debian 4.19.146-1
To transform the matrix of the DADA2 output to a long format use the following command. This one-liner creates an asv id for each asv sequence, which is the header of the file. Use this if the DADA2 matrix/array object was saved as table instead of RDS. This creates a matrix that has N colunm names but N+1 columns.
gawk -F"\t" 'BEGIN{print "file" "\t" "asv_id" "\t" "abundance"}(NR==1){split($0,asv,"\t")}(NR>1){split($0, data, "\t"); for (i=2; i<=length(data); ++i){print data[1] "\t" "asv"i-1 "\t" data[i]}}' seqtab_nochim.tsv > seqtab_nochim_long.tsv
A companion one-liner that contains the sequences of asvs and their created ids.
gawk -F"\t" 'BEGIN{print "asv_id" "\t" "asv"}(NR==1){split($0,asv,"\t"); for (i in asv){print "asv"i "\t" asv[i]}}' seqtab_nochim.tsv > asv_fasta_ids.tsv
Transform to fasta format
gawk -F"\t" '(NR>1){gsub(/"/,"",$2); print ">"$1 "\n" $2}' asv_fasta_ids.tsv > asv_fasta_ids.fasta
Each xml
file, i.e. each sample, has a total of 43 attributes which are
source material identifiers
organism
total nitrogen method
ENA-SUBMISSION
geographic location (region and locality)
environment (biome)
total organic C method
ENA-SUBMITTED-FILES
ENA-EXPERIMENT
ENA-FASTQ-FILES
target subfragment
common name
total organic carbon
DNA concentration
vegetation zone
total nitrogen
soil environmental package
environment (material)
amount or size of sample collected
geographic location (depth)
environment (feature)
ENA-LAST-UPDATE
sample collection device or method
place name
ENA-FIRST-PUBLIC
project name
collection date
target gene
ENA-CHECKLIST
geographic location (country and/or sea)
ENA-STUDY
pcr primers
sequencing method
water content method
storage conditions (fresh/frozen/other)
water content
investigation type
geographic location (elevation)
sample volume or weight for DNA extraction
geographic location (latitude)
geographic location (longitude)
current land use (emp 500 soil)
ENA-RUN
Spatial data from Copernicus system are also used. These are the CORINE Land Cover and the Digital Elevation Models. CORINE Land Cover data are used to identify human pressures on the Natura2000 regions and on the hotspots of the arthropod endemic taxa.
The protected areas of Natura2000 SCI (habitats directive) and Wildlife Refuges are used in this analysis.
In the IMBBC HPC facility we created conda
environments for reproducibility and
flexibility of the tools installations.
PEMA DADA2 vegan tidyverse
ElementTree
GAWK R Python Conda/Bioconda
Most computations were performed in the Zorbas HPC facility of IMBBC-HCMR, see here for more info.
GNU GPLv3 license (for 3rd party scripts separate licenses apply).