pirovc
diff --git a/README.md b/README.md
Lines changed: 6 additions & 143 deletions
diff --git a/config/default.yaml b/config/default.yaml
Lines changed: 11 additions & 5 deletions
diff --git a/docs/config.md b/docs/config.md
Lines changed: 138 additions & 0 deletions
@@ -1,154 +1,17 @@
-# GRIMER
-
 ![GRIMER](grimer/img/logo.png)
 
-GRIMER performs analysis of microbiome data and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata with focus on contamination detection. More information about the method can be found in the [pre-print](https://doi.org/10.1101/2021.06.22.449360)
-
-## Examples
-
-Online examples of reports generated with GRIMER: https://pirovc.github.io/grimer-reports/
-
-## Installation
-
-Via conda
-
-```bash
-conda install -c bioconda -c conda-forge grimer
-```
-
-or locally installing only dependencies via conda:
-
-```bash
-git clone https://github.com/pirovc/grimer.git
-cd grimer
-conda env create -f env.yaml # or mamba env create -f env.yaml
-conda activate grimer # or source activate grimer
-python setup.py install --record files.txt # Uninstall: xargs rm -rf < files.txt
-grimer -h
-```
-
-## Usage
-
-### Tab-separated input table
-```bash
-grimer -i input_table.tsv
-```
-
-### BIOM file
-```bash
-grimer -i myfile.biom
-```
-
-### Tab-separated input table with taxonomic annotated observations (e.g. sk__Bacteria;k__;p__Actinobacteria;c__Actinobacteria...)
-```bash
-grimer -i input_table.tsv -f ";"
-```
-
-### Tab-separated input table with metadata
-```bash
-grimer -i input_table.tsv -m metadata.tsv
-```
+GRIMER performs analysis of microbiome studies and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata, with a focus on contamination detection.
 
-### With taxonomy integration (ncbi)
-```bash
-grimer -i input_table.tsv -m metadata.tsv -t ncbi #optional -b taxdump.tar.gz
-```
+- [Installation, user manual](https://pirovc.github.io/grimer/)
+- [Live examples](https://pirovc.github.io/grimer/examples/)
+- [Pre-print](https://doi.org/10.1101/2021.06.22.449360)
 
-### With configuration file to setup external tools, references and annotations
-```bash
-grimer -i input_table.tsv -m metadata.tsv -t ncbi -c config/default.yaml -d -g
-```
 
-### Analyzing any MGnify public study
-
-```bash
-./grimer-mgnify.py -i MGYS00006024 -o output_folder/
-```
-
-## Parameters
-
-	grimer
-
-	optional arguments:
-	  -h, --help            show this help message and exit
-	  -v, --version         show program's version number and exit
-
-	required arguments:
-	  -i INPUT_FILE, --input-file INPUT_FILE
-	                        Main input table with counts (Observation table, Count table, Contingency Tables, ...) or .biom file. By default rows contain observations and columns contain
-	                        samples (use --tranpose if your file is reversed). First column and first row are used as headers.
-
-	main arguments:
-	  -m METADATA_FILE, --metadata-file METADATA_FILE
-	                        Input metadata file in simple tabular format with samples in rows and metadata fields in columns. QIIME 2 metadata format is also accepted, with an extra row to
-	                        define categorical and numerical fields. If not provided and --input-file is a .biom files, will attempt to get metadata from it.
-	  -t {ncbi,gtdb,silva,greengenes,ott}, --taxonomy {ncbi,gtdb,silva,greengenes,ott}
-	                        Define taxonomy to convert entry and annotate samples. Will automatically download and parse or files can be provided with --tax-files.
-	  -b [TAX_FILES ...], --tax-files [TAX_FILES ...]
-	                        Optional specific taxonomy files to use.
-	  -r [RANKS ...], --ranks [RANKS ...]
-	                        Taxonomic ranks to generate visualizations. Use 'default' to use entries from the table directly. Default: default
-	  -c CONFIG, --config CONFIG
-	                        Configuration file with definitions of references, controls and external tools.
-
-	output arguments:
-	  -g, --mgnify          Plot MGnify chart
-	  -d, --decontam        Run and plot DECONTAM
-	  -l TITLE, --title TITLE
-	                        Title to display on the header of the report.
-	  -p [{overview,samples,heatmap,correlation} ...], --output-plots [{overview,samples,heatmap,correlation} ...]
-	                        Plots to generate. Default: overview,samples,heatmap,correlation
-	  -o OUTPUT_HTML, --output-html OUTPUT_HTML
-	                        File to output report. Default: output.html
-	  --full-offline        Embed javascript library in the output file. File will be around 1.5MB bigger but also work without internet connection. That way your report will live forever.
-
-	general data options:
-	  -f LEVEL_SEPARATOR, --level-separator LEVEL_SEPARATOR
-	                        If provided, consider --input-table to be a hierarchical multi-level table where the observations headers are separated by the indicated separator characther
-	                        (usually ';' or '|')
-	  -y VALUES, --values VALUES
-	                        Force 'count' or 'normalized' data parsing. Empty to auto-detect.
-	  -w, --cumm-levels     Activate if input table has already cummulative values among levels.
-	  -s, --transpose       Transpose --input-table (if samples are listed on columns and observations on rows)
-	  -u [UNASSIGNED_HEADER ...], --unassigned-header [UNASSIGNED_HEADER ...]
-	                        Define one or more header names containing unsassinged/unclassified counts.
-	  --obs-replace [OBS_REPLACE ...]
-	                        Replace values on table observations labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching
-	                        regex.
-	  --sample-replace [SAMPLE_REPLACE ...]
-	                        Replace values on table sample labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.
-	  -z REPLACE_ZEROS, --replace-zeros REPLACE_ZEROS
-	                        INT (add 'smallest count'/INT to every raw count), FLOAT (add FLOAT to every raw count). Default: 1000
-	  --min-frequency MIN_FREQUENCY
-	                        Define minimum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
-	  --max-frequency MAX_FREQUENCY
-	                        Define maximum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
-	  --min-count MIN_COUNT
-	                        Define minimum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
-	  --max-count MAX_COUNT
-	                        Define maximum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
-
-	Samples options:
-	  -j TOP_OBS_BARS, --top-obs-bars TOP_OBS_BARS
-	                        Top abundant observations to show in the bars.
-
-	Heatmap and clustering options:
-	  -a TRANSFORMATION, --transformation TRANSFORMATION
-	                        none (counts), norm (percentage), log (log10), clr (centre log ratio). Default: log
-	  -e METADATA_COLS, --metadata-cols METADATA_COLS
-	                        How many metadata cols to show on the heatmap. Higher values makes plot slower to navigate.
-	  --optimal-ordering    Activate optimal_ordering on linkage, takes longer for large number of samples.
-	  --show-zeros          Do not skip zeros on heatmap. File will be bigger and iteraction with heatmap slower.
-	  --linkage-methods [{single,complete,average,centroid,median,ward,weighted} ...]
-	  --linkage-metrics [{braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,minkowski,rogerstanimoto,russellrao,seuclidean,sokalmichener,sokalsneath,sqeuclidean,wminkowski,yule} ...]
-	  --skip-dendrogram     Disable dendogram. Will create smaller files.
-
-	Correlation options:
-	  -x TOP_OBS_CORR, --top-obs-corr TOP_OBS_CORR
-	                        Top abundant observations to build the correlationn matrix, based on the avg. percentage counts/sample. 0 for all
+![recording-(2)5](https://user-images.githubusercontent.com/4673375/211857099-c9492232-c5f8-444e-aa68-70d6db8c82b4.gif)
 
 ## Powered by
 
+
 [<img src="https://static.bokeh.org/branding/logos/bokeh-logo.png" height="60">](https://bokeh.org)
 [<img src="https://pandas.pydata.org/static/img/pandas.svg" height="40">](https://pandas.org)
 [<img src="https://raw.githubusercontent.com/scipy/scipy/master/doc/source/_static/logo.svg" height="40">](https://scipy.org)
 
@@ -2,9 +2,12 @@ references:
   "Contaminants": "files/contaminants.yml"
   "Human-related": "files/human-related.yml" 
 
-#controls:
-  # "Positve Controls": "path/file1.tsv"
-  # "Negative Controls": "path/file1.tsv"
+# controls:
+#   "Negative Controls": "path/file1.tsv"
+#   "Positive Controls":
+#     "Metadata_Field": 
+#       - "Metadata_Value1"
+#       - "Metadata_Value2"
 
 external:
   mgnify: "files/mgnify5989.tsv"
@@ -19,6 +22,9 @@ external:
     #  - "path/file1.txt"
     #  - "path/file2.txt"
     # prevalence_metadata: 
-    #  "Field1": "ValueA"
-    #  "Field2": "ValueB"
+    #  "Field1": 
+    #    - "ValueA"
+    #    - "ValueB"
+    #  "Field2":
+    #    - "ValueC"
 
@@ -0,0 +1,138 @@
+# Configuration file
+
+GRIMER uses a configuration file to set reference sources of annotation (e.g. contaminants), controls and external tools (decontam, mgnify). The configuration can be provided with the argument `-c/--config` and it should be in the [YAML](https://yaml.org/){ target="_blank" } format.
+
+A basic example of a configuration file:
+
+```yaml
+references:
+  "Contaminants": "files/contaminants.yml"
+  "Human-related": "files/human-related.yml" 
+
+controls:
+  "Negative Controls": "path/file1.tsv"
+  "Positive Controls":
+    "Metadata_Field": 
+      - "Metadata_Value1"
+      - "Metadata_Value2"
+
+external:
+  mgnify: "files/mgnify5989.tsv"
+  decontam:
+    threshold: 0.1
+    method: "frequency"
+```
+
+## references
+
+References can be provided as an external `.yml/.yaml` file in a specific format (see below) or as a text file with one taxonomic identifier or taxonomic name per line.
+
+```yaml
+"General Description":
+  "Specific description":
+    url: "www.website.com?id={}" 
+    ids: [1,2,3]
+```
+
+A real example of saliva organisms extracted from BacDive (NCBI taxonomic ids):
+
+```yaml
+"Human-related bacterial isolates from BacDive":
+  "Saliva":
+    url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+    ids: [152331, 113107, 157688, 979627, 45634, 60133, 157687, 1624, 1583331, 1632, 249188]
+```
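+
+The plain-text alternative mentioned above could be wired in as sketched below. This assumes text files are referenced with the same `label: path` syntax as `.yml` files; the label `"Custom contaminants"` and the path `path/custom_contaminants.txt` are hypothetical names chosen for illustration. The file itself would list one taxonomic identifier or taxonomic name per line.
+
+```yaml
+references:
+  # Hypothetical label and path (not shipped with GRIMER)
+  "Custom contaminants": "path/custom_contaminants.txt"
+```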
+
+Common contaminants compiled from the literature and human-related possible sources of contamination are available in the [GRIMER repository](https://github.com/pirovc/grimer/tree/main/files){ target="_blank" }. For more information, please refer to the [pre-print](https://doi.org/10.1101/2021.06.22.449360){ target="_blank" }. If the target study overlaps with some of those annotations (e.g. a study of human skin), the related entries can be removed from the provided files to avoid generating redundant annotations.
+
+## controls
+
+Several control groups can be provided to annotate samples. They can be provided as a file with one sample identifier per line:
+
+```yaml
+controls:
+  "Controls": "controls.txt"
+```
+
+or directly from the metadata (`-m/--metadata-file`) as a field and value(s) information:
+
+```yaml
+controls:
+  "Other Controls": 
+    "sample_type": #  field
+      - "blank"    #  value
+      - "control"  #  value
+```
+
+Both methods can be combined into one configuration file.
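+
+A minimal sketch of such a combined configuration, reusing the two groups shown above (the group labels, file name and metadata values are the illustrative ones from those examples):
+
+```yaml
+controls:
+  # File-based group: one sample identifier per line
+  "Controls": "controls.txt"
+  # Metadata-based group: samples whose "sample_type" is "blank" or "control"
+  "Other Controls":
+    "sample_type":
+      - "blank"
+      - "control"
+```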
+
+## external
+
+Set the configuration and functionality of external tools executed by GRIMER.
+
+### mgnify
+
+GRIMER uses a parsed MGnify database to annotate observations and link them to the respective MGnify repository, reporting most common biome occurrences. Instructions on how to re-generate the parsed database from MGnify can be found [here](https://github.com/pirovc/grimer/tree/main/files#mgnify){ target="_blank" }.
+
+A [pre-parsed database](https://raw.githubusercontent.com/pirovc/grimer/main/files/mgnify5989.tsv){ target="_blank" } is available in the GRIMER repository (generated on 2022-03-09). To use it, set the file in the configuration as follows and activate it with `-g/--mgnify` when running GRIMER.
+
+```yaml
+external:
+  mgnify: "files/mgnify5989.tsv"
+```
+
+### decontam
+
+GRIMER can run [DECONTAM](https://benjjneb.github.io/decontam/){ target="_blank" } with `-d/--decontam`, but some configuration is necessary. It is possible to set the threshold (P* hyperparameter) and the method (frequency, prevalence, combined).
+
+For the frequency/combined method, DNA frequencies for each sample have to be provided either in a `.tsv` file (sample identifier <tab> frequency) or as a metadata field. If none is provided, the sum of all counts in the input table is used for the frequency calculation.
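+
+As a sketch of the metadata variant, the frequencies could be read from a metadata column instead of a file. The field name `dna_concentration` is a hypothetical example and has to match a column in the file given with `-m/--metadata-file`:
+
+```yaml
+external:
+  decontam:
+    threshold: 0.1
+    method: "frequency"
+    # Hypothetical metadata field holding the DNA frequency of each sample
+    frequency_metadata: "dna_concentration"
+```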
+
+For the prevalence/combined method, file(s) with a list of sample identifiers or a metadata field/value can be provided. If none is provided, all samples defined in the "controls" are considered for the prevalence calculation.
+
+Below is an example of how to set up the configuration file for DECONTAM:
+
+```yaml
+external:
+  decontam:
+    threshold: 0.1 # P* hyperparameter threshold, values between 0 and 1
+    method: "frequency" # Options: frequency, prevalence, combined
+    frequency_file: "path/file1.txt"
+    # frequency_metadata: "Field1"
+    # prevalence_file: 
+    #  - "path/file1.txt"
+    #  - "path/file2.txt"
+    prevalence_metadata: 
+      "Field1":
+        - "ValueA"
+        - "ValueB"
+      "Field2":
+        - "ValueC"
+```
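+
+The fallback described above for the prevalence method could look like the following sketch: no `prevalence_file` or `prevalence_metadata` is set, so the samples defined under `controls` would be used. The control label and path are the placeholder values from the basic example.
+
+```yaml
+controls:
+  "Negative Controls": "path/file1.tsv"
+
+external:
+  decontam:
+    threshold: 0.1
+    method: "prevalence"
+    # No prevalence_file/prevalence_metadata here: samples defined in
+    # "controls" are then considered for the prevalence calculation
+```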
+
+## Using the configuration file
+
+Example [UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom](https://microbiomedb.org/common/downloads/release-31/c66d2dc8473138e3a737ef2ad0b25f1e6e9c0f22/UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom){ target="_blank" } file from [microbiomedb.org](https://microbiomedb.org){ target="_blank" }
+
+config.yml (external .yml files are available in the [GRIMER repository](https://github.com/pirovc/grimer/tree/main/files){ target="_blank" })
+
+```yml
+references:
+  "Contaminants": "files/contaminants.yml"
+  "Human-related": "files/human-related.yml" 
+
+external:
+  mgnify: "files/mgnify5989.tsv"
+  decontam:
+    threshold: 0.1 # [0-1] P* hyperparameter
+    method: "frequency" # frequency, prevalence, combined
+```
+
+Running GRIMER with DECONTAM and MGnify integration
+
+```bash
+grimer --input-file UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom \
+       --config config.yml \
+       --decontam --mgnify \
+       --taxonomy ncbi \
+       --ranks superkingdom phylum class order family genus species
+```