Commit eac3cd6

GRIMER version 1.1.0 (#5)
* bump v1.1.0, fix bugs, new manuals
* fix header int to str
* docs
* export for tables, more docs
* control as metadata field-value, export all cols obstable
* readme
* readme
* docs on root
* multiple fields metadata prevalance decontam
* images manual
* first user manual version
* Update README.md
* docs, add export to help button
* revised docs, new logo and params
1 parent 163a949 commit eac3cd6

38 files changed: 1017 additions & 205 deletions

README.md

Lines changed: 6 additions & 143 deletions
@@ -1,154 +1,17 @@
1-
# GRIMER
2-
31
![GRIMER](grimer/img/logo.png)
42

5-
GRIMER performs analysis of microbiome data and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata with focus on contamination detection. More information about the method can be found in the [pre-print](https://doi.org/10.1101/2021.06.22.449360)
6-
7-
## Examples
8-
9-
Online examples of reports generated with GRIMER: https://pirovc.github.io/grimer-reports/
10-
11-
## Installation
12-
13-
Via conda
14-
15-
```bash
16-
conda install -c bioconda -c conda-forge grimer
17-
```
18-
19-
or locally installing only dependencies via conda:
20-
21-
```bash
22-
git clone https://github.com/pirovc/grimer.git
23-
cd grimer
24-
conda env create -f env.yaml # or mamba env create -f env.yaml
25-
conda activate grimer # or source activate grimer
26-
python setup.py install --record files.txt # Uninstall: xargs rm -rf < files.txt
27-
grimer -h
28-
```
29-
30-
## Usage
31-
32-
### Tab-separated input table
33-
```bash
34-
grimer -i input_table.tsv
35-
```
36-
37-
### BIOM file
38-
```bash
39-
grimer -i myfile.biom
40-
```
41-
42-
### Tab-separated input table with taxonomic annotated observations (e.g. sk__Bacteria;k__;p__Actinobacteria;c__Actinobacteria...)
43-
```bash
44-
grimer -i input_table.tsv -f ";"
45-
```
46-
47-
### Tab-separated input table with metadata
48-
```bash
49-
grimer -i input_table.tsv -m metadata.tsv
50-
```
3+
GRIMER performs analysis of microbiome studies and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata, with a focus on contamination detection.
514

52-
### With taxonomy integration (ncbi)
53-
```bash
54-
grimer -i input_table.tsv -m metadata.tsv -t ncbi #optional -b taxdump.tar.gz
55-
```
5+
- [Installation, user manual](https://pirovc.github.io/grimer/)
6+
- [Live examples](https://pirovc.github.io/grimer/examples/)
7+
- [Pre-print](https://doi.org/10.1101/2021.06.22.449360)
568

57-
### With configuration file to setup external tools, references and annotations
58-
```bash
59-
grimer -i input_table.tsv -m metadata.tsv -t ncbi -c config/default.yaml -d -g
60-
```
619

62-
### Analyzing any MGnify public study
63-
64-
```bash
65-
./grimer-mgnify.py -i MGYS00006024 -o output_folder/
66-
```
67-
68-
## Parameters
69-
70-
grimer
71-
72-
optional arguments:
73-
-h, --help show this help message and exit
74-
-v, --version show program's version number and exit
75-
76-
required arguments:
77-
-i INPUT_FILE, --input-file INPUT_FILE
78-
Main input table with counts (Observation table, Count table, Contingency Tables, ...) or .biom file. By default rows contain observations and columns contain
79-
samples (use --transpose if your file is reversed). First column and first row are used as headers.
80-
81-
main arguments:
82-
-m METADATA_FILE, --metadata-file METADATA_FILE
83-
Input metadata file in simple tabular format with samples in rows and metadata fields in columns. QIIME 2 metadata format is also accepted, with an extra row to
84-
define categorical and numerical fields. If not provided and --input-file is a .biom file, will attempt to get metadata from it.
85-
-t {ncbi,gtdb,silva,greengenes,ott}, --taxonomy {ncbi,gtdb,silva,greengenes,ott}
86-
Define taxonomy to convert entry and annotate samples. Will automatically download and parse or files can be provided with --tax-files.
87-
-b [TAX_FILES ...], --tax-files [TAX_FILES ...]
88-
Optional specific taxonomy files to use.
89-
-r [RANKS ...], --ranks [RANKS ...]
90-
Taxonomic ranks to generate visualizations. Use 'default' to use entries from the table directly. Default: default
91-
-c CONFIG, --config CONFIG
92-
Configuration file with definitions of references, controls and external tools.
93-
94-
output arguments:
95-
-g, --mgnify Plot MGnify chart
96-
-d, --decontam Run and plot DECONTAM
97-
-l TITLE, --title TITLE
98-
Title to display on the header of the report.
99-
-p [{overview,samples,heatmap,correlation} ...], --output-plots [{overview,samples,heatmap,correlation} ...]
100-
Plots to generate. Default: overview,samples,heatmap,correlation
101-
-o OUTPUT_HTML, --output-html OUTPUT_HTML
102-
File to output report. Default: output.html
103-
--full-offline Embed javascript library in the output file. File will be around 1.5MB bigger but also work without internet connection. That way your report will live forever.
104-
105-
general data options:
106-
-f LEVEL_SEPARATOR, --level-separator LEVEL_SEPARATOR
107-
If provided, consider --input-file to be a hierarchical multi-level table where the observations headers are separated by the indicated separator character
108-
(usually ';' or '|')
109-
-y VALUES, --values VALUES
110-
Force 'count' or 'normalized' data parsing. Empty to auto-detect.
111-
-w, --cumm-levels Activate if input table already has cumulative values among levels.
112-
-s, --transpose Transpose --input-table (if samples are listed on columns and observations on rows)
113-
-u [UNASSIGNED_HEADER ...], --unassigned-header [UNASSIGNED_HEADER ...]
114-
Define one or more header names containing unassigned/unclassified counts.
115-
--obs-replace [OBS_REPLACE ...]
116-
Replace values on table observations labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching
117-
regex.
118-
--sample-replace [SAMPLE_REPLACE ...]
119-
Replace values on table sample labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.
120-
-z REPLACE_ZEROS, --replace-zeros REPLACE_ZEROS
121-
INT (add 'smallest count'/INT to every raw count), FLOAT (add FLOAT to every raw count). Default: 1000
122-
--min-frequency MIN_FREQUENCY
123-
Define minimum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
124-
--max-frequency MAX_FREQUENCY
125-
Define maximum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
126-
--min-count MIN_COUNT
127-
Define minimum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
128-
--max-count MAX_COUNT
129-
Define maximum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
130-
131-
Samples options:
132-
-j TOP_OBS_BARS, --top-obs-bars TOP_OBS_BARS
133-
Top abundant observations to show in the bars.
134-
135-
Heatmap and clustering options:
136-
-a TRANSFORMATION, --transformation TRANSFORMATION
137-
none (counts), norm (percentage), log (log10), clr (centre log ratio). Default: log
138-
-e METADATA_COLS, --metadata-cols METADATA_COLS
139-
How many metadata cols to show on the heatmap. Higher values make the plot slower to navigate.
140-
--optimal-ordering Activate optimal_ordering on linkage, takes longer for large number of samples.
141-
--show-zeros Do not skip zeros on heatmap. File will be bigger and interaction with heatmap slower.
142-
--linkage-methods [{single,complete,average,centroid,median,ward,weighted} ...]
143-
--linkage-metrics [{braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,minkowski,rogerstanimoto,russellrao,seuclidean,sokalmichener,sokalsneath,sqeuclidean,wminkowski,yule} ...]
144-
--skip-dendrogram Disable dendrogram. Will create smaller files.
145-
146-
Correlation options:
147-
-x TOP_OBS_CORR, --top-obs-corr TOP_OBS_CORR
148-
Top abundant observations to build the correlation matrix, based on the avg. percentage counts/sample. 0 for all
10+
![recording-(2)5](https://user-images.githubusercontent.com/4673375/211857099-c9492232-c5f8-444e-aa68-70d6db8c82b4.gif)
14911

15012
## Powered by
15113

14+
15215
[<img src="https://static.bokeh.org/branding/logos/bokeh-logo.png" height="60">](https://bokeh.org)
15316
[<img src="https://pandas.pydata.org/static/img/pandas.svg" height="40">](https://pandas.org)
15417
[<img src="https://raw.githubusercontent.com/scipy/scipy/master/doc/source/_static/logo.svg" height="40">](https://scipy.org)

config/default.yaml

Lines changed: 11 additions & 5 deletions
@@ -2,9 +2,12 @@ references:
22
"Contaminants": "files/contaminants.yml"
33
"Human-related": "files/human-related.yml"
44

5-
#controls:
6-
# "Positve Controls": "path/file1.tsv"
7-
# "Negative Controls": "path/file1.tsv"
5+
# controls:
6+
# "Negative Controls": "path/file1.tsv"
7+
# "Positve Controls":
8+
# "Metadata_Field":
9+
# - "Metadata_Value1"
10+
# - "Metadata_Value2"
811

912
external:
1013
mgnify: "files/mgnify5989.tsv"
@@ -19,6 +22,9 @@ external:
1922
# - "path/file1.txt"
2023
# - "path/file2.txt"
2124
# prevalence_metadata:
22-
# "Field1": "ValueA"
23-
# "Field2": "ValueB"
25+
# "Field1":
26+
# - "ValueA"
27+
# - "ValueB"
28+
# "Field2":
29+
# - "ValueC"
2430

docs/config.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
# Configuration file
2+
3+
GRIMER uses a configuration file to set reference sources of annotation (e.g. contaminants), controls and external tools (decontam, mgnify). The configuration can be provided with the argument `-c/--config` and it should be in the [YAML](https://yaml.org/){ target="_blank" } format.
4+
5+
A basic example of a configuration file:
6+
```yaml
references:
  "Contaminants": "files/contaminants.yml"
  "Human-related": "files/human-related.yml"

controls:
  "Negative Controls": "path/file1.tsv"
  "Positive Controls":
    "Metadata_Field":
      - "Metadata_Value1"
      - "Metadata_Value2"

external:
  mgnify: "files/mgnify5989.tsv"
  decontam:
    threshold: 0.1
    method: "frequency"
```
25+
26+
## references
27+
28+
References can be provided as an external `.yml/.yaml` file in a specific format (see below) or as a text file with one taxonomic identifier or taxonomic name per line.
29+
```yaml
"General Description":
  "Specific description":
    url: "www.website.com?id={}"
    ids: [1,2,3]
```
36+
37+
A real example of saliva organisms extracted from BacDive (NCBI taxonomic ids):
38+
```yaml
"Human-related bacterial isolates from BacDive":
  "Saliva":
    url: "https://bacdive.dsmz.de/search?search=taxid:{}"
    ids: [152331, 113107, 157688, 979627, 45634, 60133, 157687, 1624, 1583331, 1632, 249188]
```
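
To make the reference format more concrete, the sketch below walks a structure shaped like the example above and builds one link per identifier. It is illustrative only and not GRIMER's own code: the function name `reference_links` is hypothetical, the structure is written as a plain Python dict instead of being parsed from YAML, and it assumes the `{}` in `url` is a placeholder for the identifier.

```python
# Illustrative only: mirrors the reference structure as a plain dict
# (a shortened copy of the BacDive saliva example).
references = {
    "Human-related bacterial isolates from BacDive": {
        "Saliva": {
            "url": "https://bacdive.dsmz.de/search?search=taxid:{}",
            "ids": [152331, 113107, 157688],
        }
    }
}


def reference_links(refs):
    """Yield (general, specific, taxid, link) for every id in the structure.

    Assumption: the `{}` in `url` is a placeholder for the identifier.
    """
    for general, specifics in refs.items():
        for specific, entry in specifics.items():
            for taxid in entry["ids"]:
                yield general, specific, taxid, entry["url"].format(taxid)


links = list(reference_links(references))
for general, specific, taxid, link in links:
    print(f"{specific}\t{taxid}\t{link}")
```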
45+
46+
Common contaminants compiled from the literature and human-related possible sources of contamination are available in the [GRIMER repository](https://github.com/pirovc/grimer/tree/main/files){ target="_blank" }. For more information, please refer to the [pre-print](https://doi.org/10.1101/2021.06.22.449360){ target="_blank" }. If the target study overlaps with some of those annotations (e.g. a study of human skin), related entries can be removed from the provided files to avoid generating redundant annotations.
47+
48+
## controls
49+
50+
Several control groups can be provided to annotate samples. They can be provided as a file with one sample identifier per line:
51+
```yaml
controls:
  "Controls": "controls.txt"
```
56+
57+
or directly from the metadata (`-m/--metadata-file`) as a field and value(s) information:
58+
```yaml
controls:
  "Other Controls":
    "sample_type": # field
      - "blank" # value
      - "control" # value
```
66+
67+
Both methods can be combined into one configuration file.
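
As a rough illustration of how the two kinds of control definitions could be resolved into sample identifiers, here is a minimal sketch. It is not taken from GRIMER: `resolve_controls` and `read_ids_from_file` are hypothetical names, the metadata is assumed to be an in-memory dict of sample id to field values, and a string value is assumed to mean "path to a file with one sample identifier per line".

```python
def read_ids_from_file(path):
    """Read one sample identifier per line, skipping empty lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def resolve_controls(controls, metadata, read_ids=read_ids_from_file):
    """Map each control group name to a set of sample identifiers.

    controls: dict shaped like the `controls` section of the configuration.
    metadata: dict of sample id -> {field: value} (hypothetical layout).
    read_ids: callable returning sample ids for a file path.
    """
    groups = {}
    for name, source in controls.items():
        if isinstance(source, str):
            # File-based group: one sample identifier per line.
            groups[name] = set(read_ids(source))
        else:
            # Metadata-based group: {field: [value, ...]}.
            groups[name] = {
                sample
                for sample, fields in metadata.items()
                for field, values in source.items()
                if fields.get(field) in values
            }
    return groups


controls = {
    "Controls": "controls.txt",
    "Other Controls": {"sample_type": ["blank", "control"]},
}
metadata = {
    "S1": {"sample_type": "blank"},
    "S2": {"sample_type": "stool"},
    "S3": {"sample_type": "control"},
}
# Stand-in for files on disk so the example runs without any input files.
fake_files = {"controls.txt": ["S9", "S10"]}
groups = resolve_controls(controls, metadata, read_ids=fake_files.__getitem__)
print(groups)
```

Passing the file reader as a parameter keeps the sketch runnable without files on disk; with real files, the default reader would be used instead.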
68+
69+
## external
70+
71+
Set the configuration and functionality of external tools executed by GRIMER.
72+
73+
### mgnify
74+
75+
GRIMER uses a parsed MGnify database to annotate observations and link them to the respective MGnify repository, reporting most common biome occurrences. Instructions on how to re-generate the parsed database from MGnify can be found [here](https://github.com/pirovc/grimer/tree/main/files#mgnify){ target="_blank" }.
76+
77+
A [pre-parsed database](https://raw.githubusercontent.com/pirovc/grimer/main/files/mgnify5989.tsv){ target="_blank" } is available in the GRIMER repository (generated on 2022-03-09). To use it, set the file in the configuration as follows and activate it with `-g/--mgnify` when running GRIMER.
78+
```yaml
external:
  mgnify: "files/mgnify5989.tsv"
```
83+
84+
### decontam
85+
86+
GRIMER can run [DECONTAM](https://benjjneb.github.io/decontam/){ target="_blank" } with `-d/--decontam`, but some configuration is necessary. It is possible to set the threshold (P* hyperparameter) and the method (frequency, prevalence, combined).
87+
88+
For the frequency/combined method, DNA frequencies for each sample have to be provided either in a `.tsv` file (sample identifier <tab> frequency) or as a metadata field. If none is provided, the sum of all counts in the input table is used for the frequency calculation.
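
The fallback described above can be sketched as follows. This is an assumption-laden illustration, not GRIMER's implementation: `sample_frequencies` is a hypothetical helper, the count table is assumed to be a dict of sample id to observation counts, and the fallback is read here as the total count per sample.

```python
def sample_frequencies(counts, provided=None):
    """Return one frequency value per sample.

    counts: dict of sample id -> {observation: count} (hypothetical layout).
    provided: optional dict of sample id -> DNA frequency, e.g. read from
              the two-column .tsv file or from a metadata field.
    """
    if provided:
        return {sample: float(provided[sample]) for sample in counts}
    # Fallback: total counts per sample stand in for the DNA frequency.
    return {sample: float(sum(obs.values())) for sample, obs in counts.items()}


counts = {
    "S1": {"Escherichia": 10, "Bacillus": 5},
    "S2": {"Escherichia": 3, "Bacillus": 0},
}
fallback = sample_frequencies(counts)
explicit = sample_frequencies(counts, provided={"S1": 2.5, "S2": 0.7})
print(fallback)
print(explicit)
```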
89+
90+
For the prevalence/combined method, file(s) with a list of sample identifiers or a metadata field/value can be provided. If none is provided, all samples defined in the "controls" are considered for the prevalence calculation.
91+
92+
Below is an example of how to set up the configuration file for DECONTAM:
93+
```yaml
external:
  decontam:
    threshold: 0.1 # P* hyperparameter threshold, values between 0 and 1
    method: "frequency" # Options: frequency, prevalence, combined
    frequency_file: "path/file1.txt"
    # frequency_metadata: "Field1"
    # prevalence_file:
    #   - "path/file1.txt"
    #   - "path/file2.txt"
    prevalence_metadata:
      "Field1":
        - "ValueA"
        - "ValueB"
      "Field2":
        - "ValueC"
```
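
Before running GRIMER, it can help to sanity-check the `decontam` section against the constraints stated in the comments (threshold between 0 and 1, method one of three options). The sketch below does only that. `check_decontam_config` is a hypothetical helper, not part of GRIMER, and it takes the already-parsed section as a plain dict.

```python
VALID_METHODS = {"frequency", "prevalence", "combined"}


def check_decontam_config(cfg):
    """Return a list of problems found in a `decontam` configuration dict."""
    problems = []
    threshold = cfg.get("threshold")
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        problems.append("threshold must be a number between 0 and 1")
    if cfg.get("method") not in VALID_METHODS:
        problems.append(
            "method must be one of: " + ", ".join(sorted(VALID_METHODS))
        )
    return problems


ok = check_decontam_config({"threshold": 0.1, "method": "frequency"})
bad = check_decontam_config({"threshold": 2, "method": "both"})
print(ok)
print(bad)
```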
111+
112+
## Using the configuration file
113+
114+
Example [UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom](https://microbiomedb.org/common/downloads/release-31/c66d2dc8473138e3a737ef2ad0b25f1e6e9c0f22/UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom){ target="_blank" } file from [microbiomedb.org](https://microbiomedb.org){ target="_blank" }
115+
116+
config.yml (external .yml files are available in the [GRIMER repository](https://github.com/pirovc/grimer/tree/main/files){ target="_blank" })
117+
```yml
references:
  "Contaminants": "files/contaminants.yml"
  "Human-related": "files/human-related.yml"

external:
  mgnify: "files/mgnify5989.tsv"
  decontam:
    threshold: 0.1 # [0-1] P* hyperparameter
    method: "frequency" # frequency, prevalence, combined
```
129+
130+
Running GRIMER with DECONTAM and MGnify integration
131+
```bash
grimer --input-file UgandaMaternalV3V4.16s_DADA2.taxon_abundance.biom \
       --config config.yml \
       --decontam --mgnify \
       --taxonomy ncbi \
       --ranks superkingdom phylum class order family genus species
```
