Update `templates/analyse_mapping.py`: the Pysam module now reads the SAM input file in place of plain Python code. The output format is unchanged. Using Pysam required splitting the conda environment `workflow-env` into two separate environments, `workflow-py` and `workflow-r`. That split entailed adjustments to README.md, nextflow.config, main.nf and Dockerfile.
README.md: 15 additions & 12 deletions
@@ -11,11 +11,12 @@ A simple Nextflow workflow designed to map short sequences to human genome and p
 3.[Running the workflow](#3)
 
 ### <a name="1">1. Environment setup</a>
-The workflow is intendent to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.3.1) and the following packages:
-* python 3.9.16
+The workflow is intendent to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.5.0) and the following packages:
+* python 3.10.11
 * pip 23.1.2
 * numpy 1.24.3
 * pandas 2.0.2
+* pysam 0.21.0
 * pyensembl 2.2.8
 * r-base 4.2.0
 * bioconductor-tcgabiolinks 2.25.3
@@ -29,21 +30,22 @@ To run the workflow three steps must be taken. Firstly, Nextflow must be install
-Important: `params.condaEnv` in `nextflow.config` file must indicated the path of the `workflow-env`. The default setting is `params.condaEnv = '/miniconda3/envs/workflow-env'`, and it is fit for usage in a Docker container. If you use the workflow in another way, please remember to change that to a valid path.
+Important: `params.condaEnvPy` and `params.condaEnvR` in `nextflow.config` file must indicated the path of the `workflow-py` and `workflow-r`, respectively. The default settings are `params.condaEnvPy = '/miniconda3/envs/workflow-py'` and `params.condaEnvR = '/miniconda3/envs/workflow-r'`, and it is fit for usage in a Docker container. If you use the workflow in another way, please remember to change those to valid paths of existing conda environments.
 
-Finally, the `pyensembl` package is supposed to be installed using `pip` into the `workflow-env` environment:
+Finally, the `pyensembl` package is supposed to be installed using `pip` into the `workflow-py` environment:
 ##### <a name="1.2">1.2. Automatic environment setup with Docker</a>
@@ -57,15 +59,16 @@ Then you can create a container and run it, e.g. interactively like this:
 docker run -it workflow-ubuntu:22.04
 ```
 
-You can download a ready-to-use `workflow-ubuntu:22.04` image [here](https://drive.google.com/file/d/1hm3M41m0Ps8cAvBeXfOuJvnovGW47ezE/view?usp=drive_link) (2.3 GB).
+You can download a ready-to-use `workflow-ubuntu:22.04` image [here](https://drive.google.com/file/d/1i_Q9ittRX2utEBnbYEsc_xJ_tEzo9IG2/view?usp=drive_link) (2.6 GB).
 Below you will find a tree of all workflow files that are provided. When the workflow is launched, the output files will be published in a subdirectory named `output`.
 ```
 <workflow_location>/
 ├── conda/
-│ └── workflow-env.txt
+│ ├── workflow-py.txt
+│ └── workflow-r.txt
 ├── input/
 │ ├── library.fa
 │ └── TCGA_samples.txt
@@ -96,7 +99,7 @@ The workflow consists of the following stages/processes:
 | 1. |`buildIndex`| Using `bowtie-build`, builds reference sequence index from sequences in the input `params.genomeFastaFile`. Uses `index/genome` as the index prefix and saves the index to the `output` subdirectory. |
 | 2. |`mapReads`| Using `bowtie2`, maps reads from the `params.readsFile` to `params.genomeFastaFile` reference. Saves the results to a gzipped SAM file `output/mapping.sam.gz`. |
 | 3. |`filterMapping`| Using `samtools view`, filters the mapping results in respect to MAPQ values (>= 30). Saves the results to a gzipped SAM file `output/mapping_filtered.sam.gz`. |
-| 4. |`analyseMapping`| Using `templates/analyse_mapping.py` Python script, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file `mapping_analysis.tsv.gz`. Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`), renders the `end` (based on CIGAR) and `strand` (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. |
+| 4. |`analyseMapping`| Using `templates/analyse_mapping.py` Python script that utilises Pysam module, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file `mapping_analysis.tsv.gz`. Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`), renders the `end` (based on CIGAR) and `strand` (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. The `pos` and `end` are 1-based and both inclusive, which corresponds to GenBank notation. |
 | 5. |`analyseGenes`| Using `templates/analyse_genes.py` Python script that utilises PyEnsembl module, obtains information of genes the input reads were mapped within. It uses `params.genomeGtfFile` that indicates the location of the file with annotations for the reference sequences. Saves the results to a gzipped TSV file `gene_analysis.tsv.gz`. The output file contains `qname` column (a read sequence id) next to `gene_names` and `gene_ids` columns that contain respectively gene names and their ids obtained from Ensembl database. If there is more than one gene in the locus where a given read was mapped, names/ids are separated by a semicolon followed by space (`'; '`). The resulting data may be used to check whether the gene name provided in a read sequence id (_qname_) may be found among names obtained from Ensembl database based on a read location. |
 | 6. |`fetchMatrix`| Using `templates/fetch_matrix.r` R script that utilises TCGAbiolinks R Bioconductor module, obtains expression matrices for samples, the name of which are given in the `params.samplesTxtFile`. Saves the results to a gzipped TSV file `gene_matrix.tsv.gz`. The first column of the output file is an index column that contains gene ids (selected during the previous stage), and the remaining columns contain expression data for the samples in the order their ids are provided in the input `params.samplesTxtFile`. |
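
The `analyseMapping` row above derives `end` from CIGAR and `strand` from FLAG. As a rough illustration of that arithmetic, here is a minimal pure-Python sketch. It is not the code in `templates/analyse_mapping.py`, which uses Pysam. The function names `reference_end` and `strand_from_flag` and the `'+'`/`'-'` strand labels are assumptions made for this example; the set of reference-consuming CIGAR operations (M, D, N, =, X) and the 0x10 reverse-strand bit follow the SAM specification.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING_OPS = frozenset("MDN=X")
CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos, cigar):
    """Return the 1-based inclusive end of an alignment that starts at 1-based `pos`."""
    ref_len = sum(
        int(length)
        for length, op in CIGAR_PATTERN.findall(cigar)
        if op in REF_CONSUMING_OPS
    )
    return pos + ref_len - 1


def strand_from_flag(flag):
    """Return '-' if FLAG bit 0x10 (read reverse-complemented) is set, else '+'.

    The '+'/'-' labels are an assumption of this sketch, not taken from the workflow.
    """
    return "-" if flag & 0x10 else "+"
```

For example, a read at `pos` 100 with CIGAR `5S10M2I3D5M` covers 18 reference bases (10M + 3D + 5M), so its inclusive `end` is 117. Soft clips and insertions do not advance the reference position. When using Pysam instead, `AlignedSegment.reference_end` is 0-based and exclusive, so it should be numerically equal to this 1-based inclusive end.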