Commit 11b5d73

Update templates/analyse_mapping.py: the Pysam module now reads the SAM input file in place of plain Python code. The output format is unchanged. Using Pysam required splitting the conda environment workflow-env into two separate environments, workflow-py and workflow-r. That split entailed adjustments to README.md, nextflow.config, main.nf and Dockerfile.
1 parent d0b7434 commit 11b5d73

7 files changed: 259 additions and 229 deletions.

Dockerfile

Lines changed: 15 additions & 10 deletions
@@ -7,17 +7,20 @@
 FROM ubuntu:22.04
 
 # Set arguments identifying image paths for the Nextflow workflow directory
-# (WORKFLOW_DIR), Miniconda directory (MINICONDA_DIR) and the name for the
-# conda environment that will be used by the workflow (ENV_NAME).
-# Important: MINICONDA_DIR/envs/ENV_NAME must be equivalent to the workflow
-# conda environment location defined in nextflow.config file.
+# (WORKFLOW_DIR), Miniconda directory (MINICONDA_DIR) and names for the
+# conda environments that will be used by the workflow (PY_ENV_NAME and R_ENV_NAME).
+# Important: MINICONDA_DIR/envs/PY_ENV_NAME and MINICONDA_DIR/envs/R_ENV_NAME
+# must be equivalent to locations of conda environments that are used by
+# the workflow and are defined in nextflow.config file respectively as
+# params.condaEnvPy and params.condaEnvR.
 ARG WORKFLOW_DIR=/hg-mapping
 ARG MINICONDA_DIR=/miniconda3
-ARG ENV_NAME=workflow-env
+ARG PY_ENV_NAME=workflow-py
+ARG R_ENV_NAME=workflow-r
 
 # Set the working directory to the image workflow location and copy the workflow
 # directories and files to that location (including conda subdirectory
-# containing the workflow conda environment file).
+# containing the workflow conda environment files).
 WORKDIR $WORKFLOW_DIR
 COPY conda/. conda
 COPY input/. input
@@ -30,15 +33,17 @@ COPY nextflow.config ./
 # - once Miniconda is installed, remove the installer
 # - add Miniconda bin directory to PATH
 # - in the base conda environment install Nextflow (ver. 23.04.1)
-# - create the workflow conda environment from conda/workflow-env.txt file
-# - using pip package manager install in that environment PyEnsembl (ver. 2.2.8)
+# - create the workflow conda environments from conda/workflow-py.txt and
+# conda/workflow-r.txt files
+# - using pip package manager install in the first environment PyEnsembl (ver. 2.2.8)
 RUN apt update
 RUN apt install -y wget
 RUN wget "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
 RUN bash Miniconda3-latest-Linux-x86_64.sh -bp $MINICONDA_DIR
 RUN rm Miniconda3-latest-Linux-x86_64.sh
 ENV PATH="$MINICONDA_DIR/bin:${PATH}"
 RUN conda install -y -c bioconda -c conda-forge nextflow==23.04.1
-RUN conda create -y --prefix $MINICONDA_DIR/envs/$ENV_NAME --file conda/workflow-env.txt
-RUN yes | $MINICONDA_DIR/envs/$ENV_NAME/bin/pip install pyensembl==2.2.8
+RUN conda create -y --prefix $MINICONDA_DIR/envs/$PY_ENV_NAME --file conda/workflow-py.txt
+RUN conda create -y --prefix $MINICONDA_DIR/envs/$R_ENV_NAME --file conda/workflow-r.txt
+RUN yes | $MINICONDA_DIR/envs/$PY_ENV_NAME/bin/pip install pyensembl==2.2.8

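The Dockerfile comments require each `MINICONDA_DIR/envs/<name>` prefix to match the corresponding `params.condaEnvPy` or `params.condaEnvR` value. A minimal Python sketch of that consistency check follows, assuming the default values quoted in this commit. The names `env_prefix`, `find_mismatches` and `nextflow_params` are hypothetical and only stand in for values that would normally be read from `nextflow.config`.

```python
import posixpath

# Build-time defaults copied from the Dockerfile ARG lines.
MINICONDA_DIR = "/miniconda3"
PY_ENV_NAME = "workflow-py"
R_ENV_NAME = "workflow-r"

# Hypothetical stand-in for the values set in nextflow.config
# (defaults as quoted in README.md).
nextflow_params = {
    "condaEnvPy": "/miniconda3/envs/workflow-py",
    "condaEnvR": "/miniconda3/envs/workflow-r",
}


def env_prefix(miniconda_dir, env_name):
    """Return the prefix passed to `conda create --prefix` in the Dockerfile."""
    return posixpath.join(miniconda_dir, "envs", env_name)


def find_mismatches(params):
    """List the parameter names whose value differs from the Dockerfile prefix."""
    expected = {
        "condaEnvPy": env_prefix(MINICONDA_DIR, PY_ENV_NAME),
        "condaEnvR": env_prefix(MINICONDA_DIR, R_ENV_NAME),
    }
    return [name for name, path in expected.items() if params.get(name) != path]


if __name__ == "__main__":
    # An empty list means the image and the workflow configuration agree.
    print(find_mismatches(nextflow_params))
```

If either `ARG` is overridden at build time, the same override would have to be reflected in `nextflow.config` for the list to stay empty.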
README.md

Lines changed: 15 additions & 12 deletions
@@ -11,11 +11,12 @@ A simple Nextflow workflow designed to map short sequences to human genome and p
 3. [Running the workflow](#3)
 
 ### <a name="1">1. Environment setup</a>
-The workflow is intended to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.3.1) and the following packages:
-* python 3.9.16
+The workflow is intended to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.5.0) and the following packages:
+* python 3.10.11
 * pip 23.1.2
 * numpy 1.24.3
 * pandas 2.0.2
+* pysam 0.21.0
 * pyensembl 2.2.8
 * r-base 4.2.0
 * bioconductor-tcgabiolinks 2.25.3
@@ -29,21 +30,22 @@ To run the workflow three steps must be taken. Firstly, Nextflow must be install
 conda install -c bioconda -c conda-forge nextflow==23.04.1
 ```
 
-Then `workflow-env` environment should be created using `conda/workflow-env.txt` file:
+Then `workflow-py` and `workflow-r` environments should be created using `conda/workflow-py.txt` and `conda/workflow-r.txt` files, respectively:
 ```bash
-conda create --name workflow-env --file conda/workflow-env.txt
+conda create --name workflow-py --file conda/workflow-py.txt
+conda create --name workflow-r --file conda/workflow-r.txt
 ```
-Important: `params.condaEnv` in `nextflow.config` file must indicate the path of the `workflow-env`. The default setting is `params.condaEnv = '/miniconda3/envs/workflow-env'`, and it is fit for usage in a Docker container. If you use the workflow in another way, please remember to change that to a valid path.
+Important: `params.condaEnvPy` and `params.condaEnvR` in `nextflow.config` file must indicate the paths of the `workflow-py` and `workflow-r` environments, respectively. The default settings are `params.condaEnvPy = '/miniconda3/envs/workflow-py'` and `params.condaEnvR = '/miniconda3/envs/workflow-r'`, and they are fit for usage in a Docker container. If you use the workflow in another way, please remember to change those to valid paths of existing conda environments.
 
-Finally, the `pyensembl` package is supposed to be installed using `pip` into the `workflow-env` environment:
+Finally, the `pyensembl` package is supposed to be installed using `pip` into the `workflow-py` environment:
 ```bash
-conda activate workflow-env
+conda activate workflow-py
 pip install pyensembl==2.2.8
-conda deactivate workflow-env
+conda deactivate
 ```
 or
 ```bash
-<path_to_workflow-env_directory>/bin/pip install pyensembl==2.2.8
+<path_to_workflow-py_directory>/bin/pip install pyensembl==2.2.8
 ```
 
 ##### <a name="1.2">1.2. Automatic environment setup with Docker</a>
@@ -57,15 +59,16 @@ Then you can create a container and run it, e.g. interactively like this:
 docker run -it workflow-ubuntu:22.04
 ```
 
-You can download a ready-to-use `workflow-ubuntu:22.04` image [here](https://drive.google.com/file/d/1hm3M41m0Ps8cAvBeXfOuJvnovGW47ezE/view?usp=drive_link) (2.3&nbsp;GB).
+You can download a ready-to-use `workflow-ubuntu:22.04` image [here](https://drive.google.com/file/d/1i_Q9ittRX2utEBnbYEsc_xJ_tEzo9IG2/view?usp=drive_link) (2.6&nbsp;GB).
 
 ### <a name="2">2. Workflow detailed description</a>
 ##### <a name="2.1">2.1. Workflow tree</a>
 Below you will find a tree of all workflow files that are provided. When the workflow is launched, the output files will be published in a subdirectory named `output`.
 ```
 <workflow_location>/
 ├── conda/
-│   └── workflow-env.txt
+│   ├── workflow-py.txt
+│   └── workflow-r.txt
 ├── input/
 │   ├── library.fa
 │   └── TCGA_samples.txt
@@ -96,7 +99,7 @@ The workflow consists of the following stages/processes:
 | 1. | `buildIndex` | Using `bowtie-build`, builds reference sequence index from sequences in the input `params.genomeFastaFile`. Uses `index/genome` as the index prefix and saves the index to the `output` subdirectory. |
 | 2. | `mapReads` | Using `bowtie2`, maps reads from the `params.readsFile` to `params.genomeFastaFile` reference. Saves the results to a gzipped SAM file `output/mapping.sam.gz`. |
 | 3. | `filterMapping` | Using `samtools view`, filters the mapping results with respect to MAPQ values (>= 30). Saves the results to a gzipped SAM file `output/mapping_filtered.sam.gz`. |
-| 4. | `analyseMapping` | Using `templates/analyse_mapping.py` Python script, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file `mapping_analysis.tsv.gz`. Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`), renders the `end` (based on CIGAR) and `strand` (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. |
+| 4. | `analyseMapping` | Using `templates/analyse_mapping.py` Python script that utilises Pysam module, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file `mapping_analysis.tsv.gz`. Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`), renders the `end` (based on CIGAR) and `strand` (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. The `pos` and `end` are 1-based and both inclusive, which corresponds to GenBank notation. |
 | 5. | `analyseGenes` | Using `templates/analyse_genes.py` Python script that utilises PyEnsembl module, obtains information of genes the input reads were mapped within. It uses `params.genomeGtfFile` that indicates the location of the file with annotations for the reference sequences. Saves the results to a gzipped TSV file `gene_analysis.tsv.gz`. The output file contains `qname` column (a read sequence id) next to `gene_names` and `gene_ids` columns that contain respectively gene names and their ids obtained from Ensembl database. If there is more than one gene in the locus where a given read was mapped, names/ids are separated by a semicolon followed by space (`'; '`). The resulting data may be used to check whether the gene name provided in a read sequence id (_qname_) may be found among names obtained from Ensembl database based on a read location. |
 | 6. | `fetchMatrix` | Using `templates/fetch_matrix.r` R script that utilises TCGAbiolinks R Bioconductor module, obtains expression matrices for samples, the names of which are given in the `params.samplesTxtFile`. Saves the results to a gzipped TSV file `gene_matrix.tsv.gz`. The first column of the output file is an index column that contains gene ids (selected during the previous stage), and the remaining columns contain expression data for the samples in the order their ids are provided in the input `params.samplesTxtFile`. |

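The `analyseMapping` row describes deriving `end` from CIGAR and `strand` from FLAG, with `pos` and `end` both 1-based and inclusive. The sketch below illustrates that arithmetic in plain Python; it is not the code of `templates/analyse_mapping.py`, which is not shown in this diff. The function names and the `'+'`/`'-'` strand labels are assumptions made for illustration. Pysam exposes comparable values as `AlignedSegment.reference_end` (0-based exclusive, so numerically equal to the 1-based inclusive end) and `AlignedSegment.is_reverse`.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING_OPS = frozenset("MDN=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def end_position(pos, cigar):
    """Return the 1-based inclusive end of an alignment whose 1-based start is `pos`."""
    ref_length = sum(
        int(length)
        for length, op in CIGAR_TOKEN.findall(cigar)
        if op in REF_CONSUMING_OPS
    )
    return pos + ref_length - 1


def strand(flag):
    """Return '-' when FLAG bit 0x10 (read on the reverse strand) is set, else '+'."""
    return "-" if flag & 0x10 else "+"


if __name__ == "__main__":
    # 10 matches starting at position 100 cover positions 100..109.
    print(end_position(100, "10M"), strand(0))
    # Soft clips and insertions do not advance the reference; deletions do.
    print(end_position(100, "5S10M2D3M1I4M"), strand(16))
```

The sketch assumes mapped reads with a real CIGAR string, which is what the MAPQ filter in the previous stage should leave; an unmapped record with CIGAR `*` would need separate handling.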
conda/workflow-py.txt

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# This file may be used to create an environment using:
+# $ conda create --name <env> --file <this file>
+# platform: linux-64
+@EXPLICIT
+https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2023.5.7-hbcca054_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.40-h41732ed_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libgfortran5-13.1.0-h15d22d2_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libstdcxx-ng-13.1.0-hfd8a6a1_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/python_abi-3.10-3_cp310.conda
+https://conda.anaconda.org/conda-forge/noarch/tzdata-2023c-h71feb2d_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libgfortran-ng-13.1.0-h69a702a_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libgomp-13.1.0-he5830b7_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-2_gnu.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libgcc-ng-13.1.0-he5830b7_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.8-h7f98852_4.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/c-ares-1.19.1-hd590300_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/gzip-1.12-h166bdaf_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/icu-72.1-hcb278e6_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/keyutils-1.6.1-h166bdaf_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libdeflate-1.18-h0b41bf4_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libev-4.33-h516909a_1.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libffi-3.4.2-h7f98852_5.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libiconv-1.17-h166bdaf_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libnsl-2.0.0-h7f98852_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libopenblas-0.3.23-pthreads_h80387f5_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libuuid-2.38.1-h0b41bf4_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libzlib-1.2.13-hd590300_5.conda
+https://conda.anaconda.org/conda-forge/linux-64/ncurses-6.4-hcb278e6_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/openssl-3.1.1-hd590300_1.conda
+https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.6-h166bdaf_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libblas-3.9.0-17_linux64_openblas.conda
+https://conda.anaconda.org/conda-forge/linux-64/libedit-3.1.20191231-he28a2e2_2.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/libnghttp2-1.52.0-h61bc06f_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libsqlite-3.42.0-h2797004_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libssh2-1.11.0-h0841786_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libxml2-2.11.4-h0d562d8_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/perl-5.32.1-2_h7f98852_perl5.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/readline-8.2-h8228510_1.conda
+https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.12-h27826a3_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.13-hd590300_5.conda
+https://conda.anaconda.org/conda-forge/linux-64/zstd-1.5.2-h3eb15da_6.conda
+https://conda.anaconda.org/conda-forge/linux-64/krb5-1.20.1-h81ceb04_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/libcblas-3.9.0-17_linux64_openblas.conda
+https://conda.anaconda.org/conda-forge/linux-64/libhwloc-2.9.1-nocuda_h7313eea_6.conda
+https://conda.anaconda.org/conda-forge/linux-64/liblapack-3.9.0-17_linux64_openblas.conda
+https://conda.anaconda.org/conda-forge/linux-64/python-3.10.11-he550d4f_0_cpython.conda
+https://conda.anaconda.org/conda-forge/linux-64/libcurl-8.1.2-h409715c_0.conda
+https://conda.anaconda.org/conda-forge/linux-64/numpy-1.24.3-py310ha4c1d20_0.conda
+https://conda.anaconda.org/conda-forge/noarch/python-tzdata-2023.3-pyhd8ed1ab_0.conda
+https://conda.anaconda.org/conda-forge/noarch/pytz-2023.3-pyhd8ed1ab_0.conda
+https://conda.anaconda.org/conda-forge/noarch/setuptools-67.7.2-pyhd8ed1ab_0.conda
+https://conda.anaconda.org/conda-forge/noarch/six-1.16.0-pyh6c4a22f_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/tbb-2021.9.0-hf52228f_0.conda
+https://conda.anaconda.org/conda-forge/noarch/wheel-0.40.0-pyhd8ed1ab_0.conda
+https://conda.anaconda.org/bioconda/linux-64/bowtie2-2.5.1-py310ha0a81b8_2.tar.bz2
+https://conda.anaconda.org/bioconda/linux-64/htslib-1.17-h81da01d_2.tar.bz2
+https://conda.anaconda.org/conda-forge/noarch/pip-23.1.2-pyhd8ed1ab_0.conda
+https://conda.anaconda.org/bioconda/linux-64/pysam-0.21.0-py310h41dec4a_1.tar.bz2
+https://conda.anaconda.org/conda-forge/noarch/python-dateutil-2.8.2-pyhd8ed1ab_0.tar.bz2
+https://conda.anaconda.org/conda-forge/linux-64/pandas-2.0.2-py310h7cbd5c2_0.conda
+https://conda.anaconda.org/bioconda/linux-64/samtools-1.17-hd87286a_1.tar.bz2

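Each entry in an `@EXPLICIT` spec file is a package URL whose file name follows the `<name>-<version>-<build>` pattern, so pinned versions such as those listed in README.md can be cross-checked against it. The following Python sketch shows one way to extract them, assuming the URL layout `<channel>/<subdir>/<file>` seen in `conda/workflow-py.txt`. The names `CondaPackage`, `parse_explicit_line` and `parse_explicit_spec` are hypothetical helpers, not part of conda.

```python
from typing import List, NamedTuple
from urllib.parse import urlparse


class CondaPackage(NamedTuple):
    channel: str
    name: str
    version: str
    build: str


def parse_explicit_line(url: str) -> CondaPackage:
    """Split one package URL from an @EXPLICIT spec into its components."""
    parts = urlparse(url).path.strip("/").split("/")
    channel, filename = parts[0], parts[-1]
    # Drop the archive extension (.tar.bz2 or .conda).
    for extension in (".tar.bz2", ".conda"):
        if filename.endswith(extension):
            filename = filename[: -len(extension)]
            break
    # Package names may contain hyphens, so split from the right.
    name, version, build = filename.rsplit("-", 2)
    return CondaPackage(channel, name, version, build)


def parse_explicit_spec(text: str) -> List[CondaPackage]:
    """Parse a whole spec file, skipping comments and the @EXPLICIT marker."""
    packages = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        packages.append(parse_explicit_line(line))
    return packages


SAMPLE = """\
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.40-h41732ed_0.conda
https://conda.anaconda.org/bioconda/linux-64/pysam-0.21.0-py310h41dec4a_1.tar.bz2
"""

if __name__ == "__main__":
    for package in parse_explicit_spec(SAMPLE):
        print(package.channel, package.name, package.version)
```

Splitting from the right keeps hyphenated names such as `ld_impl_linux-64` intact, on the assumption that versions and build strings themselves contain no hyphens.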