Merge branch 'master' into update-nf-customize
vdauwera authored Oct 4, 2024
2 parents 5969095 + 7fc27fc commit 6432f6e
Showing 44 changed files with 1,214 additions and 636 deletions.
33 changes: 33 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,33 @@
{
    "name": "nfcore",
    "image": "nfcore/gitpod:latest",
    "remoteUser": "gitpod",

    // Configure tool-specific properties.
    "customizations": {
        // Configure properties specific to VS Code.
        "vscode": {
            // Set *default* container specific settings.json values on container create.
            "settings": {
                "python.defaultInterpreterPath": "/opt/conda/bin/python",
                "python.linting.enabled": true,
                "python.linting.pylintEnabled": true,
                "python.formatting.autopep8Path": "/opt/conda/bin/autopep8",
                "python.formatting.yapfPath": "/opt/conda/bin/yapf",
                "python.linting.flake8Path": "/opt/conda/bin/flake8",
                "python.linting.pycodestylePath": "/opt/conda/bin/pycodestyle",
                "python.linting.pydocstylePath": "/opt/conda/bin/pydocstyle",
                "python.linting.pylintPath": "/opt/conda/bin/pylint"
            },

            // Add the IDs of extensions you want installed when the container is created.
            "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
        }
    },
    "portsAttributes": {
        "3000": {
            "label": "Application",
            "onAutoForward": "openPreview"
        }
    }
}
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,13 @@
repos:
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: "v3.1.0"
    hooks:
      - id: prettier
        additional_dependencies:
          - [email protected]

  - repo: https://github.com/editorconfig-checker/editorconfig-checker.python
    rev: "2.7.3"
    hooks:
      - id: editorconfig-checker
        alias: ec
6 changes: 3 additions & 3 deletions docs/advanced/configuration.md
@@ -36,10 +36,10 @@ These configuration values would be inherited by every run on that system withou

## Overriding for a run - `$PWD/nextflow.config`

Move into the chapter example directory:
Create a chapter example directory:

```
cd configuration
mkdir configuration && cd configuration
```
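
As a rough sketch of what a run-specific override could look like (the directive values below are illustrative, not taken from the training material), a `nextflow.config` placed in the launch directory might contain:

```groovy
// $PWD/nextflow.config - applies only to runs launched from this directory
process {
    cpus   = 2
    memory = 4.GB
}
```

Any run launched from this directory would pick these values up on top of the system-wide configuration.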

### Overriding Process Directives
@@ -72,7 +72,7 @@ Glob pattern matching can also be used:

```groovy
process {
withLabel: '.*:INDEX' {
withName: '.*:INDEX' {
cpus = 2
}
}
167 changes: 134 additions & 33 deletions docs/basic_training/cache_and_resume.md
@@ -107,7 +107,7 @@ It’s good practice to organize each **experiment** in its own folder. The main
The `nextflow log` command lists the executions run in the current folder:

```console
$ nextflow log
nextflow log
```

```console title="Output"
@@ -146,7 +146,7 @@ nextflow log tiny_fermat
The `-f` (fields) option can be used to specify which metadata should be printed by the `log` command:

```console
$ nextflow log tiny_fermat -f 'process,exit,hash,duration'
nextflow log tiny_fermat -f 'process,exit,hash,duration'
```

```console title="Output"
@@ -167,7 +167,7 @@ nextflow log -l
The `-F` option allows the specification of filtering criteria to print only a subset of tasks:

```console
$ nextflow log tiny_fermat -F 'process =~ /fastqc/'
nextflow log tiny_fermat -F 'process =~ /fastqc/'
```

```console title="Output"
@@ -283,58 +283,159 @@ process FOO {
    val x
    output:
    tuple val(task.index), val(x)
    stdout
    script:
    """
    sleep \$((RANDOM % 3))
    echo -n "$x"
    """
}
process BAR {
    input:
    val x
    output:
    stdout
    script:
    """
    echo -n "$x" | tr '[:upper:]' '[:lower:]'
    """
}
process FOOBAR {
    input:
    val foo
    val bar
    output:
    stdout
    script:
    """
    echo $foo - $bar
    """
}
workflow {
    channel.of('A', 'B', 'C', 'D') | FOO | view
    ch_letters = channel.of('A', 'B', 'C', 'D')
    FOO(ch_letters)
    BAR(ch_letters)
    FOOBAR(FOO.out, BAR.out).view()
}
```

Just like you saw at the beginning of this tutorial with HELLO WORLD or WORLD HELLO, the output of the snippet above can be:
Processes FOO and BAR receive the same inputs and each return something on standard output. Process FOOBAR was written to take those processed outputs and generate a combined output from matched inputs. However, even though the order of the input values is fixed, FOO and BAR emit their outputs whenever their tasks complete, without respecting the input order. FOOBAR therefore does not receive the two channels in matching order, as you can see from the output:

```console title="Output"
[0, A]
[3, C]
[4, D]
[2, B]
[1, A]
...
B - c

D - a

C - d

A - b
```

..and that order will likely be different every time the workflow is run.
So D is matched with 'a' here, which was not the intention. The order will likely be different every time the workflow is run, meaning that the processing is not deterministic and caching will not work either, since the inputs to FOOBAR vary from run to run.

!!! question "Exercise"

Re-run the above code a couple of times using `-resume`, and determine whether the FOOBAR process re-runs or uses cached results.

??? solution

You should see that while FOO and BAR reliably re-use their cache, FOOBAR will re-run at least a subset of its tasks due to differences in the combinations of inputs it receives.

Imagine that you now have two processes like this, whose output channels feed a third process. The order of each channel is independently random, so the third process must not expect them to stay paired. If it assumes that the first element of one process's output channel corresponds to the first element of the other's, there will be a mismatch.
The output will look like this:

A common solution for this is to use what is commonly referred to as a _meta map_. A groovy object with sample information is passed out together with the file results within an output channel as a tuple. This can then be used to pair samples from separate channels together for downstream use. For example, instead of putting just `/some/path/myoutput.bam` into a channel, you could use `['SRR123', '/some/path/myoutput.bam']` to make sure the processes do not run into a mismatch. Check the example below:
```console title="Output"
[58/f117ed] FOO (4) [100%] 4 of 4, cached: 4 ✔
[84/e88fd9] BAR (4) [100%] 4 of 4, cached: 4 ✔
[6f/d3f672] FOOBAR (1) [100%] 4 of 4, cached: 2 ✔
D - c

A - d

C - a

B - b
```

A common solution for this is to use what is commonly referred to as a _meta map_: a Groovy object containing sample information that is passed along with the file results inside an output channel, as a tuple. This can then be used to pair samples from separate channels for downstream use.

To illustrate, here is a change to the above workflow, with meta maps added:

```groovy linenums="1" title="snippet.nf"
// For example purposes only.
// These would normally be outputs from upstream processes.
Channel
    .of(
        [[id: 'sample_1'], '/path/to/sample_1.bam'],
        [[id: 'sample_2'], '/path/to/sample_2.bam']
    )
    .set { bam }
process FOO {
    input:
    tuple val(meta), val(x)
// NB: sample_2 is now the first element, instead of sample_1
Channel
    .of(
        [[id: 'sample_2'], '/path/to/sample_2.bai'],
        [[id: 'sample_1'], '/path/to/sample_1.bai']
    output:
    tuple val(meta), stdout
    script:
    """
    sleep \$((RANDOM % 3))
    echo -n "$x"
    """
}
process BAR {
    input:
    tuple val(meta), val(x)
    output:
    tuple val(meta), stdout
    script:
    """
    echo -n "$x" | tr '[:upper:]' '[:lower:]'
    """
}
process FOOBAR {
    input:
    tuple val(meta), val(foo), val(bar)
    output:
    stdout
    script:
    """
    echo $foo - $bar
    """
}
workflow {
    ch_letters = channel.of(
        [[id: 'A'], 'A'],
        [[id: 'B'], 'B'],
        [[id: 'C'], 'C'],
        [[id: 'D'], 'D']
    )
    .set { bai }
    FOO(ch_letters)
    BAR(ch_letters)
    FOOBAR(FOO.out.join(BAR.out)).view()
}
```

Now we define `ch_letters` with a meta map (e.g. `[id: 'A']`). Both FOO and BAR pass `meta` through and attach it to their outputs. Then, in the call to FOOBAR, we use the `join` operator to ensure that only matched values are passed. Running this code gives us matched outputs, as we'd expect:

```console title="Output"
...
D - d

B - b

A - a

// Instead of feeding the downstream process with these two channels separately, you can
// join them and provide a single channel where the sample meta map is implicitly matched:
bam
.join(bai)
| PROCESS_C
C - c
```
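
To see what `join` is doing here in isolation, here is a minimal sketch you could run on its own (the channel contents are illustrative, not taken from the training material). By default, `join` matches elements on the first item of each tuple, here the meta map, so pairing is preserved regardless of emission order:

```groovy
workflow {
    // Hypothetical keyed channels; in practice these would be process outputs
    ch_a = channel.of([[id: 'sample_1'], 'A'], [[id: 'sample_2'], 'B'])
    ch_b = channel.of([[id: 'sample_2'], 'b'], [[id: 'sample_1'], 'a'])

    // join pairs elements that share the same first element (the meta map),
    // regardless of the order in which each channel emits them
    ch_a.join(ch_b).view()
    // e.g. [[id:sample_1], A, a] and [[id:sample_2], B, b]
}
```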

If meta maps are not possible, an alternative is to use the [`fair`](https://nextflow.io/docs/edge/process.html#fair) process directive. When this directive is specified, Nextflow will guarantee that the order of outputs will match the order of inputs (not the order in which the tasks run, only the order of the output channel).
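
As a rough sketch of how that could look, here is the FOO process from above with the directive added (usage follows the linked documentation; treat it as an illustration rather than the training material's own solution):

```groovy
process FOO {
    // With 'fair' enabled, outputs are emitted in the same order as the
    // corresponding inputs, even if individual tasks finish out of order
    fair true

    input:
    val x

    output:
    stdout

    script:
    """
    sleep \$((RANDOM % 3))
    echo -n "$x"
    """
}
```
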
2 changes: 1 addition & 1 deletion docs/basic_training/channels.md
@@ -42,7 +42,7 @@ ch.view() // (1)!

!!! question "Exercise"

The script `snippet1.nf` contains the code from above. Execute it with Nextflow and view the output.
The script `snippet.nf` contains the code from above. Execute it with Nextflow and view the output.

??? solution

18 changes: 11 additions & 7 deletions docs/basic_training/containers.md
@@ -85,26 +85,26 @@ To exit from the container, stop the BASH session with the `exit` command.

### Your first Dockerfile

Docker images are created by using a so-called `Dockerfile`, a simple text file containing a list of commands to assemble and configure the image with the software packages required. For example, a Dockerfile to create a container with `cowsay` installed could be as simple as this:
Docker images are created by using a so-called `Dockerfile`, a simple text file containing a list of commands to assemble and configure the image with the software packages required. For example, a Dockerfile to create a container with `curl` installed could be as simple as this:

```dockerfile linenums="1" title="Dockerfile"
FROM debian:bullseye-slim

LABEL image.author.name "Your Name Here"
LABEL image.author.email "[email protected]"

RUN apt-get update && apt-get install -y curl cowsay
RUN apt-get update && apt-get install -y curl

ENV PATH=$PATH:/usr/games/
```

Once your Dockerfile is ready, you can build the image by using the `build` command. For example:

```bash
docker build -t <my-image> .
docker build -t my-image .
```

Where `<my-image>` is the user-specified name for the container image you plan to build.
Where `my-image` is the user-specified name for the container image you plan to build.

!!! tip

@@ -193,6 +193,10 @@ docker build -t my-image .

### Run Salmon in the container

!!! tip

If you didn't complete the steps above, use the `rnaseq-nf` image from elsewhere in these materials by specifying `nextflow/rnaseq-nf` in place of `my-image` in the following examples.

You can run the software installed in the container by using the `run` command. For example, you can check that Salmon is running correctly in the container generated above by using the following command:

```bash
@@ -609,8 +613,8 @@ Contrary to other registries that will pull the latest image when no tag (versio
You can also install `galaxy-tool-util` and search for _mulled_ containers from the command line. You'll find instructions below, using conda to install the tool.

```bash
conda activate a-conda-env-you-already-have
conda install galaxy-tool-util
conda create -n galaxy-tool-util -y galaxy-tool-util # Create a new environment with 'galaxy-tool-util' installed
conda activate galaxy-tool-util
mulled-search --destination quay singularity --channel bioconda --search bowtie samtools | grep mulled
```

@@ -670,7 +674,7 @@ Nextflow automatically sets up an environment for the given package names listed

!!! question "Exercise"

The tools `fastqc` and `salmon` are both available in BioContainers. Add the appropriate `container` directives to the `FASTQC` and `QUANTIFICATION` processes in `script5.nf` to use BioContainers instead of the container image you have been using in this training.
The tools `fastqc` and `salmon` are both available in BioContainers (`biocontainers/fastqc:v0.11.5` and `quay.io/biocontainers/salmon:1.7.0--h84f40af_0`, respectively). Add the appropriate `container` directives to the `FASTQC` and `QUANTIFICATION` processes in `script5.nf` to use BioContainers instead of the container image you have been using in this training.

!!! tip "Hint"
