diff --git a/LICENSE b/LICENSE
old mode 100644
new mode 100755
diff --git a/README.md b/README.md
old mode 100644
new mode 100755
index 2c8b00d..a01b81d
--- a/README.md
+++ b/README.md
@@ -1,57 +1,90 @@
-# Nextflow pipeline for ABRA (Assembly Based ReAligner)
+# abra-nf
-Apply [ABRA](https://github.com/mozack/abra) to realign next generation sequencing data using localized assembly in a set of BAM files. After ABRA, the mate information is fixed using [`samtools fixmate`](http://www.htslib.org/doc/samtools.html) and BAM files are sorted and indexed using [sambamba](http://lomereiter.github.io/sambamba/).
+## Nextflow pipeline for ABRA2 (Assembly Based ReAligner)
+
+![Workflow representation](abra-nf.png)
+
+## Description
+
+Apply [ABRA2](https://github.com/mozack/abra2) to realign next generation sequencing data using localized assembly in a set of BAM files.
 This scripts takes a set of [BAM files](https://samtools.github.io/hts-specs/) (called `*.bam`) grouped folders as an input. There are two modes:
 - When using matched tumor/normal pairs, the two samples of each pair are realigned together (see https://github.com/mozack/abra#somatic--mode). In this case the user has to provide as an input the folders containing tumor (`--tumor_bam_folder`) and normal BAM files (`--normal_bam_folder`) (it can be the same unique folder). The tumor bam file format must be (`sample` `suffix_tumor` `.bam`) with `suffix_tumor` as `_T` by default and customizable in input (`--suffix_tumor`). (e.g. `sample1_T.bam`). The normal bam file format must be (`sample` `suffix_normal` `.bam`) with `suffix_normal` as `_N` by default and customizable in input (`--suffix_normal`). (e.g. `sample1_N.bam`).
-- When using only normal (or only tumor) samples, each bam is treated independently. In this case the user has to provide a single folder containing all BAM files (`bam_folder`).
+- When using only normal (or only tumor) samples, each bam is treated independently. In this case the user has to provide a single folder containing all BAM files (`--bam_folder`).
-In all cases BAI indexes have to be present in the same location than their BAM mates and called *.bam.bai`.
+In all cases BAI indexes have to be present in the same location than their BAM mates and called `*.bam.bai`.
-For [ABRA2](https://github.com/mozack/abra2) compatibility, use the option `--abra2`
+Note that ABRA v1 is no longer supported (see the last version supporting it here: https://github.com/IARCbioinfo/abra-nf/releases/tag/v1.0)
-## How to install
+## Dependencies
-1. Install [java](https://java.com/download/) JRE if you don't already have it.
+1. This pipeline is based on [nextflow](https://www.nextflow.io). As we have several nextflow pipelines, we have centralized the common information in the [IARC-nf](https://github.com/IARCbioinfo/IARC-nf) repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
-2. Install [nextflow](http://www.nextflow.io/).
+2. External software:
+- [java](https://www.java.com/)
+- [ABRA2](https://github.com/mozack/abra2) jar file
- ```bash
- curl -fsSL get.nextflow.io | bash
- ```
- And move it to a location in your `$PATH` (`/usr/local/bin` for example here):
- ```bash
- sudo mv nextflow /usr/local/bin
- ```
-
-3. Install and put in your PATH: [java](https://www.java.com/), [bedtools](http://bedtools.readthedocs.io/en/latest/), [bwa](http://bio-bwa.sourceforge.net), [sambamba](http://lomereiter.github.io/sambamba/), [samtools](http://www.htslib.org/) and download ABRA jar. Alternatively (recommended), you can simply use the docker image provided (see below).
+You can avoid installing all the external software by only installing Docker. See the [IARC-nf](https://github.com/IARCbioinfo/IARC-nf) repository for more information.
-## How to run
+## Input
+ * #### In tumor-normal mode
-Simply use example:
-```bash
-nextflow run iarcbioinfo/abra-nf --bam_folder BAM/ --bed target.bed --ref ref.fasta --read_length 100 --abra_path /path/to/abra.jar
-```
+| Name | Description |
+|-----------|---------------|
+| `--tumor_bam_folder` | Folder containing tumor BAM files |
+| `--normal_bam_folder` | Folder containing matched normal BAM files |
+| `--suffix_tumor` | Suffix identifying tumor bam (default: `_T`) |
+| `--suffix_normal` | Suffix identifying normal bam (default: `_N`) |
-By default, BAM files produced are output in the same folder as the input folder with the `abra_sorted_fixmate.bam` suffix. One can also specify the output folder by adding the optional argument `--out_folder BAM_ABRA` to the above command line for example.
+ * #### Otherwise
-You can print the help by providing `--help` in the execution command line:
-```bash
-nextflow run iarcbioinfo/abra-nf --help
-```
+| Name | Description |
+|-----------|---------------|
+| `--bam_folder` | Folder containing BAM files |
+
+## Parameters
-Instead of installing all tools in step 3 above, we recommend to use the docker image we provide containing them by simply adding `-with-docker`:
-```bash
-nextflow run iarcbioinfo/abra-nf -with-docker ...
-```
-Installing [docker](https://www.docker.com) is very system specific (but quite easy in most cases), follow [docker documentation](https://docs.docker.com/installation/). Also follow the optional configuration step called `Create a Docker group` in their documentation.
+ * #### Mandatory
-## Detailed instructions
+| Name | Example value | Description |
+|-----------|---------------|-----------------|
+| `--ref` | `/path/to/ref.fasta` | Reference fasta file indexed |
+| `--abra_path` | `/path/to/abra2.jar` | abra.jar explicit path |
-The exact same pipeline can be run on your computer or on a HPC cluster, by adding a [nextflow configuration file](http://www.nextflow.io/docs/latest/config.html) to choose an appropriate [executor](http://www.nextflow.io/docs/latest/executor.html). For example to work on a cluster using [SGE scheduler](https://en.wikipedia.org/wiki/Oracle_Grid_Engine), simply add a file named `nextflow.config` in the current directory (or `~/.nextflow/config` to make global changes) containing:
-```java
-process.executor = 'sge'
+ * #### Optional
+
+| Name | Default value | Description |
+|-----------|---------------|-----------------|
+| `--bed` | `/path/to/intervals.bed` | Bed file containing intervals |
+| `--mem` | 16 | Maximum RAM used |
+| `--threads` | 4 | Number of threads used |
+| `--output_folder` | `abra_BAM/` | Output folder |
+
+ * #### Flags
+
+Flags are special parameters without value.
+
+| Name | Description |
+|-----------|-----------------|
+| `--help` | Display help |
+| `--single` | Switch to single-end sequencing mode |
+
+## Usage
+
+Simple use case example:
+```bash
+nextflow run iarcbioinfo/abra-nf --bam_folder BAM/ --bed target.bed --ref ref.fasta --abra_path /path/to/abra.jar
 ```
-Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation [here](http://www.nextflow.io/docs/latest/executor.html) for more details. Also have a look at the [other parameters for the executors](http://www.nextflow.io/docs/latest/config.html#scope-executor), in particular `queueSize` that defines the number of tasks the executor will handle in a parallel manner.
+## Output
+ | Type | Description |
+ |-----------|---------------|
+ | ABRA BAM | Realigned BAM files with their indexes |
+
+## Contributions
+
+ | Name | Email | Description |
+ |-----------|---------------|-----------------|
+ | Matthieu Foll* | follm@iarc.fr | Developer to contact for support |
+ | Nicolas Alcala | alcalan@fellows.iarc.fr | Developer |
diff --git a/abra-nf.png b/abra-nf.png
old mode 100644
new mode 100755
index 80f3d69..9c8f578
Binary files a/abra-nf.png and b/abra-nf.png differ
diff --git a/abra-nf.svg b/abra-nf.svg
old mode 100644
new mode 100755
diff --git a/abra.nf b/abra.nf
old mode 100644
new mode 100755
index 9bb3e55..435fd4d
--- a/abra.nf
+++ b/abra.nf
@@ -1,68 +1,90 @@
-#!/usr/bin/env nextflow
+#! /usr/bin/env nextflow
-// requires (in path):
-// java
-// bedtools
-// bwa
-// sambamba
-// samtools
-// abra jar
+//vim: syntax=groovy -*- mode: groovy;-*-
+
+// Copyright (C) 2017 IARC/WHO
+
+// This program is free software: you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation, either version 3 of the License, or
+// (at your option) any later version.
+
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License
+// along with this program. If not, see .
 params.help = null
 params.tumor_bam_folder = null
 params.normal_bam_folder = null
 params.bam_folder = null
 params.bed = null
+params.single = null
 params.ref = null
 params.abra_path = null
-params.read_length = null
-params.abra2 = "false"
+
+log.info ""
+log.info "--------------------------------------------------------"
+log.info " abra2-nf v2.0: Nextflow pipeline for ABRA2 "
+log.info "--------------------------------------------------------"
+log.info "Copyright (C) IARC/WHO"
+log.info "This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE"
+log.info "This is free software, and you are welcome to redistribute it"
+log.info "under certain conditions; see LICENSE for details."
+log.info "--------------------------------------------------------"
+log.info ""
 if (params.help) {
     log.info ''
     log.info '--------------------------------------------------'
-    log.info ' NEXTFLOW for abra '
+    log.info ' USAGE '
     log.info '--------------------------------------------------'
     log.info ''
     log.info 'Usage: '
-    log.info 'nextflow run abra_TN_pairs.nf --tumor_bam_folder tumor_BAM/ --normal_bam_folder normal_BAM/ --bed mybedfile.bed --ref ref.fasta'
+    log.info 'nextflow run iarcbioinfo/abra-nf --tumor_bam_folder tumor_BAM/ --normal_bam_folder normal_BAM/ --ref ref.fasta'
     log.info ''
     log.info 'Mandatory arguments:'
-    log.info ' When using Tumor/Normal pairs:'
+    log.info ' When using Tumor/Normal pairs:'
     log.info ' --tumor_bam_folder FOLDER Folder containing tumor BAM files.'
     log.info ' --normal_bam_folder FOLDER Folder containing matched normal BAM files.'
-    log.info ' In other cases:'
+    log.info ' In other cases:'
     log.info ' --bam_folder FOLDER Folder containing BAM files.'
-    log.info ' In all cases:'
-    log.info ' --bed FILE Bed file containing intervals.'
-    log.info ' --ref FILE (with index) Reference fasta file indexed by bwa.'
+    log.info ' In all cases:'
+    log.info ' --ref FILE (with index) Reference fasta file indexed.'
     log.info ' --abra_path FILE abra.jar explicit path.'
-    log.info ' --read_length INT Read length (e.g.: 100).'
     log.info 'Optional arguments:'
-    log.info ' When using Tumor/Normal pairs:'
+    log.info ' When using Tumor/Normal pairs:'
     log.info ' --suffix_tumor STRING Suffix identifying tumor bam (default: "_T").'
     log.info ' --suffix_normal STRING Suffix identifying normal bam (default: "_N").'
-    log.info ' In all cases:'
+    log.info ' In all cases:'
+    log.info ' --single Flag for single-end sequencing.'
+    log.info ' --bed FILE Bed file containing intervals.'
     log.info ' --mem INTEGER RAM used (in GB, default: 16)'
     log.info ' --threads INTEGER Number of threads (default: 4)'
-    log.info ' --out_folder FOLDER Output folder (default: abra_BAM).'
+    log.info ' --output_folder FOLDER Output folder (default: abra_BAM).'
     log.info ''
     exit 1
 }
 assert (params.ref != true) && (params.ref != null) : "please specify --ref option (--ref reference.fasta(.gz))"
-if(params.bam_folder) {
+if (params.bam_folder) {
     assert (params.bam_folder != true) && (params.bam_folder != null) : "please specify --bam_folder option (--bam_folder bamfolder)"
 } else {
     assert (params.normal_bam_folder != true) && (params.normal_bam_folder != null) : "please specify --normal_bam_folder option (--normal_bam_folder bamfolder)"
     assert (params.tumor_bam_folder != true) && (params.tumor_bam_folder != null) : "please specify --tumor_bam_folder option (--tumor_bam_folder bamfolder)"
 }
-assert (params.bed != true) && (params.bed != null) : "please specify --bed option (--bed regions.bed)"
-assert (params.abra_path != true) && (params.abra_path != null) : "please specify --abra_path option (--abra_path /path/to/abra.jar)"
-assert (params.read_length != true) && (params.read_length != null) : "please specify --read_length option (--read_length 100)"
+if (params.bed!=null) {
+    assert (params.bed != true) : "please specify file when using --bed option (--bed regions.bed)"
+    try { assert file(params.bed).exists() : "\n WARNING : input bed file not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
+}
+bed = params.bed ? file(params.bed) : file('nothing')
+assert (params.abra_path != true) && (params.abra_path != null) : "please specify --abra_path option (--abra_path /path/to/abra.jar)"
 fasta_ref = file(params.ref)
 fasta_ref_fai = file( params.ref+'.fai' )
@@ -73,53 +95,17 @@ fasta_ref_ann = file( params.ref+'.ann' )
 fasta_ref_amb = file( params.ref+'.amb' )
 fasta_ref_pac = file( params.ref+'.pac' )
-bed = file(params.bed)
-
 params.suffix_tumor = "_T"
 params.suffix_normal = "_N"
 params.mem = 16
 params.threads = 4
-params.out_folder = "abra_BAM"
-
-try { assert file(params.bed).exists() : "\n WARNING : input bed file not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
+params.output_folder = "abra_BAM"
 try { assert fasta_ref.exists() : "\n WARNING : fasta reference not located in execution directory. Make sure reference index is in the same folder as fasta reference" } catch (AssertionError e) { println e.getMessage() }
 if (fasta_ref.exists()) {assert fasta_ref_fai.exists() : "input fasta reference does not seem to have a .fai index (use samtools faidx)"}
-if (fasta_ref.exists()) {assert fasta_ref_sa.exists() : "input fasta reference does not seem to have a .sa index (use bwa index)"}
-if (fasta_ref.exists()) {assert fasta_ref_bwt.exists() : "input fasta reference does not seem to have a .bwt index (use bwa index)"}
-if (fasta_ref.exists()) {assert fasta_ref_ann.exists() : "input fasta reference does not seem to have a .ann index (use bwa index)"}
-if (fasta_ref.exists()) {assert fasta_ref_amb.exists() : "input fasta reference does not seem to have a .amb index (use bwa index)"}
-if (fasta_ref.exists()) {assert fasta_ref_pac.exists() : "input fasta reference does not seem to have a .pac index (use bwa index)"}
 if (fasta_ref.exists() && params.ref.tokenize('.')[-1] == 'gz') {assert fasta_ref_gzi.exists() : "input gz fasta reference does not seem to have a .gzi index (use samtools faidx)"}
-assert (params.read_length > 0) : "read length must be higher than 0 (--read_length)"
-
-process bed_kmer_size {
-
-    cpus params.threads
-
-    input:
-    file bed
-    file fasta_ref
-    file fasta_ref_fai
-    file fasta_ref_gzi
-    file fasta_ref_sa
-    file fasta_ref_bwt
-    file fasta_ref_ann
-    file fasta_ref_amb
-    file fasta_ref_pac
-
-    output:
-    file "kmer_size_abra.bed" into bed_kmer
-
-    shell:
-    '''
-    grep -v '^track' !{bed} | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk '{print $1"\t"$2"\t"$3}' > tmp_merged_sorted.bed
-    java -Xmx4G -cp !{params.abra_path} abra.KmerSizeEvaluator !{params.read_length} !{fasta_ref} kmer_size_abra.bed !{params.threads} tmp_merged_sorted.bed
-    '''
-}
-
 if(params.bam_folder) {
     try { assert file(params.bam_folder).exists() : "\n WARNING : input BAM folder not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
@@ -139,42 +125,38 @@ if(params.bam_folder) {
     bam_bai = bams
         .phase(bais)
         .map { bam, bai -> [ bam[1], bai[1] ] }
-
+
     process abra {
         cpus params.threads
-        memory params.mem+'GB'
+        memory params.mem+'GB'
         tag { bam_tag }
-
-        publishDir params.out_folder, mode: 'move', pattern: '*_SV.txt'
+
+        publishDir params.output_folder, mode: 'move'
         input:
         file bam_bai
-        file bed_kmer from bed_kmer.first()
+        file bed
         file fasta_ref
         file fasta_ref_fai
         file fasta_ref_gzi
-        file fasta_ref_sa
+        file fasta_ref_sa
         file fasta_ref_bwt
         file fasta_ref_ann
         file fasta_ref_amb
         file fasta_ref_pac
         output:
-        file("${bam_tag}_abra.bam") into bam_abra
-        file("${bam_tag}_SV.txt") optional true into SV_output
+        file("${bam_tag}_abra.ba*") into bam_out
         shell:
         bam_tag = bam_bai[0].baseName
-        if(params.abra2=="false") abraoptions="--working abra_tmp --sv tmp_SV.txt"
-        else abraoptions="--tmpdir ."
+        abra_single = params.single ? '--single --mapq 20' : ''
+        abra_bed = params.bed ? "--targets $bed" : ''
         '''
-        java -Xmx!{params.mem}g -jar !{params.abra_path} --in !{bam_tag}.bam --out "!{bam_tag}_abra.bam" --ref !{fasta_ref} --target-kmers !{bed_kmer} --threads !{params.threads} !{abraoptions} > !{bam_tag}_abra.log 2>&1
-        if [ -f tmp_SV.txt ]; then
-        mv tmp_SV.txt !{bam_tag}_SV.txt
-        fi
-        '''
+        java -Xmx!{params.mem}g -jar !{params.abra_path} --in !{bam_tag}.bam --out "!{bam_tag}_abra.bam" --ref !{fasta_ref} --tmpdir . --threads !{params.threads} --index !{abra_single} !{abra_bed} > !{bam_tag}_abra.log 2>&1
+        '''
     }
@@ -185,7 +167,7 @@ if(params.bam_folder) {
     try { assert file(params.normal_bam_folder).exists() : "\n WARNING : input normal BAM folder not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
     assert file(params.normal_bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() > 0 : "normal BAM folder contains no BAM"
-    // FOR TUMOR
+    // FOR TUMOR
     // recovering of bam files
     tumor_bams = Channel.fromPath( params.tumor_bam_folder+'/*'+params.suffix_tumor+'.bam' )
         .ifEmpty { error "Cannot find any bam file in: ${params.tumor_bam_folder}" }
@@ -201,7 +183,7 @@ if(params.bam_folder) {
         .phase(tumor_bais)
         .map { tumor_bam, tumor_bai -> [ tumor_bam[0], tumor_bam[1], tumor_bai[1] ] }
-    // FOR NORMAL
+    // FOR NORMAL
     // recovering of bam files
     normal_bams = Channel.fromPath( params.normal_bam_folder+'/*'+params.suffix_normal+'.bam' )
         .ifEmpty { error "Cannot find any bam file in: ${params.normal_bam_folder}" }
@@ -220,74 +202,41 @@ if(params.bam_folder) {
     // building 4-uplets corresponding to {tumor_bam, tumor_bai, normal_bam, normal_bai}
     tn_bambai = tumor_bam_bai
         .phase(normal_bam_bai)
-        .map {tumor_bb, normal_bb -> [ tumor_bb[1], tumor_bb[2], normal_bb[1], normal_bb[2] ] }
+        .map {tumor_bb, normal_bb -> [ tumor_bb[1], tumor_bb[2], normal_bb[1], normal_bb[2] ] }
     // here each element X of tn_bambai channel is a 4-uplet. X[0] is the tumor bam, X[1] the tumor bai, X[2] the normal bam and X[3] the normal bai.
     process abra_TN {
         cpus params.threads
-        memory params.mem+'GB'
+        memory params.mem+'GB'
         tag { tumor_normal_tag }
-        publishDir params.out_folder, mode: 'move', pattern: '*_SV.txt'
+        publishDir params.output_folder, mode: 'move'
         input:
         file tn from tn_bambai
-        file bed_kmer from bed_kmer.first()
+        file bed
         file fasta_ref
         file fasta_ref_fai
         file fasta_ref_gzi
-        file fasta_ref_sa
+        file fasta_ref_sa
         file fasta_ref_bwt
         file fasta_ref_ann
         file fasta_ref_amb
         file fasta_ref_pac
-
         output:
-        // file("${tumor_normal_tag}${params.suffix_normal}_abra.bam") into normal_output
-        // file("${tumor_normal_tag}${params.suffix_tumor}_abra.bam") into tumor_output
-        file '*_abra.bam' into bam_abra mode flatten
-        file("${tumor_normal_tag}_SV.txt") optional true into SV_output
+        file("${tumor_normal_tag}${params.suffix_normal}_abra.ba*") into normal_output
+        file("${tumor_normal_tag}${params.suffix_tumor}_abra.ba*") into tumor_output
         shell:
         tumor_normal_tag = tn[0].baseName.replace(params.suffix_tumor,"")
-        if(params.abra2=="false") abraoptions="--working abra_tmp --sv tmp_SV.txt"
-        else abraoptions="--tmpdir ."
-
-        '''
-        java -Xmx!{params.mem}g -jar !{params.abra_path} --in !{tumor_normal_tag}!{params.suffix_normal}.bam,!{tumor_normal_tag}!{params.suffix_tumor}.bam --out "!{tumor_normal_tag}!{params.suffix_normal}_abra.bam","!{tumor_normal_tag}!{params.suffix_tumor}_abra.bam" --ref !{fasta_ref} --target-kmers !{bed_kmer} --threads !{params.threads} !{abraoptions}> !{tumor_normal_tag}_abra.log 2>&1
-        if [ -f tmp_SV.txt ]; then
-        mv tmp_SV.txt !{tumor_normal_tag}_SV.txt
-        fi
-        '''
+        abra_single = params.single ? '--single --mapq 20' : ''
+        abra_bed = params.bed ? "--targets $bed" : ''
+        '''
+        java -Xmx!{params.mem}g -jar !{params.abra_path} --in !{tumor_normal_tag}!{params.suffix_normal}.bam,!{tumor_normal_tag}!{params.suffix_tumor}.bam --out "!{tumor_normal_tag}!{params.suffix_normal}_abra.bam","!{tumor_normal_tag}!{params.suffix_tumor}_abra.bam" --ref !{fasta_ref} --threads !{params.threads} --index !{abra_single} !{abra_bed} > !{tumor_normal_tag}_abra.log 2>&1
+        '''
     }
 }
-
-process fixmate_sort_index {
-
-    cpus params.threads
-    memory params.mem+'GB'
-
-    tag { bam_tag }
-
-    publishDir params.out_folder, mode: 'move'
-
-    input:
-    file bam_abra
-
-    output:
-    file '*abra_sorted_fixmate.bam*' into final_bam
-
-    shell:
-    bam_tag = bam_abra.baseName
-    half_mem = params.mem.intdiv(2)
-    half_threads = params.threads.intdiv(2) - 1
-    '''
-    set -o pipefail
-    sambamba sort -t !{half_threads} -m !{half_mem}G -n --tmpdir=sort_tmp -o /dev/stdout !{bam_abra} | samtools fixmate - - | sambamba sort -t !{half_threads} -m !{half_mem}G --tmpdir=sort_tmp -o "!{bam_tag}_sorted_fixmate.bam" /dev/stdin
-    '''
-}
-
diff --git a/nextflow.config b/nextflow.config
old mode 100644
new mode 100755