Skip to content

JetBrains-Research/chipseq-smk-pipeline

Repository files navigation

JetBrains Research

chipseq-smk-pipeline

Snakemake based pipeline for ChIP-seq and ATAC-seq datasets processing from raw data QC and alignment to visualization and peak calling.

Scheme

During peak calling steps chipseq-smk-pipeline automatically matches signal with control file by names proximity.

Input

Input FASTQ files

Pipeline aligned FASTQ or gzipped FASTQ reads, defined in config.yaml.
Reads folder is a relative path in pipeline working directory and defined by fastq_dir property.
FASTQ reads extension is defined by fastq_ext property, e.g. could be fq, fq.gz, fastq, fastq.gz.

Input BAM files

Use start_with_bams=True config option to start with existing bam files.
Pipeline starts with BAM files in work_dir/bams folder.

Files

Path Description
config.yaml Default pipeline options
trimmed Trimmed FASTQ file, if trim_reads option is True.
bams BAMs with aligned reads, MAPQ >= 30
bw BAM coverage visualization using DeepTools
macs2 MACS2 peaks
sicer SICER peaks
span SPAN peaks
qc QC Reports
multiqc MultiQC reports for different steps
logs Shell commands logs

Requirements

The pipeline requires conda.

  • If conda is not installed, follow the instructions at Conda website.
  • Navigate to repository directory.

Create a Conda environment for snakemake:

$ conda env create --file environment.yaml --name snakemake

Activate the newly created environment:

$ source activate snakemake

On Ubuntu please ensure that gawk is installed:

$ sudo apt-get install gawk

Launch

Run the pipeline to start with fastq reads:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> --rerun-incomplete

The Default pipeline doesn't perform coverage visualization and launch peak callers. Please add bw=True, macs2=True, sicer=True, span=True to create coverage bw files and call peaks.

To launch MACS2 in --broad mode, use the following config:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
    --rerun-incomplete

See config.yaml for a complete list of parameters. Use--config to override default options from config.yaml file.

Rules

Rules DAG produced with additional command line agruments --forceall --rulegraph | dot -Tpdf > rules.pdf

Rules

Computational cluster QSUB/LFS/QSUB

Configure profile for required cluster system with name cluster.

$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.git

Example of ATAC-Seq processing on qsub

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --directory <work_dir> \
    --profile cluster --cluster-config cluster_config.yaml --jobs 150 \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    bowtie2_params="-X 2000 --dovetail" \
    macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
    span=True span_fragment=0 span_bg_sensitivity=1.0 span_clip=0.4 --rerun-incomplete

P.S: Use --config to override default options from config.yaml file

Try with test data

Please download example fastq.gz files from CD14_chr15_fastq folder.
These files are filtered on human hg19 chr15 to reduce size and make computations faster.

Launch chipseq-smk-pipeline:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --cores all --directory <work_dir> \
    --config fastq_ext=fastq.gz fastq_dir=<work_dir> genome=hg19 macs2=True sicer=True span=True \
    --rerun-incomplete

Useful links