Skip to content

DucoG/PEsWGS-alignment-snakemake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Paired-end Shallow Whole Genome Sequencing Snakemake Pipeline

This is a snakemake pipeline for the analysis of paired-end shallow whole genome sequencing data. The pipeline is designed to work with raw fastq files generated from Illumina HiSeq or NovaSeq. It includes the following steps:

  • Align reads to a reference genome using BWA
  • Index the resulting BAM file
  • Mark duplicates using Picard
  • Index the marked BAM file
  • Generate fastqc report for raw, aligned, and marked reads
  • Generate multiqc report for all three reports

Setup

To use this pipeline, first clone the repository and navigate to the project directory:

git clone https://github.com/DucoG/PEsWGS-alignment-snakemake.git
cd PEsWGS-alignment-snakemake

Config File

The config_snake.yaml contains the location of the reference genome required. Please update this file with your own values before running the pipeline.

Usage

Once the environment and config file have been set up, the pipeline can be run using snakemake. Before running the entire pipeline for all your samples, it is advisable to first check if all settings are correct using a dry run using the -n option for a dry-run and -p for printing the ran commands:

snakemake -np --use-conda

To run the entire pipeline, use the following command:

snakemake --use-conda -j <NUMBER_OF_JOBS>

This will generate marked BAM files and QC reports for the data in data/raw. Your directory will look as follows:

.
├── data                        <- contains all input and output files for the pipeline
│   ├── aligned
│   ├── marked
│   ├── raw
├── logs                        <- contains log files for each step of the pipeline
│   ├── index_bam
│   └── mark_duplicates
└── qc_outputs                  <- contains output files from quality control steps
    ├── marked
    │   ├── fastqc_output
    │   └── multiqc_output
    │       └── multiqc_data
    └── raw
        ├── fastqc_output
        └── multiqc_output
            └── multiqc_data

To run a specific rule, for example, to run only the fastq2bam rule, use the following command:

snakemake data/aligned/<SAMPLE_ID>_hg38.bam --use-conda

This will generate a BAM file for the specified sample in data/aligned.

Note

This pipeline does not trim the sequences automatically. To determine if trimming is necessary, start by generating the raw multiqc report. Based on that, trimming can be done on the raw data files before aligning.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages