This repository contains the Snakemake-based workflow for implementing deep mutational scanning experiments used in the Fraser and Coyote-Maestas labs.
Briefly, the workflow performs initial QC and mapping with BBTools, then calls variants in each replicate with the GATK AnalyzeSaturationMutagenesis module. After variant calling, the observed variants in each read are filtered against the list of designed variants, and the resulting counts are used to infer the fitness of each variant with Rosace and, optionally, Enrich2.
The pipeline is designed to be flexible and modular and should be amenable to use with a variety of experimental designs. Please note several current limitations, however.
git clone https://github.com/odcambc/dumpling
cd dumpling
conda env create --file dumpling_env.yaml
conda activate dumpling_env
Note that, on ARM-based Macs, the conda environment may fail to install due to required packages not being available for that platform. Assuming that Rosetta is installed, the environment can be installed using emulation with the following command:
CONDA_SUBDIR=osx-64 conda env create --file dumpling_env.yaml
You will also need to set the "samtools_local" variable in the config YAML to "true" so that the pipeline uses a locally installed samtools (e.g., from Homebrew) rather than the conda-provided one.
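For example, the relevant entry in the config YAML would look like the following (see config/test_config.yaml and schemas/config.schema.yaml for the full set of parameters):

```yaml
# Use a locally installed samtools (e.g., from Homebrew) instead of the
# conda-provided one. Only needed when the conda package is unavailable,
# such as on ARM-based Macs.
samtools_local: true
```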
If the environment installed and activated properly, edit the configuration files in the config directory as needed. Then run the pipeline with:
snakemake -s workflow/Snakefile --software-deployment-method conda --cores 16
Download or fork this repository and edit the configuration files as needed.
This pipeline uses the Rosace scoring tool. Rosace uses CmdStanR and R to infer scores.
Dumpling uses renv to handle R dependencies.
This pipeline also includes a minimal facility to install Rosace automatically, although issues are possible. It can be invoked by calling the install_rosace rule:
snakemake --cores 8 install_rosace
This tries to install renv, restore the renv environment, and install Rosace and CmdStanR. If this fails, please try installing Rosace manually.
We recommend installing Rosace manually before running the pipeline, or at least verifying that the install script works. More details about manually installing Rosace are available in the vignettes of the package and at the repository linked above.
The simplest way to handle dependencies is with Conda and the provided environment file.
conda env create --file dumpling_env.yaml
This will create a new environment named dumpling_env with all the dependencies installed. Then simply activate the environment and you're ready to go.
conda activate dumpling_env
The full list of dependencies required to run the pipeline is specified in dumpling_env.yaml.
The details of an experiment need to be specified in a configuration file that defines parameters and an associated experiment file that details the experimental setup.
The configuration file is a YAML file: full details are included in the example file config/test_config.yaml and in the schema file schemas/config.schema.yaml.
The experiment file is a CSV file that relates experimental conditions, replicates, and time points to sequencing files: full details are included in the config file and in the schema file schemas/experiments.schema.yaml.
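The authoritative column set is defined in schemas/experiments.schema.yaml. As an illustration only (the column names below are hypothetical, not taken from the schema), a small script can sanity-check that each row of such a sheet maps a condition, replicate, and time point to a sequencing file:

```python
import csv
import io

# Hypothetical experiment sheet: these column names are illustrative only;
# consult schemas/experiments.schema.yaml for the real required fields.
experiment_csv = """condition,replicate,time,file
selected,1,0,sample_S1_R1.fastq.gz
selected,1,1,sample_S2_R1.fastq.gz
"""

REQUIRED = {"condition", "replicate", "time", "file"}

def check_experiment_sheet(text):
    """Parse an experiment CSV, raising if required columns are missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)

rows = check_experiment_sheet(experiment_csv)
print(len(rows))  # number of sequencing files described → 2
```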
Additionally, a reference FASTA file is required for mapping. This should be placed in the references directory, and the path to the file should be specified in the config file.
This pipeline also employs a processing step to standardize variant nomenclature and remove any variants that are not designed or are likely errors. This requires a CSV file containing the set of designed variants, including their specific codon changes. This file should be placed in the config/designed_variants directory, and its path should be specified in the config file. An example file is included in config/designed_variants/test_variants.csv. This pipeline can generate the variants CSV from the set of oligos produced by the DIMPLE library generation protocol: enable this by including the path to the oligo CSV file in the config file and setting regenerate_variants to True in the config.
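As a sketch, the DIMPLE-based regeneration might be configured along these lines; "oligo_file" and "variants_file" are illustrative key names, so check schemas/config.schema.yaml for the actual parameter names:

```yaml
# Regenerate the designed-variants CSV from a DIMPLE oligo pool.
# Key names below (other than regenerate_variants) are illustrative only;
# see schemas/config.schema.yaml for the actual parameters.
regenerate_variants: True
oligo_file: config/oligos/test_oligos.csv
variants_file: config/designed_variants/test_variants.csv
```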
The pipeline has the following directory structure:
├── workflow
│ ├── rules
│ ├── envs
│ ├── scripts
│ └── Snakefile
├── config
│ ├── test_config.yaml
│ ├── test_config.csv
│ ├── designed_variants
│ │ └── test_variants.csv
│ └── oligos
│ └── test_oligos.csv
├── logs
│ └── ...
├── references
│ └── test_ref.fasta
├── results
│ └── ...
├── schemas
│ ├── config.schema.yaml
│ └── experiments.schema.yaml
├── stats
│ └── ...
├── resources
│ ├── adapters.fa
│ ├── sequencing_artifacts.fa.gz
│ └── ...
We normally use one instance of the pipeline for each experiment. This allows for simpler tracking and reproducibility of individual experiments: for a new dataset, fork the repo, edit the configuration files, and run the pipeline. This way, a record of the exact configuration and environment can be saved. It is possible to run multiple experiments in the same folder, but this is more difficult to reproduce.
Once the dependencies have been installed (whether via conda or otherwise) the pipeline can be run with the following command:
snakemake -s workflow/Snakefile --software-deployment-method conda --cores 8
The maximum number of cores can be specified with the --cores flag. The --software-deployment-method conda flag tells Snakemake to use conda to create the environment specified within each rule.
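Before a full run, it can be useful to preview the scheduled jobs with a dry run. These are standard Snakemake flags, not pipeline-specific options:

```shell
# -n performs a dry run (no jobs executed); -p prints the shell command
# each job would run
snakemake -s workflow/Snakefile --cores 8 -n -p
```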
The pipeline generates a variety of output files. These are organized into the following directories:
- benchmarks: details of the runtime and process usage for each rule
- logs: log files from each rule
- results: outputs from each rule (note: many of these are intermediate files and are deleted by default)
- stats: various processing statistics from each rule
- ref: mapping target files generated by BBTools
These are ignored by git by default.
A variety of stats from tool outputs are provided in the stats directory. These are aggregated using MultiQC. The aggregated reports contain:
- FastQC reports for raw reads (read counts, base quality, adapter content, etc.)
- BBTools reports:
  - BBDuk reports for adapter trimming and contamination removal
  - BBMerge reports for merging paired-end reads
  - BBMap reports for mapping reads to the reference
- GATK AnalyzeSaturationMutagenesis reports for variant calling
- Reports for variant filtering
If a baseline condition is defined, a separate baseline report is also generated.
The files are saved as stats/{experiment_name}_multiqc_report.html and stats/{experiment_name}_baseline_multiqc_report.html by default.
A starting analysis and plotting workflow is available in an associated repository: https://github.com/odcambc/dms_analysis_stub
We aim to regularly update this pipeline and continually expand its functionality. However, there are currently several known limitations.
- The pipeline is currently designed for short-read sequencing. It does not support long-read PacBio or Nanopore sequencing.
- The pipeline is currently designed for direct sequencing. It does not support barcoded sequencing.
- The pipeline is currently designed for single-site variants (including variable-length indels at a single site). It largely does not support combinatorial variants.
- The designed variant generation step is currently optimized for DIMPLE libraries. Other protocols may require the user to generate the designed variants CSV themselves.
- Rosace is designed for growth-based experiments. It is not optimized for FACS-seq experiments.
- This pipeline may not work properly if the data is on a cloud server (e.g., a Box drive) or other non-standard file system.
This workflow is described in the following publication:
- Preprint: Rao et al., 2023
- Published: Rao et al., 2024
This is licensed under the MIT license. See the LICENSE file for details.
Contributions and feedback are welcome. Please submit an issue or pull request.
For any issues, please open an issue on the GitHub repository. For questions or feedback, email Chris.