See readthedocs for the full documentation of the pipeline.
quicksand (quick analysis of sedimentary ancient DNA) is an open-source Nextflow pipeline designed for rapid and accurate taxonomic classification of mammalian mitochondrial DNA (mtDNA) in aDNA samples. quicksand combines fast alignment-free classification using KrakenUniq with downstream mapping (BWA), post-classification filtering, and ancient DNA authentication. quicksand is optimized for speed and portablity and requires either Singularity or Docker.
To run Nextflow, you need a POSIX-compatible system (e.g., Linux or macOS). quicksand was developed and tested on Linux (x86_64 architecture)
To run quicksand, please install
- Nextflow v22.10 or larger
- Singularity or Docker
Note: To run quicksand in singularity, your kernel needs to support user-namespaces (see here or here).
The input for quicksand is a directory with user-supplied files in BAM or FASTQ format. Adapter-trimming, overlap-merging and sequence demultiplexing need to be performed by the user prior to running quicksand. Provide the directory with the --split flag
Caution
Each input-file should correspond to a single sequence-library. The processing of merged libraries with quicksand can lead to sequence loss because of the PCR-deduplication step with bam-rmdup
As a test file, download a mammalian mtDNA capture library from Denisova Cave Layer 20 (published in Zavala et al. 2021)
wget -P split \
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR564/ERR5640810/A20896.bamThe required KrakenUniq database, the reference genomes for mapping and the bed-files for low-complexity filtering are available on the MPI EVA FTP Servers. Custom versions of the reference material can be created with the quicksand-build pipeline
For the quickstart of quicksand, create a small test-database containing only the Hominidae, Bovidae and Hyaenidea mtDNA reference genomes (~150 genomes, runtime: ~3-5 minutes, size ~5GB).
(to reduce database size and runtime, use only --include Hominidae)
nextflow run mpieva/quicksand-build -r v3.1 \
--include Hominidae,Bovidae,Hyaenidae \
--outdir refseq \
-profile singularityTo download the full reference database (~60GB), use this command:
latest=$(curl http://ftp.eva.mpg.de/quicksand/LATEST)
wget -r -np -nc -nH --cut-dirs=3 --reject="*index.html*" -q --show-progress -P refseq http://ftp.eva.mpg.de/quicksand/build/$latestThis can take several hours! For testing quicksand its recommended to build a small database (see above)
quicksand is executed directly from github. With the test-database created and the test-file downloaded, run the pipeline as follows:
# set this if you encounter a heap-space error to increase the memory that is used by nextflow
export NXF_OPTS="-Xms10g -Xmx15g" # increase or decrease the numbers as required
nextflow run mpieva/quicksand -r v2.5 \
--db refseq/kraken/Mito_db_kmer22/ \
--genomes refseq/genomes/ \
--bedfiles refseq/masked/ \
--split split/ \
-profile singularity #mind the single dash!Please see the documentation for a comprehensive description of the output files and structure!
The main summary table (final_report.tsv) contains one line per input file and detected family (passing the --krakenuniq_min_kmers and --krakenuniq_min_reads cutoff). The following columns are reported:
- RG: Name of the file analyzed
- ReadsRaw: Raw number of sequences in the file (paired reads only counted once)
- ReadsFiltered Number of sequences after filtering for
--bamfilterflag - ReadsLengthfiltered: Number of sequences after additonal filtering for sequence length (
--bamfilter_length_cutoff) - Kmers: KrakenUniq: Number of unique kmers used for classification (format: "best" and "(family)")
- KmerCoverage: KrakenUniq: Kmer coverage for that classification (format: "best" and "(family)")
- KmerDupRate: KrakenUniq: Kmer duplication rate for that classification (format: "best" and "(family)")
- ExtractLVL: "f" (family) or "o" (order), set by
--taxlvl - ReadsExtracted: Number of sequences assigned by KrakenUniq
- Order: Detected Order
- Family: Detected Family
- Species: The reference-genome used for the mapping of 'ReadsExtracted'
- Reference: "best" or "fixed" (
--fixed) - ReadsMapped: Number of sequences mapped, passing the mapping quality-cutoff(
--mapbwa_quality_cutoff) - ProportionMapped: ReadsMapped / ReadsExtracted
- ReadsDeduped: Number of unique sequences (removed PCR duplicates)
- DuplicationRate: ReadsMapped / ReadsDeduped
- CoveredBP:
covbasesstat of thesamtools coveragecommand. The number of bases covered in the reference genome by mapped sequences (max: ~17000) - ReadsBedfiltered: Number of unique sequences after applying low-complexity bed-filtering
- PostBedCoveredBP: Number of bases covered in the reference genome by bedfiltered sequences
- FamPercentage: Percentage of unique sequences in the alignment from the total number of unique sequences in the sample (For PSF-filter)
- Ancientness: "-", "+" or "++". Significance level of the C-to-T or G-to-A deamination rate in the alignment (for G-to-A, use
--doublestranded) - ReadsDeam(1term): Number of unique sequences with a C-to-T or G-to-A substitution in the terminal base positions (for G-to-A, use
--doublestranded) - ReadsDeam(3term): Number of unique sequences with a C-to-T or G-to-A substitution in the terminal three positions (for G-to-A, use
--doublestranded) - Deam5(95ci): The terminal C-to-T substitution-rate (and the 95% confidence interval) for the 5' end
- Deam3(95ci): The terminal C-to-T or G-to-A substitution-rate (and the 95% confidence interval) for the 3' end (for G-to-A, use
--doublestranded) - Deam5Cond(95ci): The terminal C-to-T substitution-rate (and the 95% confidence interval) for the 5' end, conditioned on a substitution at the opposite end
- Deam3Cond(95ci): The terminal C-to-T (or G-to-A) substitution-rate (and the 95% confidence interval) for the 3' end, conditioned on a substitution at the opposite end
- MeanFragmentLength: Mean fragment length of all unique sequences
- MeanFragmentLength(3term): Mean fragment length of all 'ancient' sequences
- Coverage:
meandepthof thesamtools coveragecommand. Corresponds to the depth of coverage in the alignment. - Breadth:
coverageof thesamtools coveragecommand / 100: The proportion of bases covered by mapped sequences in the reference genome - ExpectedBreadth: Expected breadth based on 'Coverage'. Calculated using the formula: ExpectedBreadth = 1 - e^(-0.833 × Coverage)
- ProportionExpectedBreadth: Breadth / ExpectedBreadth (For PEB-filter)
A collection of common nextflow-errors and how to solve them
-- Check '.nextflow.log' file for details
ERROR ~ Java heap space
Heap space errors can occur if nextflow itself requires more memory than provided by default (e.g. when screening too many samples in parallel). You can increase the heap-space as needed (e.g., to 5gb) with
export NXF_OPTS="-Xms5g -Xmx5g"
If you use quicksand in your research, please cite the quicksand publication as follows:
Szymanski, Merlin, Johann Visagie, Frederic Romagne, Matthias Meyer, and Janet Kelso.
“quick analysis of sedimentary ancient DNA using quicksand”, Molecular Biology and Evolution, 2025.
https://doi.org/10.1093/molbev/msaf305.
This pipeline uses code inspired by the nf-core initative, reused here under the MIT license.
Ewels, Philip A., Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso, and Sven Nahnsen. 2020. “The Nf-core Framework for Community-curated Bioinformatics Pipelines”. Nature Biotechnology 38 (3): 276–78. https://doi.org/10.1038/s41587-020-0439-x.
