transcluster

A Snakemake pipeline to find SARS-CoV-2 clusters in a reference phylogeny.

It requires an input directory (INPUT_DIR parameter, default: input/) containing at least one .txt file with a list of sample IDs (one per line). Each ID list is evaluated separately, possibly in parallel.

The sequence records from each list of target samples are extracted from a FASTA file containing an aligned sequence dataset (provided through the ALIGNED_FULL_FASTA parameter), and placed on a reference phylogeny in Newick format (REFERENCE_TREE parameter), using UShER. Sample identifiers are transformed beforehand so that samples are deduplicated automatically.

Then, the pipeline looks for clusters using a breadth-first search approach, skipping clades that do not conform to the parametrized clustering criteria (MIN_SIZE and MIN_PROP parameters, default: ≥2 members and ≥90% proportion of target samples), using tcfinder.

Additionally, a normalized clustering index is estimated for each cluster by calculating the the ratio of the number of tips in the cluster to the number of background sequences (by default, those deposited in GISAID) during the same time period. The metric can be interpreted as an estimation of the transmission fitness in certain cases. These "background" samples include those collected between the first and last case within the cluster. The time window is symmetrically padded with a specified number of days (FITNESS_PADDING_DAYS parameter, default: 0 and 7). The denominator helps control for the effect of uneven sequencing efforts across different times and locations, and is calculated in two ways:

Adding the number of background samples from all the countries where the cluster was observed.
Slicing the cluster into each country where the cluster was observed, separately calculating the background samples for each country with its different time windows, and then adding them together.

Finally, a series of figures (including sample timelines, comparisons of normalized clustering index and missing data), are generated. Results are written to a subdirectory within the output directory (OUTPUT_DIR parameter, default: output/).

All parameters have a default value except for ALIGNED_FULL_FASTA and METADATA, which must be provided by the user. Parameters can be set and overridden by modifying the config.yaml file or through the --config command line argument (see the Snakemake docs).

Usage

Local execution

The workflow requires conda/mamba and snakemake. The rest of dependencies are defined in YAML environment files and can be automatically installed upon execution through the command line argument --use-conda.

To run the pipeline using 4 cores, with the default workflow profile:

snakemake --use-conda -c 4

Cluster or cloud execution

The workflow can be easily run on a cluster or in the cloud through Snakemake executor plugins. Further configuration can be set using Snakemake profiles.

We provide the run-slurm.sh Slurm batch script as an example. It requires the slurm executor plugin. Note that the SBATCH options and the profile definition might need to be modified to meet the user's needs depending on the particular cluster.

Configuration parameters

`ALIGNED_FULL_FASTA`

Path to the aligned sequence dataset in FASTA format. For each set of target samples, sequences are sourced from this file according to their ID. Sequences must be aligned to REFERENCE_FASTA.

`METADATA`

Path to the GISAID metadata of the input sequences in tabular format.

`METADATA_COLS`

`FULL_METADATA`

This section defines the GISAID metadata column names. GISAID may update their column names over time, so these settings ensure compatibility.

ORIGINAL_ID: the sample identifier as it appears in the sequence record in the FASTA file (default: "Virus name").
UNIQUE_ID: a unique identifier guaranteed to be specific to each sample (default: "Accession ID", the EPI_ISL IDs).
DATE: date when the sample was collected (default: "Collection date").
LOCATION: location where the sample was collected, either in GISAID-like format or just a single geographic denomination (default: "Location").
GENDER: gender of the host or patient from whom the sample was collected (default: "Gender").
LINEAGE: viral lineage assigned to the sample (default: "Pango lineage").
AGE: age of the host or patient from whom the sample was collected (default: "Patient age").

`SEPARATE_LOCATION`

Boolean that decides whether to separate GISAID-like location column (/-separated) or not.

`REFERENCE_TREE`

URL or path pointing to a reference phylogeny to place the target samples in.

Select a reference tree containing the appropriate GISAID sequences up to a date by providing the following REFERENCE_TREE (replace YYYY, MM and DD according to the desired threshold date):

https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/YYYY/MM/DD/public-YYYY-MM-DD.all.masked.pb.gz

Alternatively, select a locally stored tree in UShER protobuf format by providing its path.

`PROBLEMATIC_VCF`

Path or URL pointing to a VCF file containing problematic sites.

`REFERENCE_FASTA`

Path to the reference sequence in FASTA format.

`INPUT_DIR` and `OUTPUT_DIR`

Path to the input and output directories, respectively.

`MIN_PROP` and `MIN_SIZE`

Thresholds for cluster detections: the minimum proportion of target samples (from 0 to 1, inclusive) and the minimum number of samples within a cluster, respectively.

`SPACE_REPLACEMENT`

A string used to replace spaces in FASTA records to avoid truncation of sequence descriptors during the analysis.

`FITNESS_PADDING_DAYS`

A list of integers that specify the number of days to pad the normalized clustering index denominator time windows with.

Testing the pipeline

The workflow can be run with test data by executing test-local.sh.

Citation

This tool was originally developed to support this work:

Álvarez-Herrera M, Ruiz-Rodriguez P, Navarro-Domínguez B, Zulaica J, Grau B, Bracho MA, Guerreiro M, Aguilar-Gallardo C, González-Candelas F, Comas I, Geller R & Coscollá M (2025). Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein. Virus Evolution, 11(1), veaf015. https://doi.org/10.1093/ve/veaf015

If you find the tool helpful, please feel free to cite the above (see also CITATION.cff).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

transcluster

Usage

Local execution

Cluster or cloud execution

Configuration parameters

`ALIGNED_FULL_FASTA`

`METADATA`

`METADATA_COLS`

`FULL_METADATA`

`SEPARATE_LOCATION`

`REFERENCE_TREE`

`PROBLEMATIC_VCF`

`REFERENCE_FASTA`

`INPUT_DIR` and `OUTPUT_DIR`

`MIN_PROP` and `MIN_SIZE`

`SPACE_REPLACEMENT`

`FITNESS_PADDING_DAYS`

Testing the pipeline

Citation

About

Uh oh!

Releases 3

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 180 Commits
.github/workflows		.github/workflows
config		config
data		data
docs		docs
profiles		profiles
test		test
workflow		workflow
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
run-slurm.sh		run-slurm.sh
test-local.sh		test-local.sh

License

PathoGenOmics-Lab/transcluster

Folders and files

Latest commit

History

Repository files navigation

transcluster

Usage

Local execution

Cluster or cloud execution

Configuration parameters

ALIGNED_FULL_FASTA

METADATA

METADATA_COLS

FULL_METADATA

SEPARATE_LOCATION

REFERENCE_TREE

PROBLEMATIC_VCF

REFERENCE_FASTA

INPUT_DIR and OUTPUT_DIR

MIN_PROP and MIN_SIZE

SPACE_REPLACEMENT

FITNESS_PADDING_DAYS

Testing the pipeline

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Uh oh!

Languages

`ALIGNED_FULL_FASTA`

`METADATA`

`METADATA_COLS`

`FULL_METADATA`

`SEPARATE_LOCATION`

`REFERENCE_TREE`

`PROBLEMATIC_VCF`

`REFERENCE_FASTA`

`INPUT_DIR` and `OUTPUT_DIR`

`MIN_PROP` and `MIN_SIZE`

`SPACE_REPLACEMENT`

`FITNESS_PADDING_DAYS`