Genotyping Tools

Robert J. Gifford edited this page Oct 11, 2024 · 8 revisions

Introduction to Maximum Likelihood Clade Assignment (MLCA)

HCV-GLUE assigns HCV sequences to genotypes and subtypes using a phylogenetic method called Maximum Likelihood Clade Assignment (MLCA).

MLCA is based on the Evolutionary Placement Algorithm (EPA), a feature of the RAxML software. RAxML is typically used to infer complete phylogenetic trees from multiple sequence alignments, but EPA instead places new sequences onto an existing reference tree without recalculating the entire phylogeny. This efficiency makes EPA well-suited to clade assignment for virus sequences, and it forms the foundation of the MLCA method integrated into GLUE.

Example Usage in HCV-GLUE

The genotyping process in HCV-GLUE can be executed through the command-line interface. Below is an example of using the MLCA genotyping module within HCV-GLUE:

Mode path: /project/hcv
GLUE> module maxLikelihoodGenotyper genotype file -f WebContent/exampleSequences/exampleSequences.fasta

This command processes the sequences in the specified FASTA file and outputs the assigned genotype and subtype clades for each sequence:

+===========+====================+===================+
| queryName | genotypeFinalClade | subtypeFinalClade |
+===========+====================+===================+
| EF407428  | AL_1               | AL_1a             |
| KT735183  | AL_3               | AL_3a             |
+===========+====================+===================+

In this example, the sequence EF407428 is assigned to genotype AL_1 and subtype AL_1a, while sequence KT735183 is assigned to genotype AL_3 and subtype AL_3a.
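
If the table needs to be consumed by a downstream script, it can be parsed as plain text. The following is a minimal Python sketch that assumes the output has been captured exactly in the bordered layout shown above; the function name parse_glue_table is hypothetical and is not part of GLUE.

```python
# Sample text copied from the console output shown above.
EXAMPLE_OUTPUT = """\
+===========+====================+===================+
| queryName | genotypeFinalClade | subtypeFinalClade |
+===========+====================+===================+
| EF407428  | AL_1               | AL_1a             |
| KT735183  | AL_3               | AL_3a             |
+===========+====================+===================+
"""


def parse_glue_table(text):
    """Parse a bordered console table into a list of dicts keyed by column name.

    Hypothetical helper: assumes rows start with '|' and border lines start with '+'.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    for row in parse_glue_table(EXAMPLE_OUTPUT):
        print(row["queryName"], row["genotypeFinalClade"], row["subtypeFinalClade"])
```

Parsing console text is fragile if the layout changes, so this approach is best treated as a convenience for small, ad hoc checks.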

The MLCA Algorithm

The MLCA algorithm operates in three stages: alignment, placement, and neighbor-weighting.

  1. Alignment Stage:
    The query sequences are first aligned to a reference alignment of HCV sequences using MAFFT with the --add and --keeplength options. These options add each query to the existing multiple sequence alignment without changing its length or column structure. Each query sequence is aligned independently, so the alignment of one query cannot influence that of another.

  2. Placement Stage:
    In the placement stage, the extended alignment from the previous step is combined with a fixed reference tree. For each query sequence, the algorithm identifies potential placements on the tree that maximize the likelihood of the extended tree structure. Using RAxML's EPA subsystem, the algorithm inserts the query sequence at various points on the tree, optimizing the branch lengths and positions to find the most likely placements. A small set of high-likelihood placements is retained for further analysis.

  3. Neighbor-Weighting Stage:
    The final stage, neighbor-weighting, summarizes the placement results by calculating clade weightings for each query sequence. The algorithm evaluates the evolutionary distance between the query sequence and its closest neighboring reference sequences. Since these neighbors are already assigned to specific clades, their proximity provides evidence for the query sequence's clade assignment: the closer the neighbor, the stronger the evidence. The query sequence is assigned to a clade only if that clade's calculated weight exceeds a predefined threshold.

    This neighbor-weighting mechanism relies on the evolutionary distances in the phylogenetic tree, where shorter branch lengths indicate closer genetic relationships. By focusing on nearby reference sequences, the algorithm effectively assigns query sequences to the most appropriate clades based on genetic similarity.
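
The neighbor-weighting idea can be illustrated with a small Python sketch. This is not the GLUE implementation: the inverse-distance weighting, the normalization to fractions, the default threshold of 0.5, and the function names are all assumptions chosen only to show how closer neighbors can contribute more evidence and how a threshold can gate the final assignment.

```python
from collections import defaultdict


def clade_weights(neighbors):
    """Return each clade's share of the total neighbor weight.

    neighbors: iterable of (clade_name, distance) pairs for the reference
    sequences closest to the query. The inverse-distance weighting used
    here is an assumed scheme, not the formula used by GLUE.
    """
    totals = defaultdict(float)
    for clade, distance in neighbors:
        # Closer neighbors get larger weights; the small constant avoids
        # division by zero for a neighbor at distance 0.
        totals[clade] += 1.0 / (distance + 1e-6)
    grand_total = sum(totals.values())
    return {clade: weight / grand_total for clade, weight in totals.items()}


def assign_clade(neighbors, threshold=0.5):
    """Return the best-supported clade, or None if no clade exceeds the threshold."""
    weights = clade_weights(neighbors)
    if not weights:
        return None
    best = max(weights, key=weights.get)
    return best if weights[best] > threshold else None


if __name__ == "__main__":
    # Illustrative distances only; these are not real placement results.
    example_neighbors = [("AL_1a", 0.02), ("AL_1a", 0.03), ("AL_1b", 0.20)]
    print(clade_weights(example_neighbors))
    print(assign_clade(example_neighbors))
```

Returning None when no clade passes the threshold mirrors the behavior described above, where a query is only assigned to a clade with sufficient supporting weight.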

Benefits of Using MLCA for HCV Genotyping

MLCA gives HCV-GLUE an accurate and computationally efficient genotyping method. Because the EPA feature of RAxML places each query on a fixed reference tree rather than inferring a new phylogeny, the method scales to large sequence sets, making it well-suited to both research and clinical settings.