@nneune Great job as always! Technically looks good. I assume Emma or Richard will review the science. My only complaint is a few typos (see below), and that the dataset path might not accommodate future needs: there will permanently be only one dataset under

I've been trying these new AI-assisted dataset reviews. Posting in case useful:
Testing

Try in Nextclade Web:

Science

Background on the pathogen, its classification, epidemiology, and the reference strains used in this dataset. Provides context for evaluating dataset design decisions.

Coxsackievirus A16 biology and classification

Coxsackievirus A16 (CVA16) is a member of species Enterovirus A in the family Picornaviridae. It is one of the major causative agents of hand, foot, and mouth disease (HFMD), primarily affecting children under 5 (Sun et al., J Clin Microbiol 2014). The positive-sense single-stranded RNA genome is approximately 7,410 nt, encoding a single polyprotein cleaved into structural proteins (VP4, VP2, VP3, VP1) and non-structural proteins (2A, 2B, 2C, 3A, 3B, 3C, 3D), flanked by 5' and 3' UTRs (Xu et al., Front Microbiol 2025).

The reference genome in this dataset, G-10 (GenBank U05876.1, 7413 nt), is the prototype CVA16 strain isolated in South Africa in 1951. It represents genotype A, the sole member of this clade (Sun et al., J Clin Microbiol 2014). As the README correctly notes, G-10 differs substantially from currently circulating strains, which belong to genotype B sublineages. VP1 is the standard molecular target for enterovirus typing and subgenogroup classification (Sun et al., J Clin Microbiol 2014), consistent with the dataset's

Subgenogroup classification

CVA16 phylogeny based on VP1 defines genotypes A, B, and D (Sun et al., J Clin Microbiol 2014). Genotype B is divided into B1 and B2, with B1 further split into B1a, B1b, and B1c clusters at 6.6-8.0% genetic distance (Zeng et al., Viruses 2025). Recombinant forms (sometimes labeled C-F) are described in Han et al., Virus Evol 2024.

The dataset tree includes clades: A, B1, B1a, B1b, B1c, C, D, E, F, RFs, and unassigned. The README describes clades C-F as "recombinant forms" that cluster with the prototype strain (clade A), also known as B2, B3, and D in alternative nomenclatures.
This is consistent with the recombination-driven genotype evolution described in Han et al., Virus Evol 2024.

B1a and B1b co-circulated globally for decades; B1b became dominant in some regions after 2020 (Xu et al., Front Microbiol 2025). B1c, first reported in Southeast Asia and Europe after 2000, has surged in China since 2023-2024 (Zeng et al., Viruses 2025) and was detected in Thailand for the first time in 2023 (Taoma et al., Microbiol Resour Announc 2025). The tree reflects this diversity with B1a (733 nodes), B1b (483), B1c (195), B1 (80), and smaller representation of D (43), C (5), RFs (4), F (3), E (3), and A (1).

ENPEN and enterovirus Nextclade datasets

The European Non-Polio Enterovirus Network (ENPEN), under the European Society for Clinical Virology, coordinates enterovirus surveillance across 20+ European countries (Harvala et al., Microorganisms 2021). A 2025 study in The Lancet Regional Health - Europe analyzed 63,659 samples from 48 countries (2015-2022), with ENPEN contributing 85% of typed non-polio enterovirus data (Harvala et al., Lancet Reg Health Eur 2025).

This CVA16 dataset is the second ENPEN enterovirus dataset for Nextclade, following EV-D68. The build pipeline is available at enterovirus-phylo/nextclade_a16, adapted from the EV-D68 pipeline template. The same team (Neuner-Jehle, Gonzalez-Sanchez, Hodcroft) maintains both datasets.

Blocking issues

Issues affecting scientific correctness, data integrity, or user-facing accuracy. These block adoption of the dataset until addressed.

🔴 H1. Dataset path naming inconsistent with sibling dataset

The dataset path The EV-D68 dataset is already released (3 versions) and its flat convention (no reference accession suffix) is locked. The reference suffix question is moot for CVA16 if ENPEN follows the same pattern.

Effect: Once released, the path cannot be renamed.
Fix: Consider

Non-blocking issues

Cosmetic issues, minor inconsistencies, and documentation improvements. Fix if time allows.

🟡 M1. Typo "Cocksackievirus" in README

Fix: Change "Cocksackievirus" to "Coxsackievirus" on line 28.

🟡 M2. `reference name` attribute duplicates pathogen name

In The EV-D68 dataset uses the FASTA header description: The

Fix: Change

🟡 M3. Consider `experimental` flag given testing status

The PR description states: "It needs to be tested further by ENPEN and others." The dataset has no If the dataset is intended for broader testing before full release, setting

Fix: Consider adding

🟡 M4. Three files missing trailing newlines

The following files lack a trailing newline:

Fix: Add a trailing newline to each file.

🔵 L1. No citation section in README

The EV-D68 README includes a "Citation" section with a recommended citation. The CVA16 README omits this.

Fix: Add a citation section referencing the dataset authors and the workflow repository, consistent with the EV-D68 README.

Notes

Observations that require no action: correct design decisions, positive patterns, comparisons with related work, and future improvement ideas.
Nextclade CLI run

Nextclade CLI run via Docker (

Reference sequence

The reference (

Example sequences (35 total)

Clade assignments: B1b (12), B1a (9), B1c (7), B1 (5), C (1), D (1). All 35 received a clade assignment.

Private mutations: min=0, max=377, mean=51.5. 5/35 exceed the threshold of 120. Three sequences (PX448982, PX448985, PX448978) have >200 private mutations, suggesting they are divergent from the nearest tree node.

Frameshifts: 4/35 have frameshifts. No

These are large frameshifts spanning most of each CDS. They are more likely sequencing artifacts or incomplete sequences than biological frameshifts. If they are expected in the example set, adding

SNP clusters: 5/35 flagged as bad, 3/35 as mediocre. The

Stop codons: 1/35 has a premature stop codon.

Missing data: 0 across all 35 sequences.
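
For anyone who wants to reproduce the per-sequence tallies above from a Nextclade CLI run, here is a minimal sketch in Python. It is not the script used for this review. The column names (`seqName`, `clade`, `qc.privateMutations.total`, `qc.frameShifts.totalFrameShifts`) are assumptions about the Nextclade TSV output and should be checked against the actual file header; the sample rows are hypothetical. Only the threshold of 120 comes from the review.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Hypothetical rows for illustration; real input would be the TSV
# written by a Nextclade CLI run on the example sequences.
SAMPLE_TSV = (
    "seqName\tclade\tqc.privateMutations.total\tqc.frameShifts.totalFrameShifts\n"
    "seq1\tB1a\t0\t0\n"
    "seq2\tB1b\t130\t0\n"
    "seq3\tB1b\t377\t2\n"
    "seq4\tB1c\t12\t0\n"
)


def summarize(tsv_text, threshold=120,
              clade_col="clade",
              private_col="qc.privateMutations.total",
              frameshift_col="qc.frameShifts.totalFrameShifts"):
    """Tally clades, private-mutation statistics, and frameshifted sequences.

    Column names are assumptions about the TSV header; pass different
    names if the file uses other ones.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Empty cells are treated as zero.
    private = [float(r[private_col] or 0) for r in rows]
    return {
        "n": len(rows),
        "clades": Counter(r[clade_col] or "unassigned" for r in rows),
        "private_min": min(private),
        "private_max": max(private),
        "private_mean": mean(private),
        "over_threshold": [r["seqName"] for r, p in zip(rows, private)
                           if p > threshold],
        "with_frameshifts": [r["seqName"] for r in rows
                             if int(float(r[frameshift_col] or 0)) > 0],
    }


summary = summarize(SAMPLE_TSV)
print(summary)
```

To run it on real output, read the TSV file into a string and pass it to `summarize`, adjusting the column arguments if the header differs.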
Claude is being a bit too dramatic here 😆

Oh wow, the AI did a great job at summarizing the science behind CVA16. For the "dataset path naming" inconsistency, CVA16 is actually the official name and not CV-A16 (see Simmonds et al., 2020). I'll correct the typos!
nneune
left a comment
The same issues exist as with the EV-D68 dataset. Divergence is not a valid QC label, and `nucMutLabelMapReverse` is deprecated.
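
A check like the one described could be sketched as follows, in Python. This is an illustrative lint and not part of Nextclade. The set of recognized QC rule names reflects my understanding of the `qc` section of `pathogen.json` and is an assumption, as is tolerating a `schemaVersion` key there. The example config, including the key spelling `divergence`, is hypothetical. The key `nucMutLabelMapReverse` is searched for anywhere in the file so that no assumption about its location is needed.

```python
import json

# Assumed QC rule names for the `qc` section of pathogen.json.
KNOWN_QC_RULES = {
    "privateMutations", "missingData", "snpClusters",
    "mixedSites", "frameShifts", "stopCodons",
}
# Key reported as deprecated in the review comment.
DEPRECATED_KEYS = ["nucMutLabelMapReverse"]


def find_key_paths(node, target, path=""):
    """Return dotted paths of every occurrence of `target` as a dict key."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == target:
                hits.append(child)
            hits.extend(find_key_paths(value, target, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits.extend(find_key_paths(value, target, f"{path}[{i}]"))
    return hits


def lint_pathogen_json(config):
    """List unknown QC rules and deprecated keys found in a parsed config."""
    problems = []
    for key in config.get("qc", {}):
        # `schemaVersion` is tolerated here as an assumption.
        if key not in KNOWN_QC_RULES and key != "schemaVersion":
            problems.append(f"unknown QC rule: {key}")
    for target in DEPRECATED_KEYS:
        for hit in find_key_paths(config, target):
            problems.append(f"deprecated field: {hit}")
    return problems


# Hypothetical config fragment, not the dataset's actual pathogen.json.
example = json.loads("""
{
  "qc": {
    "privateMutations": {"enabled": true},
    "divergence": {"enabled": true}
  },
  "mutLabels": {
    "nucMutLabelMap": {},
    "nucMutLabelMapReverse": {}
  }
}
""")

problems = lint_pathogen_json(example)
for problem in problems:
    print(problem)
```

Running it against the real file only requires replacing `example` with `json.load(open(path))` for the dataset's `pathogen.json`.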
This pull request introduces a new Nextclade dataset for Coxsackievirus A16 (CVA16), based on the reference strain "G-10". It provides all the essential files and documentation required for lineage classification, phylogenetic analysis, and quality control of CVA16 sequences. The dataset is tailored for broad subgenogroup assignment and includes detailed metadata, genome annotation, and configuration for Nextclade compatibility.
It needs to be tested further by ENPEN and others.