Initial upload of CVA16 dataset #412

Open
nneune wants to merge 13 commits into master from enpen-cva16

@nneune (Collaborator) commented Mar 3, 2026

This pull request introduces a new Nextclade dataset for Coxsackievirus A16 (CVA16), based on the reference strain "G-10". It provides all the essential files and documentation required for lineage classification, phylogenetic analysis, and quality control of CVA16 sequences. The dataset is tailored for broad subgenogroup assignment and includes detailed metadata, genome annotation, and configuration for Nextclade compatibility.

It needs to be tested further by ENPEN and others.

@nneune nneune had a problem deploying to refs/pull/412/merge March 3, 2026 09:49 — with GitHub Actions Failure
@nneune nneune temporarily deployed to refs/pull/412/merge March 3, 2026 09:58 — with GitHub Actions Inactive
@nneune nneune deployed to refs/pull/412/merge March 3, 2026 11:25 — with GitHub Actions Active

@ivan-aksamentov (Member)

@nneune Great job as always! Technically looks good. I assume Emma or Richard will review the science.

My only complaints are a few typos (see below) and that the dataset path might not accommodate future needs: there will permanently be only one dataset under the enpen/enterovirus/cva16 path, only for this particular strain and nothing else. But it's up to you guys to decide how to structure it.

I've been trying these new AI-assisted dataset reviews. Posting in case it's useful:


⚠️ AI-generated content below. Verify all claims.

Testing

Try in Nextclade Web:

Science

Background on the pathogen, its classification, epidemiology, and the reference strains used in this dataset. Provides context for evaluating dataset design decisions.

Coxsackievirus A16 biology and classification [click to expand]

Coxsackievirus A16 (CVA16) is a member of species Enterovirus A in the family Picornaviridae. It is one of the major causative agents of hand, foot, and mouth disease (HFMD), primarily affecting children under 5 (Sun et al., J Clin Microbiol 2014). The positive-sense single-stranded RNA genome is approximately 7,410 nt, encoding a single polyprotein cleaved into structural proteins (VP4, VP2, VP3, VP1) and non-structural proteins (2A, 2B, 2C, 3A, 3B, 3C, 3D), flanked by 5' and 3' UTRs (Xu et al., Front Microbiol 2025).

The reference genome in this dataset, G-10 (GenBank U05876.1, 7413 nt), is the prototype CVA16 strain isolated in South Africa in 1951. It represents genotype A, the sole member of this clade (Sun et al., J Clin Microbiol 2014). As the README correctly notes, G-10 differs substantially from currently circulating strains, which belong to genotype B sublineages.

VP1 is the standard molecular target for enterovirus typing and subgenogroup classification (Sun et al., J Clin Microbiol 2014), consistent with the dataset's defaultCds: "VP1" setting.

Subgenogroup classification [click to expand]

CVA16 phylogeny based on VP1 defines genotypes A, B, and D (Sun et al., J Clin Microbiol 2014). Genotype B is divided into B1 and B2, with B1 further split into B1a, B1b, and B1c clusters at 6.6-8.0% genetic distance (Zeng et al., Viruses 2025). Recombinant forms (sometimes labeled C-F) are described in Han et al., Virus Evol 2024.

The dataset tree includes clades: A, B1, B1a, B1b, B1c, C, D, E, F, RFs, and unassigned. The README describes clades C-F as "recombinant forms" that cluster with the prototype strain (clade A), also known as B2, B3, and D in alternative nomenclatures. This is consistent with the recombination-driven genotype evolution described in Han et al., Virus Evol 2024.

B1a and B1b co-circulated globally for decades; B1b became dominant in some regions after 2020 (Xu et al., Front Microbiol 2025). B1c, first reported in Southeast Asia and Europe after 2000, has surged in China since 2023-2024 (Zeng et al., Viruses 2025) and was detected in Thailand for the first time in 2023 (Taoma et al., Microbiol Resour Announc 2025). The tree reflects this diversity with B1a (733 nodes), B1b (483), B1c (195), B1 (80), and smaller representation of D (43), C (5), RFs (4), F (3), E (3), and A (1).

ENPEN and enterovirus Nextclade datasets [click to expand]

The European Non-Polio Enterovirus Network (ENPEN), under the European Society for Clinical Virology, coordinates enterovirus surveillance across 20+ European countries (Harvala et al., Microorganisms 2021). A 2025 study in The Lancet Regional Health - Europe analyzed 63,659 samples from 48 countries (2015-2022), with ENPEN contributing 85% of typed non-polio enterovirus data (Harvala et al., Lancet Reg Health Eur 2025).

This CVA16 dataset is the second ENPEN enterovirus dataset for Nextclade, following EV-D68. The build pipeline is available at enterovirus-phylo/nextclade_a16, adapted from the EV-D68 pipeline template. The same team (Neuner-Jehle, Gonzalez-Sanchez, Hodcroft) maintains both datasets.

Blocking issues

Issues affecting scientific correctness, data integrity, or user-facing accuracy. These block adoption of the dataset until addressed.

🔴 H1. Dataset path naming inconsistent with sibling dataset [click to expand]

The dataset path enpen/enterovirus/cva16 uses an undashed name, while the sibling EV-D68 dataset uses enpen/enterovirus/ev-d68 (dashed). Dataset paths are immutable after release. The curation guide recommends consistency: "choose between 'flu' and 'influenza', stick to it."

The EV-D68 dataset is already released (3 versions) and its flat convention (no reference accession suffix) is locked. The reference suffix question is moot for CVA16 if ENPEN follows the same pattern.

Effect: Once released, the path cannot be renamed. cva16 vs ev-d68 sets an inconsistent naming pattern in the ENPEN collection.

Fix: Consider enpen/enterovirus/cva-16 (with dash, matching ev-d68 convention).

Non-blocking issues

Cosmetic issues, minor inconsistencies, and documentation improvements. Fix if time allows.

🟡 M1. Typo "Cocksackievirus" in README [click to expand]

data/enpen/enterovirus/cva16/README.md:28:1: the subheading reads "Subgenogroups of Cocksackievirus A16" instead of "Coxsackievirus".

Fix: Change "Cocksackievirus" to "Coxsackievirus" on line 28.

🟡 M2. `reference name` attribute duplicates pathogen name [click to expand]

In data/enpen/enterovirus/cva16/pathogen.json:15:1, attributes["reference name"] is set to "Coxsackievirus A16", identical to attributes.name. This does not help users distinguish the reference strain from the pathogen name.

The EV-D68 dataset uses the FASTA header description: "Human enterovirus 68 strain Fermon, complete genome.". The CVA16 reference FASTA header is >U05876.1 coxsackievirus A16 G-10, complete genome. A more informative value would be "G-10" or "Coxsackievirus A16 strain G-10".

The data_output/index.json also inherits this duplicated name.

Fix: Change "reference name" to "G-10" or "Coxsackievirus A16 strain G-10" in pathogen.json.
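
A check along these lines could catch the duplication automatically. It is a minimal sketch, assuming pathogen.json holds an `attributes` object whose `name` and `reference name` values are plain strings, as described above. The function name and the inline JSON fragments are hypothetical and are not taken from the dataset.

```python
import json


def duplicated_reference_name(pathogen_json_text):
    """Return True when attributes["reference name"] repeats attributes["name"]."""
    attributes = json.loads(pathogen_json_text).get("attributes", {})
    name = attributes.get("name")
    reference_name = attributes.get("reference name")
    # Only flag when both values are present and identical,
    # ignoring case and surrounding whitespace.
    if name is None or reference_name is None:
        return False
    return name.strip().lower() == reference_name.strip().lower()


# Hypothetical fragments mirroring the situation described in M2.
current = (
    '{"attributes": {"name": "Coxsackievirus A16", '
    '"reference name": "Coxsackievirus A16"}}'
)
proposed = (
    '{"attributes": {"name": "Coxsackievirus A16", '
    '"reference name": "Coxsackievirus A16 strain G-10"}}'
)

print(duplicated_reference_name(current))
print(duplicated_reference_name(proposed))
```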

🟡 M3. Consider `experimental` flag given testing status [click to expand]

The PR description states: "It needs to be tested further by ENPEN and others." The dataset has no experimental flag set, and the generated index shows "enabled": true.

If the dataset is intended for broader testing before full release, setting "experimental": true in pathogen.json would signal this to users. This is consistent with the "unreleased" version tag already in use.

Fix: Consider adding "experimental": true to pathogen.json if the dataset is not yet ready for production use.

🟡 M4. Three files missing trailing newlines [click to expand]

The following files lack a trailing newline:

  • data/enpen/enterovirus/cva16/CHANGELOG.md
  • data/enpen/enterovirus/cva16/README.md
  • data/enpen/enterovirus/cva16/tree.json

Fix: Add a trailing newline to each file.
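
One possible way to apply this fix in bulk is sketched below. The helper name is hypothetical, and the demo runs on a temporary file rather than the dataset files listed above.

```python
import tempfile
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a newline if the file is non-empty and lacks one.

    Returns True when the file was modified, False otherwise.
    """
    path = Path(path)
    data = path.read_bytes()
    if not data or data.endswith(b"\n"):
        return False
    path.write_bytes(data + b"\n")
    return True


# Demo on a throwaway file standing in for README.md.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "README.md"
    sample.write_bytes(b"# CVA16 dataset")
    first_pass = ensure_trailing_newline(sample)   # file gets a newline
    second_pass = ensure_trailing_newline(sample)  # already fine, untouched
    final_bytes = sample.read_bytes()
```

Working on bytes rather than text avoids any re-encoding of the file contents, which matters for a large tree.json.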

🔵 L1. No citation section in README [click to expand]

The EV-D68 README includes a "Citation" section with a recommended citation. The CVA16 README omits this.

Fix: Add a citation section referencing the dataset authors and the workflow repository, consistent with the EV-D68 README.

Notes

Observations that require no action: correct design decisions, positive patterns, comparisons with related work, and future improvement ideas.

Click to expand
  • All 11 CDS regions in genome_annotation.gff3 have lengths divisible by 3 and zero ambiguous bases. Coordinates match between GFF3 and tree.json meta.genome_annotations exactly. The CDS regions represent enterovirus polyprotein cleavage products: only VP4 has an ATG start codon (polyprotein start) and only 3D is followed by a TAG stop codon. This is correct for a picornavirus single-ORF genome.
  • Reference U05876.1 is 7413 nt with a 750 nt 5' UTR, 6579 nt coding region (VP4 through 3D), and 84 nt 3' UTR, matching expected CVA16 genome organization.
  • The reference sequence U05876 is present as a leaf node in the tree. This means Nextclade will not accumulate spurious private mutations when analyzing the reference itself.
  • The tree contains 777 leaf nodes across 11 clades, spanning dates 1997 to 2024. Clade distribution by node count (B1a: 733, B1b: 483, B1c: 195; these sum to more than the 777 leaves, so they include internal nodes) reflects current global CVA16 epidemiology with B1a/B1b dominance and emerging B1c (Zeng et al., Viruses 2025).
  • The Static Inferred Ancestor approach (via outgroup rooting) is well-documented in meta.extensions.nextclade and the README, matching the EV-D68 dataset pattern.
  • defaultCds: "VP1" is appropriate - VP1 is the standard target for enterovirus molecular typing (Sun et al., J Clin Microbiol 2014).
  • QC thresholds are reasonable for a ~7.4 kb genome: missingDataThreshold: 1000 (~13.5%), privateMutations cutoff 120 (~1.6%), maxDivergence: 0.15.
  • 21/35 example sequences are tree leaves, 14 placed at runtime. This is expected behavior for Nextclade.
  • No alignmentPreset or ignoredFrameShifts are set (see dynamic validation below for frameshift observations).
  • The snpClusters QC rule (absent in EV-D68) is enabled with windowSize: 100, clusterCutOff: 4.
  • The mutLabels section contains extensive nucleotide mutation labels mapping mutations to subgenogroups (B1a, B1b, B1c, B1, C, D, E, F), enabling lineage-defining mutation annotation.
  • The compatibility field uses "3.0.0" (matching EV-D68), not "3.0.0-alpha.0".
  • Workflow source repository enterovirus-phylo/nextclade_a16 exists and is publicly accessible.
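
The arithmetic behind several of these notes can be rechecked with a short script. This is a sketch only: the genome layout and threshold values are the ones quoted above, while the helper names and the single-row GFF3 text are hypothetical (the real genome_annotation.gff3 has 11 CDS rows, which are not reproduced here).

```python
GENOME_LENGTH = 7413                 # U05876.1, as quoted in the notes
UTR5, CODING, UTR3 = 750, 6579, 84   # region lengths quoted in the notes


def cds_lengths_from_gff3(gff3_text):
    """Map each CDS name to its length in nt (GFF3 coordinates are 1-based, inclusive)."""
    lengths = {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        lengths[attrs.get("Name", f"{start}-{end}")] = end - start + 1
    return lengths


def out_of_frame(lengths):
    """Names of CDS entries whose length is not a multiple of 3."""
    return [name for name, n in lengths.items() if n % 3 != 0]


# Hypothetical single row covering the whole coding span (751..7329).
toy_gff3 = "##gff-version 3\nU05876.1\t.\tCDS\t751\t7329\t.\t+\t0\tName=coding_span\n"
toy_lengths = cds_lengths_from_gff3(toy_gff3)

layout_ok = UTR5 + CODING + UTR3 == GENOME_LENGTH
missing_pct = round(100 * 1000 / GENOME_LENGTH, 1)   # missingDataThreshold
private_pct = round(100 * 120 / GENOME_LENGTH, 1)    # privateMutations cutoff
```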

Nextclade CLI run

Nextclade CLI run via Docker (nextstrain/nextclade) against the dataset.
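
The exact command used for this run is not recorded here. The sketch below shows one way such a run might be assembled, assuming the `nextstrain/nextclade` image and the `nextclade run` options `--input-dataset` and `--output-all`. The mount point, directory names, and FASTA file name are hypothetical.

```python
def nextclade_docker_command(workdir, dataset_dir, fasta, output_dir,
                             image="nextstrain/nextclade"):
    """Build (but do not execute) a docker argv list for a Nextclade run.

    All paths are relative to `workdir`, which is mounted at /data
    inside the container (an arbitrary choice for this sketch).
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{workdir}:/data",
        image,
        "nextclade", "run",
        "--input-dataset", f"/data/{dataset_dir}",
        "--output-all", f"/data/{output_dir}",
        f"/data/{fasta}",
    ]


# Hypothetical paths for illustration only.
cmd = nextclade_docker_command(
    workdir="/tmp/review",
    dataset_dir="data/enpen/enterovirus/cva16",
    fasta="examples.fasta",
    output_dir="out",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once Docker is available. Keeping command construction separate from execution makes it easy to inspect before running.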

Reference sequence [click to expand]

The reference (U05876.1) passes all QC checks with overall status good (score 0). Zero private mutations, zero frameshifts, zero stop codons, zero missing data. Clade assigned: A. This confirms the reference is correctly represented in the tree as a leaf node.

Example sequences (35 total) [click to expand]

| QC status | Count | Fraction |
|-----------|-------|----------|
| good      | 23    | 65.7%    |
| mediocre  | 2     | 5.7%     |
| bad       | 10    | 28.6%    |
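
A tally like the one above could be reproduced from the Nextclade TSV output roughly as follows. This is a sketch that assumes the output has a `qc.overallStatus` column; verify the column name against the actual file. The sequence names in the sample are placeholders, and only the per-status counts come from the table.

```python
import csv
import io
from collections import Counter


def qc_status_fractions(tsv_text, column="qc.overallStatus"):
    """Return {status: (count, percent)} from tab-separated Nextclade output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    return {s: (n, round(100 * n / total, 1)) for s, n in counts.items()}


# Placeholder rows reproducing the counts reported in the table.
statuses = ["good"] * 23 + ["mediocre"] * 2 + ["bad"] * 10
rows = ["seqName\tqc.overallStatus"]
rows += [f"seq{i}\t{status}" for i, status in enumerate(statuses)]
summary = qc_status_fractions("\n".join(rows) + "\n")
print(summary)
```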

Clade assignments: B1b (12), B1a (9), B1c (7), B1 (5), C (1), D (1). All 35 received a clade assignment.

Private mutations: min=0, max=377, mean=51.5. 5/35 exceed the threshold of 120. Three sequences (PX448982, PX448985, PX448978) have >200 private mutations, suggesting they are divergent from the nearest tree node.

Frameshifts: 4/35 have frameshifts. No ignoredFrameShifts are configured, so all trigger QC warnings:

| Sequence | CDS | Codon range |
|----------|-----|-------------|
| PX448822 | VP1 | 165-297     |
| PX448982 | VP2 | 160-254     |
| PX449037 | 2A  | 135-150     |
| PX448850 | VP3 | 34-242      |

Three of these frameshifts are long, spanning roughly 95 to 209 codons (VP2, VP1, VP3), while the 2A frameshift covers only about 16 codons. They are more likely sequencing artifacts or incomplete sequences than biological frameshifts. If they are expected in the example set, adding ignoredFrameShifts entries would suppress the QC warnings.
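
The codon ranges in the table can be turned into span lengths to judge how large each frameshift is. This is a small sketch that assumes the reported ranges are inclusive at both ends. If the end is exclusive, each span is one codon shorter.

```python
# (sequence, CDS, first codon, last codon) as listed in the table above.
FRAMESHIFTS = [
    ("PX448822", "VP1", 165, 297),
    ("PX448982", "VP2", 160, 254),
    ("PX449037", "2A", 135, 150),
    ("PX448850", "VP3", 34, 242),
]


def codon_span(begin, end):
    """Number of codons in a range, assuming both ends are inclusive."""
    return end - begin + 1


spans = {cds: codon_span(begin, end) for _, cds, begin, end in FRAMESHIFTS}
for cds, n in sorted(spans.items(), key=lambda item: -item[1]):
    print(f"{cds}: {n} codons")
```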

SNP clusters: 5/35 flagged as bad, 3/35 as mediocre. The snpClusters rule (windowSize: 100, clusterCutOff: 4) flags sequences with concentrated mutations. This is expected for divergent examples and does not indicate misconfiguration.
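
To illustrate what a windowed cluster rule of this kind does, here is a simplified sliding-window sketch. It is not Nextclade's implementation. In particular, it assumes that a window is flagged when it contains more than `cluster_cutoff` mutated positions, and the merging of overlapping flagged windows is a choice made for this example.

```python
def snp_clusters(positions, window_size=100, cluster_cutoff=4):
    """Group mutation positions that fall in dense windows.

    A run of positions is flagged when some window of `window_size`
    nucleotides contains more than `cluster_cutoff` of them.
    Returns a list of clusters, each a sorted list of positions.
    """
    positions = sorted(positions)
    flagged = [False] * len(positions)
    start = 0
    for end in range(len(positions)):
        # Shrink the window until it spans fewer than window_size nt.
        while positions[end] - positions[start] >= window_size:
            start += 1
        if end - start + 1 > cluster_cutoff:
            for i in range(start, end + 1):
                flagged[i] = True

    # Merge consecutive flagged positions into clusters.
    clusters, current = [], []
    for pos, is_flagged in zip(positions, flagged):
        if is_flagged:
            current.append(pos)
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters


dense = snp_clusters([10, 20, 30, 40, 50, 500, 900])
sparse = snp_clusters([0, 100, 200, 300, 400, 500])
```

With these inputs, the five positions between 10 and 50 form one cluster, while positions spaced 100 nt apart produce none.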

Stop codons: 1/35 has a premature stop codon.

Missing data: 0 across all 35 sequences.

@ivan-aksamentov (Member)

🔴 H1. Dataset path naming inconsistent with sibling dataset

Claude is being a bit too dramatic here 😆

@nneune (Collaborator, Author) commented Mar 4, 2026

Oh wow, the AI did a great job at summarizing the science behind CVA16. For the "dataset path naming" inconsistency, CVA16 is actually the official name and not CV-A16 (see Simmonds et al., 2020). I'll correct the typos!

@ivan-aksamentov ivan-aksamentov deployed to refs/heads/enpen-cva16 March 4, 2026 17:27 — with GitHub Actions Active

@nneune (Collaborator, Author) left a comment

The same issues exist as with the EV-D68 dataset. Divergence is not a valid QC label, and the nucMutLabelMapReverse is deprecated.
