Open
24 commits
9d704d2
Initial upload of CVA16 dataset
nneune Mar 2, 2026
a7a8a05
fix: rename recombinant branch to "RFs"
nneune Mar 3, 2026
7e42af5
fix: pathogen issues
nneune Mar 3, 2026
e41d9dc
chore: rebuild [skip ci]
nextstrain-bot Mar 3, 2026
fec4b05
fix links in dataset description
nneune Mar 3, 2026
697ee3c
Merge branch 'enpen-cva16' of github.com:nextstrain/nextclade_data in…
nneune Mar 3, 2026
96d6364
chore: rebuild [skip ci]
nextstrain-bot Mar 3, 2026
142d89d
fix author links
nneune Mar 3, 2026
a1235bc
fix typos and missing line endings
nneune Mar 4, 2026
c6fe883
fix: clade B2 was replaced by C
nneune Mar 4, 2026
8bec9db
chore: trigger CI
ivan-aksamentov Mar 4, 2026
dcdfee3
chore: rebuild [skip ci]
nextstrain-bot Mar 4, 2026
7724026
fix: remove nucMutLabelMapReverse and divergence qc
nneune Mar 5, 2026
bc334a8
Merge branch 'master' into enpen-cva16
nneune Apr 15, 2026
5efec0f
Increase minSeedCover, and privateMutations thresholds and use ancest…
nneune Apr 19, 2026
a451882
chore: rebuild [skip ci]
nextstrain-bot Apr 19, 2026
c300561
Update data/enpen/enterovirus/cva16/README.md
nneune Apr 19, 2026
318944c
chore: rebuild [skip ci]
nextstrain-bot Apr 19, 2026
8e9cc59
update README: dataset uses ancestral sequence as reference. Remove r…
nneune Apr 19, 2026
76c44da
chore: rebuild [skip ci]
nextstrain-bot Apr 19, 2026
d3fc423
update README: clarify reference terminology for ancestral sequence
nneune Apr 20, 2026
783d09d
chore: rebuild [skip ci]
nextstrain-bot Apr 20, 2026
4acb200
update README: clarify reference terminology for ancestral sequence
nneune Apr 20, 2026
73c01e2
chore: rebuild [skip ci]
nextstrain-bot Apr 20, 2026
3 changes: 2 additions & 1 deletion data/enpen/collection.json
@@ -23,6 +23,7 @@
]
},
"dataset_order": [
-"enpen/enterovirus/ev-d68"
+"enpen/enterovirus/ev-d68",
+"enpen/enterovirus/cva16"
]
}
5 changes: 5 additions & 0 deletions data/enpen/enterovirus/cva16/CHANGELOG.md
@@ -0,0 +1,5 @@
## Unreleased

Initial release of a Coxsackievirus A16 dataset for lineage classification!

Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
66 changes: 66 additions & 0 deletions data/enpen/enterovirus/cva16/README.md
@@ -0,0 +1,66 @@
# Coxsackievirus A16 dataset

| Key | Value |
|----------------------|-----------------------------------------------------------------------|
| authors | [Nadia Neuner-Jehle](https://eve-lab.org/people/nadia-neuner-jehle/), [Alejandra González-Sánchez](https://www.vallhebron.com/en/professionals/alejandra-gonzalez-sanchez), [Emma B. Hodcroft](https://eve-lab.org/people/emma-hodcroft/), [ENPEN](https://escv.eu/european-non-polio-enterovirus-network-enpen/) |
| name | Coxsackievirus A16 |
| reference | [Static Inferred Ancestor](https://github.com/enterovirus-phylo/nextclade_a16/blob/master/resources/inferred-root.fasta) |
| workflow | https://github.com/enterovirus-phylo/nextclade_a16 |
| path | `enpen/enterovirus/cva16` |
| clade definitions | A–F |

## Scope of this dataset

This dataset uses the [Static Inferred Ancestor](https://github.com/enterovirus-phylo/nextclade_a16/blob/master/resources/inferred-root.fasta) instead of the historical G-10 prototype sequence ([U05876.1](https://www.ncbi.nlm.nih.gov/nuccore/U05876)). It is intended for broad subgenogroup classification, mutation quality control, and phylogenetic analysis of CVA16 diversity.

*Note: The G-10 reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages.

To address this, the dataset is *rooted* on a Static Inferred Ancestor, a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can be used as an alternative for mutation calling.
Comment on lines +14 to +18
Copilot AI Apr 19, 2026

The dataset README states the dataset uses a “Static Inferred Ancestor” instead of the G-10 prototype, but the accompanying genome annotation (U05876.1) and the generated data_output/ dataset currently indicate G-10/U05876.1 as the reference. Please clarify which reference sequence is intended and update README and dataset metadata consistently (README, pathogen.json attributes, reference FASTA header/accession).


## Features

This dataset supports:

- Assignment of subgenogroups
- Phylogenetic placement
- Sequence quality control (QC)

## Subgenogroups of Coxsackievirus A16

Subgenogroups B1a, B1b and B1c represent the major phylogenetic divisions of CVA16 and are commonly used in virological surveillance and the literature. They are defined based on phylogenetic clustering and do not necessarily reflect antigenic differences.

In recent years, additional recombinant forms have been identified and labeled C-F (also referred to as B2, B3, and D). These recombinant forms cluster with the prototype strain (clade A).

Overall, these designations are based on phylogenetic structure and characteristic mutations, and are widely used in molecular epidemiology, similar to subgenotype systems for other enteroviruses. Unlike influenza (H1N1, H3N2) or SARS-CoV-2, there is no universally standardized global lineage nomenclature for enteroviruses; naming instead follows conventions established in published studies and surveillance practices.

## Related Enteroviruses

CVA16 is closely related to other EV-A viruses, including EV-A71, EV-A120, and CVA5. If you are not certain that your sequences contain only CVA16, we recommend using the "[Multiple Datasets](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-web/getting-started.html#multi-dataset-mode)" tab instead of "Single Dataset".

This prevents Nextclade from forcing sequences to align to the CVA16 reference tree. In single-dataset mode, for example, EV-A71 sequences may still align and receive a clade assignment (often near recombinant forms).

Please be cautious when working with short genes or fragments (e.g., 5'UTR sequences). These regions can be highly conserved across EV-A viruses, making genogroup and subgenogroup assignment prone to errors. In addition, such fragments may originate from recombinant genomes. Recombination is common in enteroviruses, and when analyzing only a fragment, this may go undetected.

If you are unsure how to proceed, please contact us. We are happy to assist.

## Reference types

This dataset includes several reference points used in analyses:
- *Static Inferred Ancestor:* Reconstructed ancestral sequence inferred with an outgroup, representing the likely founder of CVA16. Serves as a stable reference.

- *Parent:* The nearest ancestral node of a sample in the tree, used to infer branch-specific mutations.

- *Clade founder:* The inferred ancestral node defining a clade (e.g., B1a, B2). Mutations "since clade founder" describe changes that define that clade.

- *Reference:* RefSeq or similarly established prototype sequence. Here G-10 (U05876.1).

- *Tree root:* The root of the tree. It may change in future updates as more data become available.

All references use the coordinate system of the G-10 sequence.

## Issues & Contact
- For questions or suggestions, please [open an issue](https://github.com/enterovirus-phylo/nextclade_a16/issues) or email: eve-group[at]swisstph.ch

## What is a Nextclade dataset?

A Nextclade dataset includes the reference sequence, genome annotations, tree, clade definitions, and QC rules. Learn more in the [Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).
17 changes: 17 additions & 0 deletions data/enpen/enterovirus/cva16/genome_annotation.gff3
@@ -0,0 +1,17 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region U05876.1 1 7413
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
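
Editorial note: the CDS rows above can be sanity-checked mechanically. The following is a minimal Python sketch, not part of this PR or of Nextclade. It assumes that the mature peptides of the polyprotein should be contiguous, that each CDS length should be a multiple of three, and that the numeric suffix of each `ID` attribute is an amino-acid range on AAA50478.1 (an interpretation the PR does not state). The helper names `parse_cds` and `check_cds` are hypothetical.

```python
import re

# CDS rows copied from the annotation above. The real file is tab-separated;
# whitespace splitting below handles both tabs and spaces.
GFF_ROWS = """\
U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
"""


def parse_cds(gff_text):
    """Return (name, start, end, feature_id) for every CDS row in GFF3 text."""
    rows = []
    for line in gff_text.splitlines():
        cols = line.split(None, 8)  # at most 9 columns; attributes stay intact
        if len(cols) < 9 or line.startswith("#") or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        rows.append((attrs.get("Name", "?"), int(cols[3]), int(cols[4]),
                     attrs.get("ID", "")))
    return rows


def check_cds(rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    prev_end = None
    for name, start, end, feat_id in rows:
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length % 3 != 0:
            problems.append(f"{name}: length {length} is not a multiple of 3")
        # Assumption: adjacent mature peptides leave no gap and do not overlap.
        if prev_end is not None and start != prev_end + 1:
            problems.append(f"{name}: starts at {start}, expected {prev_end + 1}")
        # Assumption: the ID suffix "a..b" is an amino-acid range.
        match = re.search(r":(\d+)\.\.(\d+)$", feat_id)
        if match:
            aa_len = int(match.group(2)) - int(match.group(1)) + 1
            if aa_len * 3 != length:
                problems.append(f"{name}: {length} nt does not match {aa_len} aa")
        prev_end = end
    return problems


print(check_cds(parse_cds(GFF_ROWS)))
```

Under these assumptions the eleven rows pass all three checks, so the script prints an empty list.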
Comment on lines +4 to +17
Copilot AI Apr 19, 2026

In this dataset, the GFF3 seqid is U05876.1, but the reference FASTA header in reference.fasta is ancestral_sequence. Nextclade expects the genome annotation seqid values to match the reference sequence ID; otherwise CDS translation/annotation lookup can fail. Please make the FASTA ID and all first-column GFF3 IDs consistent (either rename the FASTA header to U05876.1 or change the GFF3 seqid/##sequence-region to ancestral_sequence).

Suggested change
-##sequence-region U05876.1 1 7413
 ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
-U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
-U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
-U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
-U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
-U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
-U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
-U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
-U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
-U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
-U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
-U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
-U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
+##sequence-region ancestral_sequence 1 7413
+ancestral_sequence Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
+ancestral_sequence Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
+ancestral_sequence Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
+ancestral_sequence Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
+ancestral_sequence Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
+ancestral_sequence Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
+ancestral_sequence Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
+ancestral_sequence Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
+ancestral_sequence Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
+ancestral_sequence Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
+ancestral_sequence Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
+ancestral_sequence Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193

Collaborator Author

Is this step really necessary?

Member

No, not necessary. I mean it would be nice to have for consistency (and also in pathogen.json), but there are dozens of datasets which have these values all over the place. Don't bother. Hope users will understand. Might add an automated check later.
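
Editorial note: an automated check of the kind mentioned in this thread could look roughly like the Python sketch below. It is an illustration only, not an existing Nextclade feature. It compares the sequence ID in the reference FASTA header with the seqids used in the GFF3 annotation (both the `##sequence-region` pragma and the first column of feature rows). The function names `fasta_ids`, `gff3_seqids`, and `mismatched_seqids` are hypothetical, and the inline strings stand in for the dataset's `reference.fasta` and `genome_annotation.gff3`.

```python
def fasta_ids(fasta_text):
    """Return the sequence IDs (first token after '>') found in FASTA text."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]


def gff3_seqids(gff_text):
    """Collect seqids from ##sequence-region pragmas and from feature rows."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) >= 2:
                ids.add(parts[1])
        elif line.strip() and not line.startswith("#"):
            ids.add(line.split()[0])  # first column of a feature row
    return ids


def mismatched_seqids(fasta_text, gff_text):
    """Return the GFF3 seqids that do not appear as a FASTA sequence ID."""
    return sorted(gff3_seqids(gff_text) - set(fasta_ids(fasta_text)))


# Stand-in inputs mirroring the situation described in the review comment.
fasta = ">ancestral_sequence\nACGT\n"
gff = (
    "##gff-version 3\n"
    "##sequence-region U05876.1 1 7413\n"
    "U05876.1\tGenbank\tCDS\t751\t957\t.\t+\t.\tName=VP4\n"
)

print(mismatched_seqids(fasta, gff))
```

With these stand-in inputs the script reports `U05876.1` as a mismatch. It would report nothing if the FASTA header were `U05876.1` or if the GFF3 used `ancestral_sequence` throughout.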
