Update PanSN specification to use full string for `haplotype_id` #4

AndreaGuarracino · 2024-10-13T15:32:30Z

Following discussions within the HPRC (Human Pangenome Reference Consortium), we propose updating the PanSN specification to use full strings instead of integers for haplotype identifiers (haplotype_id). This change addresses several needs identified by consortium members:

- Clear distinction between different assembly types (primary/alternate, haplotype 1/2, maternal/paternal, merged)
- Support for various assembly scenarios (e.g., hifiasm/verkko with different phasing methods)
- Improved human readability and immediate understanding of haplotype nature/origin

This change may require updates to tools that strictly adhere to the previous integer-based specification, such as those working with GFA W lines or VCFs.
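To illustrate the kind of update a tool might need, here is a minimal sketch of a parser that keeps `haplotype_id` as a string instead of casting it to an integer. The names `PanSNName` and `parse_pansn` are hypothetical and not part of any existing tool; the sketch assumes the default `#` delimiter and, as a choice of this example only, splits on the first two delimiters.

```python
from typing import NamedTuple


class PanSNName(NamedTuple):
    """Hypothetical container for the three PanSN fields."""
    sample: str
    haplotype_id: str  # kept as a string, so "1", "hap1" and "mat" are all accepted
    contig: str


def parse_pansn(name: str, delim: str = "#") -> PanSNName:
    """Split a PanSN-style name into sample, haplotype_id and contig.

    Assumption of this sketch: only the first two delimiters are treated
    as field separators, and every field must be non-empty.
    """
    parts = name.split(delim, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a PanSN-style name: {name!r}")
    return PanSNName(*parts)
```

A tool that previously called `int()` on the second field would reject `HG002#hap1#ctg1234`; treating the field as an opaque string accepts both the old integer form and the proposed string form.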

AndreaGuarracino · 2024-10-13T15:44:06Z

Related PR on W-line specification in GFA file format: GFA-spec/GFA-spec#126

adamnovak · 2024-10-14T15:34:16Z

README.md

+- **Primary/Alternate assemblies**. Used when one assembly is much more complete than the other haplotype (example: `hifiasm` without Hi-C or trio phasing): `HG002#1#ctg1234` and `HG002#2#ctg5678`
+
+- **Haplotype 1/Haplotype 2**. Used when both haplotype assemblies are comparably complete, but parental origin is unknown (example: `hifiasm` or `verkko` with Hi-C phasing): `HG002#hap1#ctg1234` and `HG002#hap2#ctg5678`
+
+- **Maternal/Paternal**. Used when both haplotypes are comparably complete, and their parental origins for all chromosomes are known (example: `hifiasm` or `verkko` with trio phasing): `HG002#mat#ctg1234` and `HG002#pat#ctg5678`
+
+- **Merged assemblies**. Used when both haplotypes are combined to create one merged assembly (example: many past assemblers): `HG002#mer#ctg1234`


Right now, in the pangenome graphs for GRCh38 and CHM13, we're using `0` (`GRCh38#0#chr1`) for the only haplotype of a haploid reference or assembly. Should that become a recommended practice? The closest thing here is `mer`, but I don't think it makes sense to say that CHM13 is merging anything.

adamnovak · 2024-10-14T15:49:18Z

README.md

@@ -32,6 +32,18 @@ Tools supporting PanSN should allow the user to change the delimiter.

 The prefixing should provide a unique hierachy of sample names and haplotype identifiers for the entire pangenome under analysis.

+### haplotype_id
+
+The use of strings for `haplotype_id` allows for clear representation of various assembly types and scenarios. Here are detailed examples based on common labeling practices from Vertebrate Genome Project (VGP) and Earth BioGenome Project (EBP):


How do we square this table of haplotype name meanings with the paragraph later on about how PanSN is not a metadata format? Because this ends up looking like we're defining a metadata format where you pick the right haplotype name to encode your metadata, and if you start having new metadata to encode (this is a decoy, this is the third copy in a trisomy) you might want to come back here and register a new name for it.

Maybe this whole section needs to be clearly a MAY. If a tool consuming PanSN-named assemblies actually needs to know which assembly came from which parent, it MUST NOT require that this be encoded as `mat` and `pat` names. It MUST support a way for the user to specify that the egg genome is `1` and the sperm genome is `2`, or even that the egg genome is `pat` and the sperm genome is `mat`.
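As a rough sketch of what that could look like in a consuming tool, the parental origin can come from a user-supplied mapping rather than from the haplotype name itself. Everything here is hypothetical: the `parse_origin_map` and `origin_of` helpers, the `haplotype_id=origin` option syntax, and the `maternal`/`paternal` labels are assumptions for illustration, not something the spec or any tool defines.

```python
from typing import Dict, Optional


def parse_origin_map(spec: str) -> Dict[str, str]:
    """Parse a hypothetical user option such as "1=maternal,2=paternal"."""
    origin_map: Dict[str, str] = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"expected haplotype_id=origin, got {item!r}")
        origin_map[key] = value
    return origin_map


def origin_of(name: str, origin_map: Dict[str, str], delim: str = "#") -> Optional[str]:
    """Look up the origin of a PanSN-style name via the user's mapping.

    The haplotype_id is treated as an opaque key: nothing is inferred
    from the string itself, so "pat" may map to "maternal" if the user says so.
    """
    fields = name.split(delim, 2)
    if len(fields) != 3:
        raise ValueError(f"not a PanSN-style name: {name!r}")
    return origin_map.get(fields[1])
```

With this shape, `origin_of("HG002#1#ctg1234", parse_origin_map("1=maternal,2=paternal"))` would yield `maternal`, and a haplotype the user did not describe simply has no known origin.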

AndreaGuarracino added 3 commits October 13, 2024 10:27

haplotype_id as full string

3aabf4e

provide examples from VGP and EBP

c8b11ae

fix typo

b64b9a7

AndreaGuarracino requested a review from ekg October 13, 2024 15:42

adamnovak reviewed Oct 14, 2024
