Update PanSN specification to use full string for haplotype_id #4

Open. Wants to merge 3 commits into base: main


AndreaGuarracino
Member

Following discussions within the HPRC (Human Pangenome Reference Consortium), we propose updating the PanSN specification to use full strings instead of integers for haplotype identifiers (haplotype_id). This change addresses several needs identified by consortium members:

  • Clear distinction between different assembly types (primary/alternate, haplotype 1/2, maternal/paternal, merged)
  • Support for various assembly scenarios (e.g., hifiasm/verkko with different phasing methods)
  • Improved human readability and immediate understanding of haplotype nature/origin

This change may require updates to tools that strictly adhere to the previous integer-based specification, such as those working with GFA W lines or VCFs.
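As a rough illustration of the compatibility concern, the sketch below lists the names that a strictly integer-based parser would likely reject. The helper name `non_integer_haplotypes` is hypothetical and is not taken from any existing tool; it assumes the default `#` delimiter.

```python
def non_integer_haplotypes(names, delimiter="#"):
    """Return the PanSN names whose haplotype_id is not a plain ASCII integer."""
    flagged = []
    for name in names:
        fields = name.split(delimiter, 2)
        if len(fields) != 3:
            continue  # not a three-field PanSN name; outside the scope of this check
        haplotype_id = fields[1]
        # isascii() guards against Unicode digits that isdigit() would accept
        if not (haplotype_id.isascii() and haplotype_id.isdigit()):
            flagged.append(name)
    return flagged


print(non_integer_haplotypes(["HG002#1#ctg1234", "HG002#mat#ctg5678"]))
```

A tool maintainer could run a check like this over existing inputs to see which names would need the updated parsing.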

AndreaGuarracino requested a review from ekg on October 13, 2024, 15:42
@AndreaGuarracino
Member Author

Related PR on W-line specification in GFA file format: GFA-spec/GFA-spec#126

Comment on lines +39 to +45
- **Primary/Alternate assemblies**. Used when one assembly is much more complete than the other haplotype (example: `hifiasm` without Hi-C or trio phasing): `HG002#1#ctg1234` and `HG002#2#ctg5678`

- **Haplotype 1/Haplotype 2**. Used when both haplotype assemblies are comparably complete, but parental origin is unknown (example: `hifiasm` or `verkko` with Hi-C phasing): `HG002#hap1#ctg1234` and `HG002#hap2#ctg5678`

- **Maternal/Paternal**. Used when both haplotypes are comparably complete, and their parental origins for all chromosomes are known (example: `hifiasm` or `verkko` with trio phasing): `HG002#mat#ctg1234` and `HG002#pat#ctg5678`

- **Merged assemblies**. Used when both haplotypes are combined to create one merged assembly (example: many past assemblers): `HG002#mer#ctg1234`
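
To make the naming patterns above concrete, here is a minimal sketch of splitting a PanSN name into its three fields while treating `haplotype_id` as an opaque string. The names `parse_pansn` and `PanSNName` are hypothetical and do not come from any existing tool; the `#` delimiter follows the examples above.

```python
from typing import NamedTuple


class PanSNName(NamedTuple):
    sample: str
    haplotype_id: str  # kept as a string: "1", "hap1", "mat", "mer" are all accepted
    contig: str


def parse_pansn(name: str, delimiter: str = "#") -> PanSNName:
    """Split sample<delim>haplotype_id<delim>contig into its fields.

    Only the first two delimiters are treated as separators, so a contig
    name containing the delimiter is preserved (an assumption of this
    sketch, not a statement of the specification).
    """
    parts = name.split(delimiter, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a PanSN name: {name!r}")
    return PanSNName(*parts)


for example in ("HG002#1#ctg1234", "HG002#hap2#ctg5678",
                "HG002#mat#ctg1234", "HG002#mer#ctg1234"):
    print(parse_pansn(example))
```

Because the parser never converts `haplotype_id` to an integer, the legacy numeric labels and the proposed string labels go through the same code path.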

Right now in the pangenome graphs for GRCh38 and CHM13 we're using `0` (`GRCh38#0#chr1`) for the only haplotype of a haploid reference or assembly. Should that become a recommended practice? The closest thing here is `mer`, but I don't think it makes sense to say that CHM13 is merging anything.

@@ -32,6 +32,18 @@ Tools supporting PanSN should allow the user to change the delimiter.

The prefixing should provide a unique hierarchy of sample names and haplotype identifiers for the entire pangenome under analysis.
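
One way to picture this hierarchy is to group sequence names by sample and then by haplotype identifier. The sketch below is an illustration only: `build_hierarchy` is a hypothetical helper, the duplicate-name check is one possible reading of "unique", and the delimiter is a parameter so that users can change it.

```python
from collections import defaultdict


def build_hierarchy(names, delimiter="#"):
    """Group PanSN names as sample -> haplotype_id -> [contig, ...]."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    seen = set()
    for name in names:
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name!r}")
        seen.add(name)
        # Unpacking raises ValueError if the name has fewer than three fields
        sample, haplotype_id, contig = name.split(delimiter, 2)
        hierarchy[sample][haplotype_id].append(contig)
    return hierarchy


names = ["HG002#mat#ctg1234", "HG002#pat#ctg5678", "GRCh38#0#chr1"]
tree = build_hierarchy(names)
print(sorted(tree["HG002"]))    # ['mat', 'pat']
print(tree["GRCh38"]["0"])      # ['chr1']
```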

### haplotype_id

The use of strings for `haplotype_id` allows for clear representation of various assembly types and scenarios. Here are detailed examples based on common labeling practices from the Vertebrate Genomes Project (VGP) and the Earth BioGenome Project (EBP):

How do we square this table of haplotype name meanings with the paragraph later on about how PanSN is not a metadata format? Because this ends up looking like we're defining a metadata format where you pick the right haplotype name to encode your metadata, and if you start having new metadata to encode (this is a decoy, this is the third copy in a trisomy) you might want to come back here and register a new name for it.

Maybe this whole section needs to be clearly a MAY. If a tool consuming PanSN-named assemblies actually needs to know which assembly came from which parent, it MUST NOT require that this will be encoded as `mat` and `pat` names. It MUST support a way for the user to specify that the egg genome is `1` and the sperm genome is `2`, or even that the egg genome is `pat` and the sperm genome is `mat`.
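
A minimal sketch of what that could look like in a consuming tool, assuming the user supplies the mapping explicitly (for example through a config file or command-line option). The function `parental_origin` and the mapping layout are hypothetical; the point is only that the tool never infers parentage from the label itself.

```python
def parental_origin(name, origin_by_haplotype, delimiter="#"):
    """Return the user-declared origin for a PanSN name, or None if undeclared."""
    _sample, haplotype_id, _contig = name.split(delimiter, 2)
    return origin_by_haplotype.get(haplotype_id)


# Supplied by the user, not derived from the haplotype labels.
numeric_mapping = {"1": "maternal", "2": "paternal"}
print(parental_origin("HG002#1#ctg1234", numeric_mapping))    # maternal

# Nothing stops a user from declaring that "pat" is the egg genome.
swapped_mapping = {"pat": "maternal", "mat": "paternal"}
print(parental_origin("HG002#pat#ctg5678", swapped_mapping))  # maternal

# Labels the user did not declare carry no parental meaning.
print(parental_origin("HG002#mer#ctg1234", numeric_mapping))  # None
```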
