Update PanSN specification to use full string for haplotype_id
#4
Related PR on W-line specification in GFA file format: GFA-spec/GFA-spec#126
- **Primary/Alternate assemblies**. Used when one assembly is much more complete than the other haplotype (example: `hifiasm` without Hi-C or trio phasing): `HG002#1#ctg1234` and `HG002#2#ctg5678`

- **Haplotype 1/Haplotype 2**. Used when both haplotype assemblies are comparably complete, but parental origin is unknown (example: `hifiasm` or `verkko` with Hi-C phasing): `HG002#hap1#ctg1234` and `HG002#hap2#ctg5678`

- **Maternal/Paternal**. Used when both haplotypes are comparably complete, and their parental origins for all chromosomes are known (example: `hifiasm` or `verkko` with trio phasing): `HG002#mat#ctg1234` and `HG002#pat#ctg5678`

- **Merged assemblies**. Used when both haplotypes are combined to create one merged assembly (example: many past assemblers): `HG002#mer#ctg1234`
Right now in the pangenome graphs for GRCh38 and CHM13 we're using `0` (`GRCh38#0#chr1`) for the only haplotype of a haploid reference or assembly. Should that become a recommended practice? The closest thing here is `mer`, but I don't think it makes sense to say that CHM13 is merging anything.
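For reference, all of the naming patterns in this thread split the same way once `haplotype_id` is treated as an opaque string. The following is a minimal sketch, not part of the PR: `PanSNName` and `parse_pansn` are hypothetical names, and splitting at most twice (so that any further delimiter stays in the contig name) is an assumption rather than something the spec text quoted here states.

```python
from typing import NamedTuple


class PanSNName(NamedTuple):
    sample: str
    haplotype_id: str  # kept as an opaque string, never cast to int
    contig: str


def parse_pansn(name: str, delim: str = "#") -> PanSNName:
    """Split a PanSN-style name into sample, haplotype_id and contig."""
    # Assumption: split at most twice, so extra delimiters remain in the contig name.
    parts = name.split(delim, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a PanSN name: {name!r}")
    return PanSNName(*parts)


# The integer-style and string-style identifiers from this thread parse alike.
for example in ("HG002#1#ctg1234", "HG002#hap1#ctg1234",
                "HG002#mat#ctg1234", "GRCh38#0#chr1"):
    print(parse_pansn(example))
```

With this approach `0`, `1`, `hap1`, `mat` and `mer` are all just labels; nothing in the parser assigns them a meaning.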
@@ -32,6 +32,18 @@ Tools supporting PanSN should allow the user to change the delimiter.

The prefixing should provide a unique hierarchy of sample names and haplotype identifiers for the entire pangenome under analysis.

### haplotype_id

The use of strings for `haplotype_id` allows for clear representation of various assembly types and scenarios. Here are detailed examples based on common labeling practices from Vertebrate Genome Project (VGP) and Earth BioGenome Project (EBP):
How do we square this table of haplotype name meanings with the paragraph later on about how PanSN is not a metadata format? Because this ends up looking like we're defining a metadata format where you pick the right haplotype name to encode your metadata, and if you start having new metadata to encode (this is a decoy, this is the third copy in a trisomy) you might want to come back here and register a new name for it.
Maybe this whole section needs to be clearly a MAY. If a tool consuming PanSN-named assemblies actually needs to know which assembly came from which parent, it MUST NOT require that this will be encoded as `mat` and `pat` names. It MUST support a way for the user to specify that the egg genome is `1` and the sperm genome is `2`, or even that the egg genome is `pat` and the sperm genome is `mat`.
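To make that concrete, here is a rough sketch of what "the user specifies the mapping" could look like in a consuming tool. Everything here is hypothetical (`parental_origin`, the `origin_by_haplotype` argument, and the `"maternal"`/`"paternal"` labels are illustrative and not from any existing tool); the point is only that the tool reads parental origin from user-supplied configuration, not from the haplotype name itself.

```python
from typing import Dict, Optional


def parental_origin(name: str,
                    origin_by_haplotype: Dict[str, str],
                    delim: str = "#") -> Optional[str]:
    """Look up parental origin for a PanSN-style name via a user-supplied map.

    The haplotype_id is only used as a lookup key; its spelling carries no meaning.
    Returns None when the user gave no origin for that haplotype_id.
    """
    parts = name.split(delim, 2)
    if len(parts) != 3:
        raise ValueError(f"not a PanSN name: {name!r}")
    return origin_by_haplotype.get(parts[1])


# User says: haplotype "1" is the egg genome, "2" is the sperm genome.
numeric = {"1": "maternal", "2": "paternal"}
print(parental_origin("HG002#1#ctg1234", numeric))

# Even a deliberately confusing assignment has to work.
swapped = {"pat": "maternal", "mat": "paternal"}
print(parental_origin("HG002#pat#ctg5678", swapped))
```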
Following discussions within the HPRC (Human Pangenome Reference Consortium), we propose updating the PanSN specification to use full strings instead of integers for haplotype identifiers (`haplotype_id`). This change addresses several needs identified by consortium members:
`hifiasm`/`verkko` with different phasing methods)

This change may require updates to tools that strictly adhere to the previous integer-based specification, such as those working with GFA W lines or VCFs.
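As an illustration of the compatibility concern, a tool author could check which names in a dataset still fit the previous integer-only form before handing them to code that expects an integer. This is a sketch under assumptions: `has_integer_haplotype` is a hypothetical helper, and "integer" is taken here to mean a non-empty run of ASCII digits.

```python
def has_integer_haplotype(name: str, delim: str = "#") -> bool:
    """Return True if the haplotype_id field is a plain non-negative integer."""
    parts = name.split(delim, 2)
    if len(parts) != 3:
        raise ValueError(f"not a PanSN name: {name!r}")
    hap = parts[1]
    # isascii() guards against non-ASCII digit characters that isdigit() accepts.
    return hap.isascii() and hap.isdigit()


names = ["HG002#1#ctg1234", "HG002#hap1#ctg1234", "HG002#mat#ctg1234"]
# Names that integer-only consumers would need updating to handle.
needs_update = [n for n in names if not has_integer_haplotype(n)]
print(needs_update)
```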