Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot round-trip explicitly set missing INFO values in VCF #1197

Closed
jeromekelleher opened this issue Feb 16, 2024 · 3 comments
Closed

Cannot round-trip explicitly set missing INFO values in VCF #1197

jeromekelleher opened this issue Feb 16, 2024 · 3 comments
Milestone

Comments

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Feb 16, 2024

The all_fields.vcf file contains lots of examples where we explicitly state that an INFO key is missing, rather than omitting the key, e.g. II1=. and II2=.,. here. This was handled before #1190 because we treating non-present INFO keys as PAD values and only these explicit "key=." values as missing.

I don't think it's a useful distinction, and likely to cause more problems downstream if we distinguish between these two types of missingness. I'm fairly clear that regarding missing keys as dimension padding isn't helpful, in any case.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
1       1       .       G       A,C     .       PASS    IB0     .       .       .
1       2       .       A       G,G     .       PASS    II1=126 .       .       .
1       3       .       A       G,G     .       PASS    II1=.   .       .       .
1       4       .       T       A,C     .       PASS    II2=459,-140    .       .       .
1       5       .       T       A,C     .       PASS    II2=.,-140      .       .       .
1       6       .       T       A,C     .       PASS    II2=459,.       .       .       .
1       7       .       T       A,C     .       PASS    II2=.,. .       .       .

However, it seems that bcftools at least does make this distinction, and losslessly roundtrips this VCF through BCF.

My suggestion here is that we just edit the all_fields.vcf file to remove all-missing values. This seems like a pretty niche problem, and probably something we'd need to deal with explicitly at the spec level rather than here. It's not worth getting bogged down on, I think.

@jeromekelleher jeromekelleher added this to the 0.8.0 release milestone Feb 16, 2024
@tomwhite
Copy link
Collaborator

Sounds good - all_fields.vcf is not a VCF from the wild, so it's OK to change it.

@jeromekelleher
Copy link
Collaborator Author

Do you have the script for generating it?

@tomwhite
Copy link
Collaborator

See https://github.com/pystatgen/sgkit/blob/main/sgkit/tests/io/vcf/test_vcf_generator.py

jeromekelleher added a commit to jeromekelleher/sgkit that referenced this issue Feb 16, 2024
jeromekelleher added a commit to jeromekelleher/sgkit that referenced this issue Feb 16, 2024
jeromekelleher added a commit to jeromekelleher/sgkit that referenced this issue Mar 5, 2024
@mergify mergify bot closed this as completed in ec365f0 Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants