Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Teach hail to export a VDS to an SVCR-VCF #14743

Merged
merged 7 commits into from
Oct 30, 2024

Conversation

chrisvittal
Copy link
Collaborator

Usage and notes in the documentation for the new methods.

Part of #14655.

Security Assessment

  • This change has no security impact

Impact Description

Does not increase attack surface. New methods only call into old ones.

Comment on lines 71 to 72
dataset : :class:`.MatrixTable`
Dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it a VariantDataset?

ref = ref.drop('END')

if 'gvcf_info' in var.entry and isinstance(var.gvcf_info.dtype, tstruct):
warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to warn here? You've documented this above and given an example for how to preserve this information.

I view warnings as a way of telling users about changing behaviour or when they've been a bit naughty (but not terrible) and this doesn't really fall into that. What's your view?

if 'LGT' in var.entry:
if 'GT' not in var.entry:
var = var.annotate_entries(GT=lgt_to_gt(var.LGT, var.LA))
var = var.drop('LGT')
Copy link
Member

@ehigham ehigham Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on dropping these fields if GT/PT are present? Is that safe to do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's a data error if lgt_to_gt(var.LGT, var.LA) != var.GT.

I'm thinking about writing the reverse (gt_to_lgt). Not entirely sure why one would use it, which is what's stopping me.

_create_row_uids=bool,
_create_col_uids=bool,
)
def import_vcf(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to call this import_svcr to make it distinct from the extant import_vcf?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But I like the symmetry.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.

----
The following validations and transformations take place:

#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the types listed in FORMAT and INFO sections be vcf types rather than hail types?

The following validations and transformations take place:

#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32`
#. One of ``GT`` or ``LGT`` must be a FORMAT field of type :py:data:`.tcall`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should LGT exist given you rename/drop it in export?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to preclude it from existing, it's well formed to have it.

Parameters
----------
path : :class:`str` or :obj:`list` of :obj:`str`
One or more paths to VCF files to read. Each path may or may not include glob expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is it lol?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will fix both this and import_vcf

Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly a review on the documentation for new public-facing functionality. Your implementation looks good. It's nice how much existing functionality you could use.

Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for making the changes.

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, nice and clean! Just have a few simple questions and typos.

Comment on lines +95 to +96
if 'END' in ref.entry:
ref = ref.drop('END')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget, does VDS now have a manditory LEN field, and optional END field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

'information."\n'
)
if VariantDataset.ref_block_max_length_field in ref.globals:
rbml = hl.eval(ref[VariantDataset.ref_block_max_length_field])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially be bad if the VDS isn't coming directly from disk, though we'd hopefully be able to optimize this read of a global field to not do much work. Do you think we should recommend writing before exporting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. We do optimize away the entire computation if we can simply read one global field. We're not that bad.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain that's always guaranteed (what if we add a global field as the result of an aggregation?), but I think you're right that it would be rare enough we should scare people into writing when it isn't necessary, especially in this case when we can be pretty certain the max length field comes from disk.

hail/python/hail/vds/impex.py Outdated Show resolved Hide resolved
_create_row_uids=bool,
_create_col_uids=bool,
)
def import_vcf(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.

assert 'LAA' in format_md, format_md
assert 'LA' not in format_md, format_md
assert 'gvcf_info' not in format_md, format_md
# LPGT is in this VDS, so we can check for PGT existing and LPGT not existing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment backwards?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but it's clearly unclear

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris!

@hail-ci-robot hail-ci-robot merged commit 056d9f2 into hail-is:main Oct 30, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants