-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] Teach hail to export a VDS to an SVCR-VCF #14743
Conversation
hail/python/hail/vds/impex.py
Outdated
dataset : :class:`.MatrixTable` | ||
Dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it a VariantDataset
?
hail/python/hail/vds/impex.py
Outdated
ref = ref.drop('END') | ||
|
||
if 'gvcf_info' in var.entry and isinstance(var.gvcf_info.dtype, tstruct): | ||
warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to warn here? You've documented this above and given an example for how to preserve this information.
I view warnings as a way of telling users about changing behaviour or when they've been a bit naughty (but not terrible) and this doesn't really fall into that. What's your view?
if 'LGT' in var.entry: | ||
if 'GT' not in var.entry: | ||
var = var.annotate_entries(GT=lgt_to_gt(var.LGT, var.LA)) | ||
var = var.drop('LGT') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on dropping these fields if GT
/PT
are present? Is that safe to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a data error if lgt_to_gt(var.LGT, var.LA) != var.GT
.
I'm thinking about writing the reverse (gt_to_lgt
). Not entirely sure why one would use it, which is what's stopping me.
_create_row_uids=bool, | ||
_create_col_uids=bool, | ||
) | ||
def import_vcf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to call this import_svcr
to make it distinct from the extant import_vcf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But I like the symmetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think import_vcf
is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.
hail/python/hail/vds/impex.py
Outdated
---- | ||
The following validations and transformations take place: | ||
|
||
#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the types listed in FORMAT
and INFO
sections be vcf types rather than hail types?
hail/python/hail/vds/impex.py
Outdated
The following validations and transformations take place: | ||
|
||
#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32` | ||
#. One of ``GT`` or ``LGT`` must be a FORMAT field of type :py:data:`.tcall` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should LGT
exist given you rename/drop it in export?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to preclude it from existing, it's well formed to have it.
hail/python/hail/vds/impex.py
Outdated
Parameters | ||
---------- | ||
path : :class:`str` or :obj:`list` of :obj:`str` | ||
One or more paths to VCF files to read. Each path may or may not include glob expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is it lol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will fix both this and import_vcf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly a review on the documentation for new public-facing functionality. Your implementation looks good. It's nice how much existing functionality you could use.
Usage and notes in the documentation for the new methods.
ba167ab
to
67602a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for making the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, nice and clean! Just have a few simple questions and typos.
if 'END' in ref.entry: | ||
ref = ref.drop('END') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget, does VDS now have a manditory LEN
field, and optional END
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
'information."\n' | ||
) | ||
if VariantDataset.ref_block_max_length_field in ref.globals: | ||
rbml = hl.eval(ref[VariantDataset.ref_block_max_length_field]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could potentially be bad if the VDS isn't coming directly from disk, though we'd hopefully be able to optimize this read of a global field to not do much work. Do you think we should recommend writing before exporting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary. We do optimize away the entire computation if we can simply read one global field. We're not that bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain that's always guaranteed (what if we add a global field as the result of an aggregation?), but I think you're right that it would be rare enough we should scare people into writing when it isn't necessary, especially in this case when we can be pretty certain the max length field comes from disk.
_create_row_uids=bool, | ||
_create_col_uids=bool, | ||
) | ||
def import_vcf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think import_vcf
is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.
assert 'LAA' in format_md, format_md | ||
assert 'LA' not in format_md, format_md | ||
assert 'gvcf_info' not in format_md, format_md | ||
# LPGT is in this VDS, so we can check for PGT existing and LPGT not existing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment backwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but it's clearly unclear
Co-authored-by: Patrick Schultz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Chris!
Usage and notes in the documentation for the new methods.
Part of #14655.
Security Assessment
Impact Description
Does not increase attack surface. New methods only call into old ones.