-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compute local alleles during encode #299
Compute local alleles during encode #299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good. I'm happy that you were able to simplify the LAA computation logic. I'm not as familiar with the encode subcommand, but the local alleles logic looks correct to me.
pl.shape = (*shape, 3) | ||
pl.chunks = (*chunks, 3) | ||
pl.description += " (local-alleles)" | ||
# TODO fix dimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The third dimension depends on the ploidy but should be 6 for diploid samples. I assume that is what the TODO refers to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below
laa = ZarrArraySpec.new( | ||
vcf_field=None, | ||
name="call_LAA", | ||
dtype="i1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting "i4"
based on the LAA computation code. Why do we use "i1"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This contains allele indexes, so if it's more than 1 byte to store these we're in trouble (should check, though)
b407ed5
to
49fc182
Compare
I've put quite a bit of work into this now and I think it's nearly ready to merge and release in an "experimental" state. I've ended up deviating a bit from the VCF spec by using the LA field instead of LAA, as this allows us to differentiate between cases where refs values are referenced and not. If we don't do this and we treat REF specially then things get a lot more complicated. Doing it this way lets us specifically track only the (max) two alleles that are referenced in the genotypes, and therefore the PL dimension is 3. Since the only actual implementation that we know about also uses LA instead of LAA, I think this is fine. Until we have an |
Fixup Close LAA loophole
abed893
to
cf1947f
Compare
The tests depended on complex semantics of mixed local and non local fields, which doesn't seem worth the effort. Fixup du test
I think this is basically ready to go, I'll open some issues to track the rougher edges that'll need fixing later on. |
I'm going to go ahead and merge this as we want to use this code for some examples in the paper. |
We currently perform the calculations required to perform local allele conversion during the
explode
step. This was done initially as it was the most straightforward way to determine the final sizes of the Zarr arrays, when the number of local alleles was uncertain. However, as we see in #277, there's no real point in doing local alleles unless we cap the number of alt alleles to something reasonable. In practise, in all cases where it's currently used, the number of ALT alleles per sample is capped at 2.I think it would be a lot simpler (and better) to do localisation as part of the encode step. Here's my first steps at implementing this. Basically,
--local-alleles
becomes an argument to schema generation, and any known fields (LAA, LPL) will simply be present in the schema withvcf_field
set tonull
if they are to be computed. If those fields are present in the VCF itself, they'll just be converted as normal.Actual generation of the fields then is done in a way that's analagous to how we handle genotypes during
encode
.@Will-Tyler, would you mind taking a quick look at this initial sketch to see what you think? (I haven't taken out any of the code from
explode
yet)