Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute local alleles during encode #299

Merged
merged 12 commits into from
Jan 17, 2025

Conversation

jeromekelleher
Copy link
Contributor

We currently perform the calculations required to perform local allele conversion during the explode step. This was done initially as it was the most straightforward way to determine the final sizes of the Zarr arrays, when the number of local alleles was uncertain. However, as we see in #277, there's no real point in doing local alleles unless we cap the number of alt alleles to something reasonable. In practise, in all cases where it's currently used, the number of ALT alleles per sample is capped at 2.

I think it would be a lot simpler (and better) to do localisation as part of the encode step. Here's my first steps at implementing this. Basically, --local-alleles becomes an argument to schema generation, and any known fields (LAA, LPL) will simply be present in the schema with vcf_field set to null if they are to be computed. If those fields are present in the VCF itself, they'll just be converted as normal.

Actual generation of the fields then is done in a way that's analagous to how we handle genotypes during encode.

@Will-Tyler, would you mind taking a quick look at this initial sketch to see what you think? (I haven't taken out any of the code from explode yet)

@coveralls
Copy link
Collaborator

coveralls commented Jan 9, 2025

Coverage Status

coverage: 98.851% (+0.02%) from 98.835%
when pulling 45db77b on jeromekelleher:local-alleles-finalise
into bb380eb on sgkit-dev:main.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 97.082% (-1.8%) from 98.87%
when pulling 8c28169 on jeromekelleher:local-alleles-finalise
into 883a37e on sgkit-dev:main.

Copy link
Contributor

@Will-Tyler Will-Tyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. I'm happy that you were able to simplify the LAA computation logic. I'm not as familiar with the encode subcommand, but the local alleles logic looks correct to me.

Comment on lines +222 to +236
pl.shape = (*shape, 3)
pl.chunks = (*chunks, 3)
pl.description += " (local-alleles)"
# TODO fix dimensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third dimension depends on the ploidy but should be 6 for diploid samples. I assume that is what the TODO refers to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below

laa = ZarrArraySpec.new(
vcf_field=None,
name="call_LAA",
dtype="i1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting "i4" based on the LAA computation code. Why do we use "i1"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contains allele indexes, so if it's more than 1 byte to store these we're in trouble (should check, though)

@jeromekelleher
Copy link
Contributor Author

I've put quite a bit of work into this now and I think it's nearly ready to merge and release in an "experimental" state. I've ended up deviating a bit from the VCF spec by using the LA field instead of LAA, as this allows us to differentiate between cases where refs values are referenced and not. If we don't do this and we treat REF specially then things get a lot more complicated. Doing it this way lets us specifically track only the (max) two alleles that are referenced in the genotypes, and therefore the PL dimension is 3.

Since the only actual implementation that we know about also uses LA instead of LAA, I think this is fine. Until we have an --delocalise-alleles option in vcztools this is all quite hard to roundtrip test anyway and we can just say that it's experimental. This implementation does massively reduce the space taken up by PL and AD fields, though.

The tests depended on complex semantics of mixed local and non local
fields, which doesn't seem worth the effort.

Fixup du test
@jeromekelleher jeromekelleher marked this pull request as ready for review January 17, 2025 12:11
@jeromekelleher
Copy link
Contributor Author

I think this is basically ready to go, I'll open some issues to track the rougher edges that'll need fixing later on.

@jeromekelleher
Copy link
Contributor Author

I'm going to go ahead and merge this as we want to use this code for some examples in the paper.

@jeromekelleher jeromekelleher merged commit 571cc21 into sgkit-dev:main Jan 17, 2025
16 checks passed
@jeromekelleher jeromekelleher deleted the local-alleles-finalise branch January 17, 2025 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants