-
Notifications
You must be signed in to change notification settings - Fork 3
BGEN reader implementation using bgen_reader #1
Conversation
Make coverage 100%. Add GH Action to run test and build.
This should be ready to go in now, following the addition of |
self.partition_size = mf.partition_size | ||
|
||
df = mf.create_variants() | ||
if persist: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Quite cool that bgen reader is already giving back dask frames!
Looks good @tomwhite! I had a few questions but my biggest one would be whether or not it's possible to create the dosage array based on the dask delayed instances bgen_reader looks to be creating already. In other words, I'm wondering if you couldn't take the list of delayed's containing the dict with probs = da.block([
[
da.from_delayed(bgen['genotype'][row_slice]['probs'][col_slice])
for col_slice in col_slices
]
for row_slice in row_slices
])
dosage = 2 * probs[:, :, 2] + probs[:, :, 1] # or something like this I don't fully understand how |
Thanks for the thorough review @eric-czech. On your main point, using the dask delayed instances that bgen_reader is already creating would be ideal, but as it is written each delayed object wraps one row (https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46), which I don't think is very efficient for bulk access. This is why I had to reach in to some internal methods to find the virtual addresses for each chunk, and then read those in one go. I'll address the other feedback with an updated PR. |
Ah I see, that's a bummer! |
Can we file an issue upstream to optimize the bulk access use case? |
Addressed all the feedback from @eric-czech |
Thanks @tomwhite, looks fantastic! |
Add a BGEN reader implementation which is chunked using
dask.array.from_array
. The variant metadata is also chunked using the in-built Dask support in bgen_reader.This is not ready to be merged since it relies on changes that are not in sgkit to add data representation for dosages (see https://github.com/tomwhite/sgkit/tree/dosages). There is also discussion in https://github.com/pystatgen/sgkit/issues/21