Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

BGEN reader implementation using bgen_reader #1

Merged
merged 8 commits into from
Jul 24, 2020

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Jul 8, 2020

Add a BGEN reader implementation which is chunked using dask.array.from_array. The variant metadata is also chunked using the in-built Dask support in bgen_reader.

This is not ready to be merged since it relies on changes that are not in sgkit to add data representation for dosages (see https://github.com/tomwhite/sgkit/tree/dosages). There is also discussion in https://github.com/pystatgen/sgkit/issues/21

tomwhite added 2 commits July 21, 2020 10:31
Make coverage 100%.
Add GH Action to run test and build.
@tomwhite
Copy link
Collaborator Author

This should be ready to go in now, following the addition of create_genotype_dosage_dataset in sgkit.

setup.cfg Show resolved Hide resolved
self.partition_size = mf.partition_size

df = mf.create_variants()
if persist:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Quite cool that bgen reader is already giving back dask frames!

@eric-czech
Copy link
Collaborator

eric-czech commented Jul 21, 2020

Looks good @tomwhite! I had a few questions but my biggest one would be whether or not it's possible to create the dosage array based on the dask delayed instances bgen_reader looks to be creating already. In other words, I'm wondering if you couldn't take the list of delayed's containing the dict with probs in something like https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L54 and wrap them up without needing from_array. Maybe something vaguely like:

probs = da.block([
  [ 
    da.from_delayed(bgen['genotype'][row_slice]['probs'][col_slice])
    for col_slice in col_slices
  ]
  for row_slice in row_slices
])
dosage = 2 * probs[:, :, 2] + probs[:, :, 1]  # or something like this

I don't fully understand how bgen_reader works but it looked like that might be a possibility.

@tomwhite
Copy link
Collaborator Author

Thanks for the thorough review @eric-czech. On your main point, using the dask delayed instances that bgen_reader is already creating would be ideal, but as it is written each delayed object wraps one row (https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46), which I don't think is very efficient for bulk access. This is why I had to reach in to some internal methods to find the virtual addresses for each chunk, and then read those in one go.

I'll address the other feedback with an updated PR.

@eric-czech
Copy link
Collaborator

as it is written each delayed object wraps one row

Ah I see, that's a bummer!

@hammer
Copy link

hammer commented Jul 23, 2020

as it is written each delayed object wraps one row (https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46), which I don't think is very efficient for bulk access.

Can we file an issue upstream to optimize the bulk access use case?

@tomwhite
Copy link
Collaborator Author

Addressed all the feedback from @eric-czech

@eric-czech
Copy link
Collaborator

Thanks @tomwhite, looks fantastic!

@tomwhite tomwhite merged commit 47a7db7 into sgkit-dev:master Jul 24, 2020
@tomwhite tomwhite deleted the bgen-reader branch July 27, 2020 08:18
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants