BGEN reader implementation using bgen_reader #1

tomwhite · 2020-07-08T12:12:11Z

Add a BGEN reader implementation which is chunked using dask.array.from_array. The variant metadata is also chunked using the in-built Dask support in bgen_reader.
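
A minimal sketch of the chunking approach described above, not the code in this PR. `LazyGenotypeReader` is a hypothetical class, and synthetic data stands in for file I/O. It relies only on the fact that `dask.array.from_array` accepts any object exposing `shape`, `dtype`, `ndim` and NumPy-style slicing, and reads one block per chunk through `__getitem__`:

```python
import dask.array as da
import numpy as np


class LazyGenotypeReader:
    """Hypothetical array-like standing in for a file-backed BGEN reader."""

    def __init__(self, n_variants, n_samples):
        # Dask only needs shape, dtype, ndim and NumPy-style slicing.
        self.shape = (n_variants, n_samples, 3)
        self.dtype = np.dtype("float32")
        self.ndim = 3

    def __getitem__(self, idx):
        # This sketch assumes a tuple of three slices, one per dimension.
        row_sl, col_sl, geno_sl = idx
        rows = np.arange(*row_sl.indices(self.shape[0]))
        n_cols = len(range(*col_sl.indices(self.shape[1])))
        # Synthetic data: variant i has probability 1 for genotype i % 3.
        onehot = np.eye(3, dtype=self.dtype)[rows % 3]
        block = np.repeat(onehot[:, None, :], n_cols, axis=1)
        return block[:, :, geno_sl]


reader = LazyGenotypeReader(n_variants=10, n_samples=4)
# Two chunks of five variants each; nothing is read until compute().
probs = da.from_array(reader, chunks=(5, 4, 3), lock=False)

# Expected dosage: P(het) + 2 * P(hom alt), written as a weighted sum.
weights = np.array([0.0, 1.0, 2.0], dtype="float32")
dosage = (probs * weights).sum(axis=-1)
result = dosage.compute()
```

A real reader would replace the synthetic block with a read of the corresponding variant range from the file, so the chunk size along the variant axis controls how much is read per task.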

This is not ready to be merged since it relies on changes that are not in sgkit to add data representation for dosages (see https://github.com/tomwhite/sgkit/tree/dosages). There is also discussion in https://github.com/pystatgen/sgkit/issues/21

Make coverage 100%. Add GH Action to run test and build.

tomwhite · 2020-07-21T09:39:10Z

This should be ready to go in now, following the addition of create_genotype_dosage_dataset in sgkit.

setup.cfg

sgkit_bgen/bgen_reader.py

sgkit_bgen/tests/test_bgen_reader.py

eric-czech · 2020-07-21T15:37:40Z

sgkit_bgen/bgen_reader.py

+            self.partition_size = mf.partition_size
+
+            df = mf.create_variants()
+            if persist:


👍

Quite cool that bgen reader is already giving back dask frames!

eric-czech · 2020-07-21T16:06:24Z

Looks good @tomwhite! I had a few questions, but my biggest one is whether it's possible to create the dosage array from the dask delayed instances that bgen_reader already appears to create. In other words, I'm wondering if you could take the list of delayed objects containing the dict with probs in something like https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L54 and wrap them up without needing from_array. Maybe something vaguely like:

probs = da.block([
  [ 
    da.from_delayed(bgen['genotype'][row_slice]['probs'][col_slice])
    for col_slice in col_slices
  ]
  for row_slice in row_slices
])
dosage = 2 * probs[:, :, 2] + probs[:, :, 1]  # or something like this

I don't fully understand how bgen_reader works but it looked like that might be a possibility.
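
For reference, a runnable sketch of that pattern with synthetic data. `read_probs` is a hypothetical loader and does not reflect how bgen_reader structures its delayed objects. Two details need care: `da.from_delayed` must be given the shape and dtype of each block up front, and `da.block` (like `np.block`) joins the innermost list along the last axis, so 3-D blocks need one extra level of nesting to be tiled over variants and samples:

```python
import dask
import dask.array as da
import numpy as np

N_GENOTYPES = 3


@dask.delayed
def read_probs(row_slice, col_slice):
    """Hypothetical loader standing in for a read of one block of a file."""
    rows = np.arange(row_slice.start, row_slice.stop)
    n_cols = col_slice.stop - col_slice.start
    # Synthetic data: variant i has probability 1 for genotype i % 3.
    onehot = np.eye(N_GENOTYPES)[rows % N_GENOTYPES]
    return np.repeat(onehot[:, None, :], n_cols, axis=1)


def lazy_block(row_slice, col_slice):
    # from_delayed cannot infer shape or dtype, so both are stated explicitly.
    shape = (
        row_slice.stop - row_slice.start,
        col_slice.stop - col_slice.start,
        N_GENOTYPES,
    )
    return da.from_delayed(read_probs(row_slice, col_slice), shape=shape, dtype=float)


row_slices = [slice(0, 2), slice(2, 5)]
col_slices = [slice(0, 3), slice(3, 4)]

# Nesting depth 3: outer list -> axis 0 (variants), middle list -> axis 1
# (samples), innermost single-element list -> axis 2 (genotypes).
probs = da.block(
    [[[lazy_block(r, c)] for c in col_slices] for r in row_slices]
)
dosage = 2 * probs[:, :, 2] + probs[:, :, 1]
result = dosage.compute()
```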

tomwhite · 2020-07-23T12:20:07Z

Thanks for the thorough review @eric-czech. On your main point, using the dask delayed instances that bgen_reader is already creating would be ideal, but as it is written each delayed object wraps one row (https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46), which I don't think is very efficient for bulk access. This is why I had to reach into some internal methods to find the virtual addresses for each chunk, and then read those in one go.
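
To illustrate the concern about one delayed object per row, here is a small sketch with synthetic data. `read_rows` is a hypothetical stand-in, not bgen_reader's internals. It only shows how the number of tasks Dask has to schedule grows when each task covers a single variant rather than a chunk of variants:

```python
import dask
import dask.array as da
import numpy as np

N_VARIANTS, N_SAMPLES = 1000, 8


@dask.delayed
def read_rows(start, stop):
    """Hypothetical stand-in for reading variants [start, stop) in one go."""
    return np.zeros((stop - start, N_SAMPLES, 3), dtype="float32")


def build(rows_per_task):
    # One delayed read per group of `rows_per_task` variants.
    parts = []
    for start in range(0, N_VARIANTS, rows_per_task):
        stop = min(start + rows_per_task, N_VARIANTS)
        parts.append(
            da.from_delayed(
                read_rows(start, stop),
                shape=(stop - start, N_SAMPLES, 3),
                dtype="float32",
            )
        )
    return da.concatenate(parts, axis=0)


per_row = build(rows_per_task=1)      # one task per variant
per_chunk = build(rows_per_task=100)  # one task per 100 variants

print(per_row.npartitions, per_chunk.npartitions)
```

Both arrays have the same shape, but the per-row version has 100 times as many partitions, each of which is a separate read for the scheduler to manage.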

I'll address the other feedback with an updated PR.

eric-czech · 2020-07-23T13:17:33Z

> as it is written each delayed object wraps one row

Ah I see, that's a bummer!

hammer · 2020-07-23T13:44:37Z

> as it is written each delayed object wraps one row (https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46), which I don't think is very efficient for bulk access.

Can we file an issue upstream to optimize the bulk access use case?

tomwhite · 2020-07-24T13:49:16Z

Addressed all the feedback from @eric-czech

eric-czech · 2020-07-24T14:31:14Z

Thanks @tomwhite, looks fantastic!

tomwhite mentioned this pull request Jul 13, 2020

Add create_genotype_dosage_dataset sgkit-dev/sgkit#38

Merged

tomwhite added 2 commits July 21, 2020 10:31

BGEN reader implementation using bgen_reader

a014507

Use encode_array from sgkit.

3748617

Make coverage 100%. Add GH Action to run test and build.

tomwhite force-pushed the bgen-reader branch from 41b7d5d to 3748617 Compare July 21, 2020 09:32

tomwhite requested a review from eric-czech July 21, 2020 09:38