Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Add option to specify samples and metafile location #6

Open
eric-czech opened this issue Aug 4, 2020 · 8 comments
Open

Add option to specify samples and metafile location #6

eric-czech opened this issue Aug 4, 2020 · 8 comments

Comments

@eric-czech
Copy link
Collaborator

It would be useful for the read_bgen function to have arguments for custom metafile and sample paths. This should pass down to the bgen_reader.read_bgen arguments for those things.

The pressing need I have for this is that bgen_reader is requiring filesystem properties that aren't supported by gcsfuse. Example:

path = osp.expanduser('~/data/rs-ukb/raw-data/gt-imputation/ukb_imp_chr21_v3.bgen')
read_bgen(path)
# OSError: [Errno 5] Input/output error
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
 in 
----> 1 ds = load(21)
      2 ds

in load(contig)
2 path = osp.join(input_path, f'ukb_imp_chr{contig}_v3.bgen')
3 print(path)
----> 4 ds = read_bgen(path)
5 return ds

~/repos/sgkit-bgen/sgkit_bgen/bgen_reader.py in read_bgen(path, chunks, lock, persist)
175 """
176
--> 177 bgen_reader = BgenReader(path, persist)
178
179 variant_contig, variant_contig_names = encode_array(bgen_reader.contig.compute())

~/repos/sgkit-bgen/sgkit_bgen/bgen_reader.py in init(self, path, persist, dtype)
46 self.path = Path(path)
47
---> 48 self.metafile_filepath = _infer_metafile_filepath(Path(self.path))
49 if not self.metafile_filepath.exists():
50 create_metafile(path, self.metafile_filepath, verbose=False)

~/miniconda3/envs/ukb-analysis/lib/python3.7/site-packages/bgen_reader/_reader.py in _infer_metafile_filepath(bgen_filepath)
148 return BGEN_READER_CACHE_HOME / "metafile" / path_to_filename(metafile)
149 else:
--> 150 if is_file_writable(metafile):
151 return metafile
152

~/miniconda3/envs/ukb-analysis/lib/python3.7/site-packages/bgen_reader/_file.py in is_file_writable(filepath)
41 def is_file_writable(filepath: Path):
42 try:
---> 43 _touch(filepath)
44 except PermissionError:
45 return False

~/miniconda3/envs/ukb-analysis/lib/python3.7/site-packages/bgen_reader/_file.py in _touch(filepath, mode, dir_fd, **kwargs)
86 f.fileno() if os.utime in os.supports_fd else filepath,
87 dir_fd=None if os.supports_fd else dir_fd,
---> 88 **kwargs,
89 )

OSError: [Errno 5] Input/output error

I imagine I can get around this temporarily by having the metafiles written to a local directory instead, but I'm not sure how we should do this in a distributed environment with remote storage.

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 4, 2020

We can certainly do that, but I wonder how gcsfuse is going to perform for large files anyway. This may be a reason to keep the PyBGEN reader around...

@eric-czech
Copy link
Collaborator Author

Is the metafile only a useful optimization after the file has been read in full at least once? I'm wondering if we should push upstream to have its use be optional. Most applications I can think of would only involve a few full scans of the data and no random access.

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 4, 2020

No, I think its use is crucial. Metafile construction is fairly cheap (cheaper than reading the whole file) and is needed to find the virtual offsets that allow the file to be read in parallel.

@eric-czech
Copy link
Collaborator Author

Ah gotcha. Do you have any suggestions on what our strategy should be for doing this on a cluster? I'm stumped as to where to go next with UKB.

On one hand it seems straightforward enough to simply copy certain files to local persistent disks but they're only broken up by chromosome so to get more than 25 reads in parallel I can see it getting a bit tricky to coordinate copies and reads potentially across chromosome boundaries.

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 4, 2020

Is there a BGEN file per chromosome? If so, one thing to try would be to copy the BGEN file to local persistent disk and run the conversion there (with multiple threads). This would give a baseline, and give an idea of how feasible it would be to convert each BGEN file on a separate cloud box (each with a hundred cores say).

Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly. PyBGEN uses bgenix index files (not metafiles), which need to be generated before the BGEN files can be read. The bgenix utility is in C++ and doesn't support cloud stores. I think Glow might have support for that, or we could write something perhaps. It would be good to know how PyBGEN and bgen-reader compare peformance-wise before we do that though.

@eric-czech
Copy link
Collaborator Author

Is there a BGEN file per chromosome?

There is as well as an accompanying bgen index file.

Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly

That's all here right?

It would be good to know how PyBGEN and bgen-reader compare peformance-wise before we do that though.

Makes sense. Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 4, 2020

That's all here right?

Correct, although it returns the old xarray representation.

Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.

Good idea, maybe try on chr22?

@eric-czech
Copy link
Collaborator Author

Update: Copying the files locally has worked well enough and plays nicely with a typical snakemake-on-gke setup. Being able to specify the metafile and samples locations would still be helpful though, since UKB does not use the same name and it would provide some parity with sgkit-plink.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants