Can bgzip multi-threading be combined with random access reads? #139

Open
zdk123 opened this issue Dec 19, 2022 · 1 comment

zdk123 commented Dec 19, 2022

Is there a general strategy for simultaneous random access to GZ blocks in an indexed gz file?

For large files, I know the virtual positions and buffer sizes I want to access, so when these are totally disjoint, reading them from multiple threads should give a nice speed-up. Right now, this can't be done via the indexed reader, since the virtual position is a mutable property of the reader. However, I notice that contiguous blocks can be read multi-threaded, so, in principle, why not discontiguous blocks?

thanks!

@zaeleus zaeleus added the bgzf label Dec 20, 2022
@JShermanK1

Using the sync_file crate, you can clone the file handle into a new reader for each thread. Each clone of the SyncFile maintains its own independent position, which is what makes "multi-threaded reads" possible. As I understand it, reading the bytes off the disk is still sequential, but the decompression that follows the disk read can then happen concurrently, and that is where most of the time is spent anyway.

use itertools::Itertools;
use noodles::bcf;
use rayon::prelude::*;
use sync_file::SyncFile;

// Assumes this runs inside a function returning a Result, and that `path`,
// `index` (the already-loaded index for the file), and `do_stuff` are
// defined elsewhere.
let f = SyncFile::open(path)?;

// Read the header once up front so every thread can share it.
let header = {
    let mut bcf_r = bcf::Reader::new(f.clone());
    bcf_r.read_file_format()?;
    bcf_r.read_header()?
};

// One parallel task per contig.
(0..)
    .map_while(|i| header.contigs().get_index(i))
    .collect_vec()
    .into_par_iter()
    .for_each(|chrom| {
        // Each task builds its own reader over a cloned handle, so it has an
        // independent position in the file.
        let mut bcf_r = bcf::Reader::new(f.clone());
        bcf_r.read_file_format().expect("failed to read format");
        let _r_header = bcf_r.read_header().expect("failed to read header");

        // The contig name alone is a region covering the whole contig.
        let region = chrom.0.to_string().parse().expect("failed to parse region");
        let records = bcf_r
            .query(&header, &index, &region)
            .expect("failed to query index");

        records.for_each(|record| {
            do_stuff(record);
        });
    });
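
For the lower-level case in the question (known virtual positions and buffer sizes, all disjoint), here is a minimal std-only sketch of the same idea: positional reads that never touch a shared cursor, so each thread can fetch its own byte range. It is not noodles or sync_file code. The helper names `split_virtual_position` and `read_ranges` are made up for illustration, the positional read is Unix-only (`std::os::unix::fs::FileExt`), and decompressing each fetched block is left out. The one fact relied on is the BGZF layout of a virtual position: the upper 48 bits are the compressed block offset and the lower 16 bits are the offset inside the uncompressed block.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::thread;

/// Splits a BGZF virtual position into
/// (compressed block offset, offset within the uncompressed block).
fn split_virtual_position(vpos: u64) -> (u64, u16) {
    (vpos >> 16, (vpos & 0xffff) as u16)
}

/// Reads each `(offset, len)` byte range on its own thread.
/// Hypothetical helper: `read_exact_at` takes the offset as an argument,
/// so the threads share one `File` without sharing a position.
fn read_ranges(file: &File, ranges: &[(u64, usize)]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn all reads first so they can run concurrently.
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(offset, len)| {
                s.spawn(move || {
                    let mut buf = vec![0u8; len];
                    file.read_exact_at(&mut buf, offset)?;
                    Ok::<_, io::Error>(buf)
                })
            })
            .collect();

        // Join in the original order; the first I/O error is returned.
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

To use it, take the compressed offset from `split_virtual_position` for each virtual position, pass those offsets with the compressed block lengths to `read_ranges`, and then decompress the returned buffers in parallel. One thread per range is only reasonable for a handful of ranges; a pool such as rayon is probably the better fit for many.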
