Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for reading from standard input? #29

Open
mikemc opened this issue Sep 24, 2020 · 3 comments
Open

Support for reading from standard input? #29

mikemc opened this issue Sep 24, 2020 · 3 comments

Comments

@mikemc
Copy link

mikemc commented Sep 24, 2020

My understanding is that currently BURST can only take in reads as an unzipped Fasta file. I am wondering if taking reads from standard input might be implemented at some point. Here are two use cases where I would find it helpful to be able to pass the reads in from standard input:

  1. My sequence data is stored in Gzipped Fastq files, but BURST wants unzipped Fasta files. It would be handy if I could simply pipe the output from a tool like seqtk (which can convert to unzipped Fasta format) into BURST, rather than having to save the intermediate Fasta files, which I don't need for any other purpose. In this way, allowing standard input partially gets around BURST's atypically need for reads as unzipped Fasta files.

  2. I am testing mapping pipelines on simulated reads. It would be handy sometimes to be able to pipe the simulated reads right into BURST.

@GabeAl
Copy link
Collaborator

GabeAl commented Sep 29, 2020

Hi mikemc,

It is (unofficially) supported. :) You can pipe fasta format into BURST, but be sure to use "/dev/stdin" as the input filename. This feature hasn't been tested much.

Thanks,
Gabe

@GabeAl
Copy link
Collaborator

GabeAl commented Sep 29, 2020

One more comment -- BURST is designed to map multiple samples of shallow shotgun sequencing simultaneously, rather than one sample at a time. This is the reason for its fasta format requirement. Preprocessing is expected to take place with SHI7 (my QC toolkit), as this toolkit multiplexes multiple samples into a single "combined_seqs.fna" fasta file, much like QIIME1.

In this way, there is little distinction between amplicon and shallow shotgun sequencing -- both can be preprocessed by SHI7 in the same way, multiplexed in the same way, and aligned all together with BURST. BURST performs faster, and its capitalist algorithm performs better, when there are more queries in the same input file (scaling sublinearly with sequence depth, provided there is enough RAM to house the entire combined_seqs.fna).

It is not the expectation that users simply convert existing fastq's to fasta format and run alignment (although doing so would still work).

The expected pipeline is:

  1. SHI7 to produce the QC'd, adaptor-trimmed, stitched (potentially) or appended R1/R2 if paired-end was used, in a single combined_seqs.fna file that includes all samples with sample names delimited by underscore
  2. BURST to align the single combined_seqs.fna file in one run
  3. embalmulate on the resulting .b6 alignment file to produce the final table of counts (which it will split per sample again based on sample name).

Hope this clarifies!

@mikemc
Copy link
Author

mikemc commented Sep 29, 2020

Thanks Gabe, both clarifications are very helpful. I have been BURST as a point of comparison for BBMap/BBSplit in a BBTools-based pipeline, so haven't bothered with SHI7 but will look into it for future use, especially if I were to use Capitalist. (Allpaths and Forage modes are most useful for my current needs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants