Run classification step in chunks in case of large number of ASVs #827

Open
jtangrot opened this issue Jan 24, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@jtangrot
Contributor

Description of feature

As discussed on Slack, the classification step (in particular DADA2_ADDSPECIES) uses excessive amounts of RAM when a large number of ASVs is generated, e.g. on the order of 60,000-100,000 ASVs. A suggested solution is to run the classification step in batches by splitting the ASV_seqs.fasta file and then merging the resulting taxonomy files. It would be great if this could be implemented in the pipeline instead of having to run it manually.

@jtangrot added the enhancement (New feature or request) label Jan 24, 2025
@d4straub
Collaborator

Thanks!
Chunking can be done as described in https://nextflow-io.github.io/patterns/process-per-file-chunk/, and the per-chunk output files can be collected and merged again afterwards. It should not be too complicated to implement. A parameter for the chunk size would be good, so that one can split to the desired size, with a default of probably 10k or so.
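A rough sketch of what this could look like in Nextflow DSL2, using the splitFasta operator from that pattern. The parameter name (params.addspecies_chunksize), the process names (CHUNKED_ADDSPECIES, MERGE_TAXONOMY), and the run_dada2_taxonomy.R script are hypothetical placeholders, not existing ampliseq code, and it assumes the per-chunk taxonomy output is a TSV with a header row:

```nextflow
// Sketch only: parameter, process, and script names below are illustrative placeholders.
params.addspecies_chunksize = 10000

process CHUNKED_ADDSPECIES {
    input:
    path fasta_chunk

    output:
    path 'tax_chunk.tsv', emit: tsv

    script:
    """
    # run DADA2 assignTaxonomy/addSpecies on this chunk only (placeholder script)
    run_dada2_taxonomy.R ${fasta_chunk} tax_chunk.tsv
    """
}

process MERGE_TAXONOMY {
    input:
    path 'tax_chunk_*.tsv'   // collected per-chunk results, staged with unique names

    output:
    path 'ASV_tax_species.tsv'

    script:
    """
    # keep the header of the first file, skip the header rows of the rest
    awk 'FNR == 1 && NR != 1 { next } { print }' tax_chunk*.tsv > ASV_tax_species.tsv
    """
}

workflow {
    // split the ASV fasta into chunks, classify each chunk independently,
    // then merge the per-chunk taxonomy tables into one file
    ch_chunks = Channel
        .fromPath('ASV_seqs.fasta')
        .splitFasta(by: params.addspecies_chunksize, file: true)

    CHUNKED_ADDSPECIES(ch_chunks)
    MERGE_TAXONOMY(CHUNKED_ADDSPECIES.out.tsv.collect())
}
```

The default of 10000 mirrors the chunk size suggested above and would be exposed as a pipeline parameter so users can tune it to their available RAM.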
[Unfortunately, I don't currently see a time window where I could do that.]

@jtangrot
Contributor Author

Unfortunately, I have no time to spend on this myself at the moment.
