Run classification step in chunks in case of large number of ASVs #827

Open
jtangrot opened this issue Jan 24, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@jtangrot
Contributor

Description of feature

As discussed on Slack, the classification step (in particular DADA2_ADDSPECIES) uses excessive amounts of RAM when a large number of ASVs is generated, e.g. on the order of 60,000-100,000 ASVs. A suggested solution is to run the classification step in batches by splitting the ASV_seqs.fasta file and then merging the resulting taxonomy files. It would be great if this could be implemented in the pipeline instead of having to run it manually.

@jtangrot added the enhancement (New feature or request) label Jan 24, 2025
@d4straub
Collaborator

Thanks!
Chunking can be done as described in https://nextflow-io.github.io/patterns/process-per-file-chunk/, and the per-chunk output files can be collected and merged again afterwards. It should not be too complicated to implement. A parameter for the chunk size would be good, so that one can split to the desired size, with a default of probably 10k or so.
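A rough sketch of what this could look like in Nextflow DSL2, using the splitFasta operator from that pattern. The parameter name (params.addspecies_chunksize), the process names (CHUNKED_ADDSPECIES, MERGE_TAXONOMY), and the run_dada2_taxonomy.R script are hypothetical placeholders, not existing ampliseq code, and it assumes the per-chunk taxonomy output is a TSV with a header row:

```nextflow
// Sketch only: parameter, process, and script names below are illustrative placeholders.
params.addspecies_chunksize = 10000

process CHUNKED_ADDSPECIES {
    input:
    path fasta_chunk

    output:
    path 'tax_chunk.tsv', emit: tsv

    script:
    """
    # run DADA2 assignTaxonomy/addSpecies on this chunk only (placeholder script)
    run_dada2_taxonomy.R ${fasta_chunk} tax_chunk.tsv
    """
}

process MERGE_TAXONOMY {
    input:
    path 'tax_chunk_*.tsv'   // collected per-chunk results, staged with unique names

    output:
    path 'ASV_tax_species.tsv'

    script:
    """
    # keep the header of the first file, skip the header rows of the rest
    awk 'FNR == 1 && NR != 1 { next } { print }' tax_chunk*.tsv > ASV_tax_species.tsv
    """
}

workflow {
    // split the ASV fasta into chunks, classify each chunk independently,
    // then merge the per-chunk taxonomy tables into one file
    ch_chunks = Channel
        .fromPath('ASV_seqs.fasta')
        .splitFasta(by: params.addspecies_chunksize, file: true)

    CHUNKED_ADDSPECIES(ch_chunks)
    MERGE_TAXONOMY(CHUNKED_ADDSPECIES.out.tsv.collect())
}
```

The default of 10000 mirrors the chunk size suggested above and would be exposed as a pipeline parameter so users can tune it to their available RAM.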
[Unfortunately, I don't currently see a time window where I could do that.]

@jtangrot
Contributor Author

Unfortunately, I have no time to spend on this myself at the moment.
