Feature Request: Multi-database support options for `sourmash gather` #536

tnmquann · 2024-12-01T09:04:33Z

I’m exploring options to enhance the comprehensiveness of microbial profiling using sourmash gather/multigather. Specifically, I’m curious if it’s possible to use multiple databases simultaneously in a single command. The branchwater plugin documentation doesn’t mention this feature, but it would be invaluable for achieving a more holistic view of all microorganisms in a sample. This could also benefit users with limited storage or computational capacity who cannot retain all raw sequences for re-sketching later.

For example:

sourmash scripts fastmultigather ./manysketch.zip /sourmash_databases/*.zip -c 20 -k 51 -o /manysketch/sourmash_gather.csv

Additionally, I have a couple of feature suggestions:

Is there a way to merge multiple pre-sketched databases (e.g., GTDB + GenBank viral + GenBank protozoa + GenBank fungi) into a single large database? I followed the steps in Error when use fastmultigather against rocksdb (Error: No such file or directory (os error 2) - Tested with multiple cases) #381 to create a *.zip database by unzipping and merging, but this failed for some samples with similar hashes. I encountered files differentiated only by a number suffix, such as sig1.siz.gz, sig1.siz.gz_1, sig1.siz.gz_2, which complicates merging and management.
Would it be feasible to create a RocksDB that integrates multiple databases? This could streamline managing and querying large, multi-database sketches more efficiently.

Looking forward to hearing your thoughts on these ideas and any potential solutions! Thanks!


ctb · 2024-12-01T14:43:46Z

hi @tnmquann, agree! Please take a look at:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/tree/main/doc#using-manifests-for-input-databases---why-and-when

In recent releases (v0.9.8 and beyond, with important bug fixes through v0.9.11, the current release) we added full support for standalone manifests, which should address most of your use cases above. You can now create a single manifest CSV with sourmash sig collect that points at multiple different files, including .sig.zip files. This would be the Right Way to combine multiple GenBank databases.

Please let us know how we can make this clearer in the documentation!

You can also use this to create a single RocksDB database from multiple inputs.
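
The two steps described above (one standalone manifest, then one RocksDB built from it) might look roughly like the dry-run sketch below. It only prints the commands instead of running them. The `combine_plan` helper and the `combined.*` output names are hypothetical, and the flags shown (`-F csv`, `-o`, `-k`) are assumptions that should be checked against `sourmash sig collect --help` and `sourmash scripts index --help` for the installed versions.

```shell
# Dry-run sketch: prints the commands rather than executing them.
# combine_plan is a hypothetical helper, not part of sourmash.
# The flags below are assumptions; verify them with --help before use.
combine_plan() (
    manifest="$1"
    rocksdb="$2"
    shift 2
    # Step 1: a single standalone manifest CSV that points at every input database.
    echo "sourmash sig collect -F csv -o $manifest $*"
    # Step 2: a single RocksDB index built from that manifest.
    echo "sourmash scripts index $manifest -k 31 -o $rocksdb"
)

combine_plan combined.mf.csv combined.rocksdb \
    gtdb-rs214-reps.k31.zip genbank-2022.03-viral-k31.zip
```

Printing the plan first makes it easy to review the input list before committing to a long indexing run; the printed lines can then be pasted into a shell once the flags are confirmed.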

Unfortunately, at the moment there is no way to use multiple RocksDB databases as a search target without loading them all into memory (which defeats the purpose of them, yes). This is not something I think will improve in the next few releases.

tnmquann · 2024-12-04T09:34:26Z

Hi @ctb,

Sorry for the late response. I’ve been testing using a pathlist to create a general RocksDB. Everything is going smoothly except for a warning and a JSON load failure when using sourmash scripts index:

== This is sourmash version 4.8.11. ==  
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==  

ksize: 31 / scaled: None / moltype: DNA  
indexing all sketches in 'pathlist.txt'  
Loading sketches from pathlist.txt  
Reading analysis(s) from: 'pathlist.txt'  
FAILED to load as JSON files; falling back to general recursive  
Loaded 144633 analysis signature(s)  
Found 144633 sketches total.  
WARNING: loading all sketches into memory in order to index.  
See 'index' documentation for details.

Additionally, I’ve noticed issues when using sourmash scripts fastmultigather with a txt file containing database locations. It works when the paths point to .zip or .sig.gz files but fails when pointing to multiple RocksDBs or when mixing formats in the list. (Maybe I misunderstood something in this guide?)
Below are the three configurations I tested:

Working (with the JSON fallback warning):

/database/genbank2022_03/genbank-2022.03-fungi-k31.zip  
/database/genbank2022_03/genbank-2022.03-protozoa-k31.zip  
/database/genbank2022_03/genbank-2022.03-viral-k31.zip  
/database/gtdb-rs214/gtdb-rs214-reps.k31.zip

Not working:

/database/gtdb-rs214/gtdb-rs214-reps.k31.rocksdb  
/database/genbank2022_03/genbank-2022.03-viral-k31.rocksdb  
/database/genbank2022_03/genbank-2022.03-protozoa-k31.rocksdb  
/database/genbank2022_03/genbank-2022.03-fungi-k31.rocksdb

Also not working:

/database/gtdb-rs214/gtdb-rs214-reps.k31.rocksdb  
/database/genbank2022_03/genbank-2022.03-fungi-k31.zip  
/database/genbank2022_03/genbank-2022.03-protozoa-k31.zip  
/database/genbank2022_03/genbank-2022.03-viral-k31.zip
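
To check a pathlist against the three configurations above before launching a run, a small pre-flight helper along these lines could be used. This is a minimal sketch: `classify_pathlist` is a hypothetical function, not part of sourmash, and it assumes RocksDB targets can be recognized by a `.rocksdb` suffix, as in the paths listed in this comment. The labels only mirror the behavior reported here (sketch files worked; RocksDB-only and mixed lists did not).

```shell
# Hypothetical pre-flight check; not part of sourmash.
# Assumes RocksDB entries end in ".rocksdb" (as in the paths in this thread).
classify_pathlist() (
    rocksdb=0
    other=0
    while IFS= read -r line || [ -n "$line" ]; do
        # Strip trailing whitespace (the lists above carry trailing spaces).
        line="${line%"${line##*[![:space:]]}"}"
        # Skip blank lines.
        [ -z "$line" ] && continue
        case "$line" in
            *.rocksdb | *.rocksdb/) rocksdb=$((rocksdb + 1)) ;;
            *) other=$((other + 1)) ;;
        esac
    done < "$1"
    if [ "$rocksdb" -eq 0 ]; then
        echo "sketch-files-only"   # reported as working
    elif [ "$other" -eq 0 ]; then
        echo "rocksdb-only"        # reported as failing with several RocksDBs
    else
        echo "mixed"               # reported as failing
    fi
)
```

For example, `classify_pathlist pathlist.txt` prints one of the three labels for the given list.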

For now, loading all databases into memory is manageable as they don’t consume too much.
I’ll update if any other errors occur once the new RocksDB is fully indexed and I run it on the sample set. Thanks for your input!

ctb · 2024-12-04T12:54:09Z

I'll take a look - thanks!
