Feature Request: Multi-database support options for sourmash gather #536

Open
tnmquann opened this issue Dec 1, 2024 · 3 comments


tnmquann commented Dec 1, 2024

Hi @ctb,

I’m exploring options to enhance the comprehensiveness of microbial profiling using sourmash gather/multigather. Specifically, I’m curious if it’s possible to use multiple databases simultaneously in a single command. The branchwater plugin documentation doesn’t mention this feature, but it would be invaluable for achieving a more holistic view of all microorganisms in a sample. This could also benefit users with limited storage or computational capacity who cannot retain all raw sequences for re-sketching later.

For example:

sourmash scripts fastmultigather ./manysketch.zip /sourmash_databases/*.zip -c 20 -k 51 -o /manysketch/sourmash_gather.csv

Additionally, I have a couple of feature suggestions:

  • Is there a way to merge multiple pre-sketched databases (e.g., GTDB + GenBank viral + GenBank protozoa + GenBank fungi) into a single large database? I followed the steps in "Error when use fastmultigather against rocksdb (Error: No such file or directory (os error 2) - Tested with multiple cases)" #381 to create a *.zip database by unzipping and merging, but this failed for some samples with similar hashes. I ended up with files that differ only by a numeric suffix, such as sig1.siz.gz, sig1.siz.gz_1, sig1.siz.gz_2, which complicates merging and management.
  • Would it be feasible to create a RocksDB that integrates multiple databases? This could streamline managing and querying large, multi-database sketches more efficiently.

Looking forward to hearing your thoughts on these ideas and any potential solutions! Thanks!

Collaborator

ctb commented Dec 1, 2024

hi @tnmquann, agree! Please take a look at:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/tree/main/doc#using-manifests-for-input-databases---why-and-when

In recent releases (v0.9.8 and beyond, with important bug fixes through v0.9.11, the current release) we added full support for standalone manifests, which should address most of your use cases above. You can now create a single manifest CSV with sourmash sig collect that points at multiple different files, including .sig.zip files. This would be the Right Way to combine multiple GenBank databases.

Please let us know how we can make this clearer in the documentation!

You can also use this to create a single RocksDB database from multiple inputs.

Unfortunately, at the moment there is no way to use multiple RocksDB databases as a search target without loading them all into memory (which defeats the purpose of them, yes). This is not something I think will improve in the next few releases.
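
As a rough illustration of the manifest workflow described above, the following Python sketch only builds the two command lines; it does not run them. The database paths are the ones that appear later in this thread. The output names `combined.mf.csv` and `combined.rocksdb`, the helper function names, and the `-F csv` and `-k` options are assumptions for illustration; check `sourmash sig collect --help` and `sourmash scripts index --help` for the options your installed version supports.

```python
import shlex


def collect_command(databases, manifest="combined.mf.csv"):
    """Build a `sourmash sig collect` command that writes one standalone manifest.

    Assumption: `-F csv` selects CSV manifest output and `-o` names the file.
    """
    return ["sourmash", "sig", "collect", *map(str, databases),
            "-o", manifest, "-F", "csv"]


def index_command(manifest, rocksdb="combined.rocksdb", ksize=31):
    """Build a `sourmash scripts index` command that reads the manifest.

    Assumption: `-o` names the RocksDB output and `-k` sets the k-mer size.
    """
    return ["sourmash", "scripts", "index", str(manifest),
            "-o", rocksdb, "-k", str(ksize)]


if __name__ == "__main__":
    # Paths copied from the pathlist posted later in this thread.
    dbs = [
        "/database/genbank2022_03/genbank-2022.03-fungi-k31.zip",
        "/database/genbank2022_03/genbank-2022.03-protozoa-k31.zip",
        "/database/genbank2022_03/genbank-2022.03-viral-k31.zip",
        "/database/gtdb-rs214/gtdb-rs214-reps.k31.zip",
    ]
    # Print shell-quoted commands for review before running them by hand.
    print(shlex.join(collect_command(dbs)))
    print(shlex.join(index_command("combined.mf.csv")))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; pass the lists to `subprocess.run(..., check=True)` once the options have been confirmed against your installation.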

Author

tnmquann commented Dec 4, 2024

Hi @ctb,

Sorry for the late response. I’ve been testing using a pathlist to create a general RocksDB. Everything is going smoothly except for a warning and a JSON load failure when using sourmash scripts index:

== This is sourmash version 4.8.11. ==  
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==  

ksize: 31 / scaled: None / moltype: DNA  
indexing all sketches in 'pathlist.txt'  
Loading sketches from pathlist.txt  
Reading analysis(s) from: 'pathlist.txt'  
FAILED to load as JSON files; falling back to general recursive  
Loaded 144633 analysis signature(s)  
Found 144633 sketches total.  
WARNING: loading all sketches into memory in order to index.  
See 'index' documentation for details.  

Additionally, I’ve noticed issues when using sourmash scripts fastmultigather with a txt file containing database locations. It works when the paths point to .zip or .sig.gz files but fails when pointing to multiple RocksDBs or when mixing formats in the list. (Maybe I misunderstood something in this guide?)
Below are the three configurations I tested:

  • Working (with the JSON fallback warning):

    /database/genbank2022_03/genbank-2022.03-fungi-k31.zip  
    /database/genbank2022_03/genbank-2022.03-protozoa-k31.zip  
    /database/genbank2022_03/genbank-2022.03-viral-k31.zip  
    /database/gtdb-rs214/gtdb-rs214-reps.k31.zip  
    
  • Not working:

    /database/gtdb-rs214/gtdb-rs214-reps.k31.rocksdb  
    /database/genbank2022_03/genbank-2022.03-viral-k31.rocksdb  
    /database/genbank2022_03/genbank-2022.03-protozoa-k31.rocksdb  
    /database/genbank2022_03/genbank-2022.03-fungi-k31.rocksdb  
    
  • Also not working:

    /database/gtdb-rs214/gtdb-rs214-reps.k31.rocksdb  
    /database/genbank2022_03/genbank-2022.03-fungi-k31.zip  
    /database/genbank2022_03/genbank-2022.03-protozoa-k31.zip  
    /database/genbank2022_03/genbank-2022.03-viral-k31.zip  
    

For now, loading all databases into memory is manageable as they don’t consume too much.
I’ll update if any other errors occur once the new RocksDB is fully indexed and I run it on the sample set. Thanks for your input!
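
The three configurations above can be checked before running a search with a small pre-flight script. This is a minimal sketch, not part of sourmash or the branchwater plugin: the function names are hypothetical, entries are classified by file extension only (an assumption), and the verdicts simply mirror what was observed in this thread rather than any documented behavior.

```python
from collections import Counter


def entry_kind(path: str) -> str:
    """Classify one pathlist entry by extension (a heuristic, not a format check)."""
    name = path.strip().rstrip("/")
    if name.endswith(".rocksdb"):
        return "rocksdb"
    if name.endswith(".zip"):
        return "zip"
    if name.endswith((".sig", ".sig.gz")):
        return "sig"
    return "other"


def check_pathlist(lines) -> str:
    """Map a pathlist onto the outcomes reported in this thread."""
    kinds = Counter(entry_kind(line) for line in lines if line.strip())
    total = sum(kinds.values())
    rocks = kinds["rocksdb"]
    if total > 0 and rocks == 0 and kinds["other"] == 0:
        return "reported working: only .zip / .sig.gz entries"
    if rocks > 1 and rocks == total:
        return "reported failing: multiple RocksDB entries"
    if 0 < rocks < total:
        return "reported failing: RocksDB mixed with other formats"
    return "not covered in this thread"


if __name__ == "__main__":
    # Second configuration from the list above (truncated to two entries).
    example = [
        "/database/gtdb-rs214/gtdb-rs214-reps.k31.rocksdb",
        "/database/genbank2022_03/genbank-2022.03-viral-k31.rocksdb",
    ]
    print(check_pathlist(example))
```

A pathlist holding a single RocksDB entry falls into the "not covered" case because that configuration was not tested here.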

Collaborator

ctb commented Dec 4, 2024

I'll take a look - thanks!
