** loading from 'entire-2025-01-21.mf.csv'
path filetype: StandaloneManifestIndex
location: entire-2025-01-21.mf.csv
is database? yes
has manifest? yes
num signatures: 616184
** examining manifest...
total hashes: 3158566951
summary of sketches:
19556 sketches with DNA, k=51, scaled=10000 1102231815 total hashes
596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes
Eukaryotic genome databases are now available on farm as well as for download 🎉.
These databases contain (almost) every reference genome under NCBI taxonomy node 2759 - a total of 19,215 references.
TODO:
files
location:
/group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01/
The .sig.zip databases are here:
they are available for download here:
bilateria-minus-vertebrates.k51.sig.zip
eukaryotes-other.k51.sig.zip
eukaryotes.lineages.csv
fungi.k51.sig.zip
metazoa-minus-bilateria.k51.sig.zip
plants.k51.sig.zip
vertebrates.k51.sig.zip
The full k=21/31/51 and scaled=1000 files are not available for download except by special request; on farm, they are available at /group/ctbrowngrp5/2025-genbank-eukaryotes. I would not suggest using them 😅 as they are quite large.
RocksDB index of GTDB+Eukaryotes, for fast search with fastmultigather and manysearch
There is a k=51 scaled=10_000 RocksDB index suitable for fastmultigather and manysearch here:
/group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.rocksdb
and an associated pair of lineages files (NCBI Eukaryotes + GTDB Bacteria/Archaea) here:
so you can do
to get a comprehensive breakdown of your sample.
missing genomes
There are 26 reference genomes missing; they don't seem to be available on GenBank. The current list is in this file:
/home/ctbrown/scratch3/2025-sourmash-eukaryotic-databases/collections/eukaryotes-missing.links.csv
build repos and scripts
For sketching, I used the code in ctb/2025-ncbi-rest-api#1 to get a list of all euk genomes (in various subsets) and sketch them with directsketch at k=21, k=31, and k=51, with scaled=1000.
The code at https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases was then used to take those sketches and build comprehensive subsets + a lineage CSV at k=51 and scaled=10_000. It would be easy enough to add k=21 and k=31.
The RocksDB index (k=51, scaled=10_000) was built using the scripts here: https://github.com/ctb/2025-make-rocksdb-entire/
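Going from scaled=1000 sketches to scaled=10_000 databases presumably involves downsampling, which for FracMinHash-style sketches just means keeping the hashes that fall below a smaller cutoff. The sketch below is a toy model of that idea, not the code from the repositories above: `toy_hash`, `sketch`, and `downsample` are hypothetical names, SHA-256 stands in for the real hash function, and the exact cutoff rounding used by sourmash may differ.

```python
# Illustrative only: a toy model of FracMinHash downsampling, not sourmash's code.
import hashlib

MAX_HASH_SPACE = 2**64  # hashes are treated as 64-bit unsigned integers


def toy_hash(kmer: str) -> int:
    """Hypothetical stand-in hash: first 8 bytes of SHA-256 as an integer."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def sketch(kmers, scaled: int) -> set[int]:
    """Keep roughly 1/scaled of all distinct hashes: those below the cutoff."""
    cutoff = MAX_HASH_SPACE // scaled
    return {h for h in map(toy_hash, kmers) if h < cutoff}


def downsample(hashes: set[int], new_scaled: int) -> set[int]:
    """Re-threshold an existing sketch at a larger scaled value."""
    cutoff = MAX_HASH_SPACE // new_scaled
    return {h for h in hashes if h < cutoff}


# Fake "k-mers": any distinct strings will do for the illustration.
kmers = [f"kmer-{i}" for i in range(200_000)]
fine = sketch(kmers, scaled=1000)
coarse = downsample(fine, new_scaled=10_000)

# Downsampling gives the same result as sketching directly at scaled=10_000,
# so the larger scaled=1000 sketches never need to be recomputed from genomes.
assert coarse == sketch(kmers, scaled=10_000)
print(len(fine), len(coarse))
```

Because the coarse sketch is always a subset of the fine one, a collection sketched once at scaled=1000 can serve any larger scaled value, which is why only the scaled=1000 files need to be kept as the "full" versions.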
content summary
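A summary of the manifest going into the RocksDB index: it holds 616,184 signatures in total, of which the 19,556 sketches (k=51, scaled=10000) are the eukaryotic ones and the 596,628 sketches (k=51, scaled=1000, with abundance) are from GTDB.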
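As a quick sanity check, the per-category counts in the manifest summary can be added up and compared against the reported totals. The snippet below is only an illustration: it copies the two "summary of sketches" lines and the two totals from the output in this issue, and the regular expression and variable names are made up for the example.

```python
import re

# The two "summary of sketches" lines from the manifest summary, copied verbatim.
summary_lines = """\
19556 sketches with DNA, k=51, scaled=10000 1102231815 total hashes
596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes
"""

# Totals reported in the same output ("num signatures" and "total hashes").
reported_signatures = 616184
reported_hashes = 3158566951

# Capture: sketch count, free-text description, hash count.
pattern = re.compile(r"^(\d+) sketches with (.+?)\s+(\d+) total hashes$")

n_sketches = 0
n_hashes = 0
for line in summary_lines.splitlines():
    match = pattern.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected summary line: {line!r}")
    count, description, hashes = match.groups()
    n_sketches += int(count)
    n_hashes += int(hashes)
    # Average sketch size per category, for a rough feel of the data.
    print(f"{description}: {int(hashes) / int(count):.0f} hashes per sketch")

# Both categories together should account for everything in the manifest.
assert n_sketches == reported_signatures
assert n_hashes == reported_hashes
```

Both assertions pass with the numbers above, so the eukaryotic and GTDB sketches together account for every signature and hash in the manifest.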