eukaryotic genome databases now available! #3504

Open
4 tasks
ctb opened this issue Jan 22, 2025 · 0 comments
Labels
fyi Information that is interesting or useful

Eukaryotic genome databases are now available on farm as well as for download 🎉

These databases contain (almost) every reference genome under NCBI taxonomy node 2759 - a total of 19,215 references.

TODO:

  • rename them with date
  • add to databases page
  • investigate and summarize missing tax
  • investigate and summarize missing genomes

files

location: /group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01/

The .sig.zip databases are here:

-rw-rw-r-- 1 ctbrown datalabgrp 4.0G Jan 21 11:03 vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.7G Jan 21 09:05 bilateria-minus-vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.3G Jan 21 08:39 plants.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 165M Jan 21 08:03 fungi.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  81M Jan 21 08:06 metazoa-minus-bilateria.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  56M Jan 21 08:08 eukaryotes-other.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 3.9M Jan 21 06:57 eukaryotes.lineages.csv

they are available for download here:

bilateria-minus-vertebrates.k51.sig.zip
eukaryotes-other.k51.sig.zip
eukaryotes.lineages.csv
fungi.k51.sig.zip
metazoa-minus-bilateria.k51.sig.zip
plants.k51.sig.zip
vertebrates.k51.sig.zip

The full k=21/31/51 and scaled=1000 files are not available for download except by special request; on farm, they are available at /group/ctbrowngrp5/2025-genbank-eukaryotes. I would not suggest using them 😅 as they are quite large.

RocksDB index of GTDB+Eukaryotes, for fast search with fastmultigather and manysearch

There is a k=51 scaled=10_000 RocksDB index suitable for fastmultigather and manysearch here:

/group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.rocksdb

and an associated pair of lineages files (NCBI Eukaryotes + GTDB Bacteria/Archaea) here:

/group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.csv
/group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb

so you can do

sourmash scripts fastmultigather QUERY.sig.zip \
   /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.rocksdb \
  -k 51 -s 10_000 -t 0 \
  -o QUERY.x.entire.csv

sourmash tax metagenome -g QUERY.x.entire.csv \
   -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb

to get a comprehensive breakdown of your sample.
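Before running the taxonomy step, it can help to check that every match in the gather output has a lineage entry. The following is a minimal sketch, not part of sourmash: it assumes the lineage CSV has an `ident` column, that the gather CSV stores match names in a column called `match_name`, and that the identifier is the first whitespace-separated token of that name. The function name, the column names, and the accessions in the toy data are all illustrative; adjust them to your actual files.

```python
import csv
import io


def idents_missing_lineage(gather_rows, lineage_rows, name_column="match_name"):
    """Return match identifiers that have no row in the lineage table.

    Assumptions (not confirmed by the issue): the lineage table has an
    'ident' column, and the identifier is the first whitespace-separated
    token of the match name.
    """
    known = {row["ident"] for row in lineage_rows}
    missing = []
    for row in gather_rows:
        ident = row[name_column].split(" ", 1)[0]
        if ident not in known and ident not in missing:
            missing.append(ident)
    return missing


# Tiny made-up tables standing in for QUERY.x.entire.csv and a lineages CSV.
gather_csv = io.StringIO(
    "query_name,match_name\n"
    "sample1,GCA_000000001.1 Hypothetical organism A\n"
    "sample1,GCA_000000002.1 Hypothetical organism B\n"
)
lineage_csv = io.StringIO(
    "ident,superkingdom\n"
    "GCA_000000001.1,Eukaryota\n"
)

missing = idents_missing_lineage(
    csv.DictReader(gather_csv), csv.DictReader(lineage_csv)
)
print(missing)
```

With real files, replace the `io.StringIO` objects with `open(...)` handles on the gather CSV and the `.lineages.csv` file.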

missing genomes

There are 26 reference genomes missing; they don't seem to be available on GenBank. The current list is in this file:

/home/ctbrown/scratch3/2025-sourmash-eukaryotic-databases/collections/eukaryotes-missing.links.csv

build repos and scripts

For sketching, I used the code in ctb/2025-ncbi-rest-api#1 to get a list of all euk genomes (in various subsets) and sketch them with directsketch at k=21, k=31, and k=51, with scaled=1000.

The code at https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases was used to take those sketches and build comprehensive subsets plus a lineage CSV at k=51 and scaled=10_000. It would be easy enough to add k=21 and k=31.

The RocksDB index (k=51, scaled=10_000) was built using the scripts here: https://github.com/ctb/2025-make-rocksdb-entire/
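The sketches were made at scaled=1000 while the subsets and index use scaled=10_000, so the smaller files are presumably a downsampled view of the same hashes. Here is a rough illustration of how that kind of downsampling works, assuming the usual FracMinHash rule that a 64-bit hash is kept when it falls below 2^64 / scaled. The helper names are hypothetical and this is not sourmash's own implementation.

```python
import random

HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def max_hash_for_scaled(scaled):
    """Cutoff below which a hash is retained at the given scaled value."""
    return HASH_SPACE // scaled


def downsample(hashes, new_scaled):
    """Keep only the hashes that would also be retained at new_scaled."""
    cutoff = max_hash_for_scaled(new_scaled)
    return {h for h in hashes if h < cutoff}


# Simulate a scaled=1000 sketch: random hashes below the scaled=1000 cutoff.
rng = random.Random(0)
full = {rng.randrange(max_hash_for_scaled(1000)) for _ in range(10_000)}

# Downsampling to scaled=10_000 keeps roughly one hash in ten.
kept = downsample(full, 10_000)
print(len(full), len(kept))
```

Because the retained hashes are a subset of the original ones, a scaled=1000 sketch can always be reduced to scaled=10_000, but not the other way around.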

content summary

a summary of the manifest going into the RocksDB:

** loading from 'entire-2025-01-21.mf.csv'
path filetype: StandaloneManifestIndex
location: entire-2025-01-21.mf.csv
is database? yes
has manifest? yes
num signatures: 616184
** examining manifest...
total hashes: 3158566951
summary of sketches:
   19556 sketches with DNA, k=51, scaled=10000        1102231815 total hashes
   596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes

where the 19,556 sketches are the eukaryotic ones, and the 596,628 are from GTDB.
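The two rows add up as expected: 19,556 + 596,628 = 616,184 signatures, and 1,102,231,815 + 2,056,335,136 = 3,158,566,951 hashes. A similar per-group tally can be sketched directly from a manifest CSV. This is a minimal example that assumes the manifest has `moltype`, `ksize`, `scaled`, and `n_hashes` columns and may begin with a `#` comment line; the function name and the toy rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_manifest(lines):
    """Count sketches and total hashes per (moltype, ksize, scaled) group.

    Assumption: the manifest is a CSV with 'moltype', 'ksize', 'scaled'
    and 'n_hashes' columns, optionally preceded by '#' comment lines.
    """
    rows = csv.DictReader(line for line in lines if not line.startswith("#"))
    sketches = Counter()
    hashes = Counter()
    for row in rows:
        key = (row["moltype"], int(row["ksize"]), int(row["scaled"]))
        sketches[key] += 1
        hashes[key] += int(row["n_hashes"])
    return sketches, hashes


# Toy manifest with invented rows, standing in for entire-2025-01-21.mf.csv.
example = io.StringIO(
    "# SOURMASH-MANIFEST-VERSION: 1.0\n"
    "md5,ksize,moltype,scaled,n_hashes,name\n"
    "aaa,51,DNA,10000,120,euk genome 1\n"
    "bbb,51,DNA,10000,80,euk genome 2\n"
    "ccc,51,DNA,1000,500,bacterial genome 1\n"
)

sketches, hashes = tally_manifest(example)
for key in sorted(sketches):
    print(key, sketches[key], "sketches,", hashes[key], "total hashes")
```

For a real manifest, pass an open file handle instead of the `io.StringIO` object.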
