Database I/O and conformer/fingerprint storage #36

Open
aparente-nurix opened this issue Aug 15, 2019 · 3 comments

@aparente-nurix

I'd like to use e3fp fingerprints on a very large database of molecules (~millions, possibly billions).

I was wondering if you had any benchmarks on speed and conformer/fingerprint storage sizes. What's the largest dataset you've applied this to?

Thanks!

@sethaxen
Collaborator

sethaxen commented Aug 16, 2019

The most comprehensive benchmarks we've run with E3FP are in Table S3 and Figure S10 of the supplement of the paper. I've included them below:

[Images: screenshots of Table S3 and Figure S10 from the supplement]

The code that ran these benchmarks is here: https://github.com/keiserlab/e3fp-paper/tree/master/project/benchmark.

As you can see, we haven't rigorously benchmarked on more than 308,315 molecules (ChEMBL20). The runtime should scale linearly with database size: when we scaled from 10,000 to 308,315 molecules, E3FP still took on average ~83 s and ~0.7 s per molecule for conformer generation and fingerprinting, respectively. While the runtime of fingerprinting scales sub-linearly with the number of heavy atoms, conformer generation scales super-linearly with it, so if your database contains very large, flexible molecules (e.g. peptides), conformer generation for these will tend to take a long time and could tie up all of your processors.
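As a back-of-the-envelope sketch of what linear scaling would imply, the per-molecule averages quoted above can be extrapolated to larger databases. The function name, the 1000-core figure, and the assumptions of perfectly linear scaling and ideal parallelism are illustrative only; real runtimes will depend heavily on molecule size and flexibility.

```python
# Per-molecule averages quoted above, treated here as CPU-seconds (an assumption).
CONFGEN_SECONDS = 83.0
FINGERPRINT_SECONDS = 0.7


def estimate_wall_hours(num_molecules, num_cores=1):
    """Extrapolate wall-clock hours for a database of num_molecules.

    Assumes runtime is linear in database size and that work divides
    evenly across cores; both are simplifying assumptions.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    total_seconds = num_molecules * (CONFGEN_SECONDS + FINGERPRINT_SECONDS)
    return total_seconds / num_cores / 3600.0


if __name__ == "__main__":
    # 308,315 is the ChEMBL20 size mentioned above; the larger sizes are hypothetical.
    for n in (308_315, 10_000_000, 1_000_000_000):
        hours = estimate_wall_hours(n, num_cores=1000)
        print(f"{n:>13,} molecules on 1000 cores: ~{hours:,.0f} hours")
```

Because conformer generation dominates the per-molecule cost, any estimate like this is mostly an estimate of conformer generation time.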

@sethaxen
Collaborator

Regarding storage sizes, I haven't run any benchmarks in this area. E3FP's default storage approach is described here. Since it's just a light wrapper around a scipy.sparse.csr_matrix, its performance will be limited by that format. On the databases we've used, we were able to hold the whole database in memory until fingerprinting completed, and then write it to a file. I suspect a database with fingerprints for billions of molecules will exceed the memory of most machines, so a different storage option will probably be necessary, perhaps something like HDF5. I'm happy to take suggestions and pull requests in this area.
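To get a rough feel for when an in-memory CSR matrix stops fitting, its footprint can be estimated from the three arrays the format stores: one value and one column index per non-zero entry, plus one row offset per row (and one extra). The sketch below is only an estimate under stated assumptions: the mean number of on-bits per fingerprint is a hypothetical placeholder rather than a measured E3FP value, one byte per stored value is assumed, and the helper name `csr_bytes` is invented for illustration.

```python
def csr_bytes(num_rows, nnz, data_itemsize=1, index_itemsize=4):
    """Approximate size in bytes of the three arrays that make up a CSR matrix."""
    data = nnz * data_itemsize                 # one stored value per non-zero
    indices = nnz * index_itemsize             # one column index per non-zero
    indptr = (num_rows + 1) * index_itemsize   # one row offset per row, plus one
    return data + indices + indptr


if __name__ == "__main__":
    MEAN_ON_BITS = 100  # hypothetical placeholder, not a measured E3FP value
    for num_mols in (308_315, 10_000_000, 1_000_000_000):
        nnz = num_mols * MEAN_ON_BITS
        # 32-bit offsets cannot count past 2**31 - 1 entries, so assume 64-bit beyond that.
        index_itemsize = 4 if nnz < 2**31 else 8
        gib = csr_bytes(num_mols, nnz, index_itemsize=index_itemsize) / 2**30
        print(f"{num_mols:>13,} fingerprints: ~{gib:,.1f} GiB")
```

Under these assumptions the index arrays, not the stored values, account for most of the memory, so the result is most sensitive to the number of on-bits per fingerprint.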

@mjke
Member

mjke commented Aug 16, 2019

great points. a couple thoughts:

  • for conformer generation, if speed is a concern, you might consider commercial packages like omega; e3fp doesn't fundamentally rely on our particular choice of confgen tool.
  • for more flexible storage formats, perhaps n5 or zarr
