Database I/O and conformer/fingerprint storage #36

aparente-nurix · 2019-08-15T01:35:07Z

I'd like to use e3fp fingerprints on a very large database of molecules (~millions, possibly billions).

I was wondering if you had any benchmarks on speed and conformer/fingerprint storage sizes. What's the largest dataset you've applied this to?

Thanks!

sethaxen · 2019-08-16T18:39:22Z

The most comprehensive benchmarks we've run with E3FP are in Table S3 and Figure S10 of the supplement of the paper. I've included them below:

The code that ran these benchmarks is here: https://github.com/keiserlab/e3fp-paper/tree/master/project/benchmark.

As you can see, we haven't rigorously benchmarked on more than 308,315 molecules (ChEMBL20). The runtime should scale linearly with database size. Note that when we scaled from 10,000 to 308,315 molecules, E3FP still took on average ~83 s and ~0.7 s per molecule for conformer generation and fingerprinting, respectively. While the runtime of fingerprinting scales sub-linearly with the number of heavy atoms, the runtime of conformer generation scales super-linearly with it. So if your database contains very large, flexible molecules (e.g. peptides), conformer generation for these will tend to take a long time and could tie up all of your processors.
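To get a feel for what linear scaling implies at larger sizes, here is a rough back-of-the-envelope sketch. It only multiplies the per-molecule averages quoted above by the database size. The function name, the 1,000-core figure, and the assumption of perfect parallel efficiency are mine for illustration and are not from the benchmarks.

```python
# Per-molecule averages quoted above (seconds); individual molecules vary widely.
CONFGEN_SECONDS = 83.0
FPRINT_SECONDS = 0.7


def estimate_wall_days(n_molecules, n_cores):
    """Estimate wall-clock days, assuming linear scaling and perfect parallelism.

    Both assumptions are optimistic: large, flexible molecules take far longer
    than the average, and real clusters rarely reach ideal efficiency.
    """
    cpu_seconds = n_molecules * (CONFGEN_SECONDS + FPRINT_SECONDS)
    return cpu_seconds / n_cores / 86_400  # 86,400 seconds per day


# Hypothetical cluster size, chosen only to make the numbers readable.
N_CORES = 1000
for n in (308_315, 1_000_000, 1_000_000_000):
    days = estimate_wall_days(n, N_CORES)
    print(f"{n:>13,} molecules on {N_CORES:,} cores: ~{days:,.1f} days")
```

Since conformer generation dominates the per-molecule cost in these averages, that is the step where a faster tool or a cutoff on molecule size would pay off most.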

sethaxen · 2019-08-16T19:44:50Z

Regarding storage sizes, I haven't run any benchmarks in this area. E3FP's default storage approach is described here. Since it's just a light wrapper around a scipy.sparse.csr_matrix, its performance will be limited by that format. On the databases we've used, we are able to hold the whole database in memory until fingerprinting is completed, at which point we write it to a file. I suspect a database with fingerprints of billions of molecules will exceed the memory of most machines, so a different storage option will probably be necessary, perhaps something like HDF5. I'm happy to take suggestions and pull requests in this area.
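In the absence of benchmarks, the in-memory size of a CSR matrix can at least be estimated from its three arrays (`data`, `indices`, `indptr`). The sketch below is only that arithmetic. The number of fingerprints per molecule, the on-bits per fingerprint, the 1024-bit width, and the 1-byte value size are placeholder assumptions, not measured E3FP figures, and the 32-bit vs 64-bit index rule is a simplification of what scipy does.

```python
INT32_MAX = 2**31 - 1


def csr_bytes(n_rows, n_cols, nnz_per_row, data_itemsize=1):
    """Approximate bytes held by the three CSR arrays (data, indices, indptr)."""
    nnz = n_rows * nnz_per_row
    # Simplifying assumption: 4-byte indices while everything fits in int32,
    # 8-byte indices otherwise.
    index_itemsize = 4 if max(nnz, n_cols) <= INT32_MAX else 8
    data = nnz * data_itemsize
    indices = nnz * index_itemsize
    indptr = (n_rows + 1) * index_itemsize
    return data + indices + indptr


# Hypothetical inputs: 3 fingerprints per molecule, ~100 on-bits each,
# folded to 1024 bits, 1-byte stored values.
for n_molecules in (308_315, 10**6, 10**9):
    size = csr_bytes(n_rows=3 * n_molecules, n_cols=1024, nnz_per_row=100)
    print(f"{n_molecules:>13,} molecules: ~{size / 1e9:,.2f} GB")
```

Under these placeholder numbers the index arrays, not the bit values, account for most of the footprint, and that share grows once the matrix is large enough to need 64-bit indices.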

mjke · 2019-08-16T19:56:50Z

great points. a couple thoughts:

- for conformer generation, if speed is a concern, you might consider commercial packages like omega; e3fp doesn't fundamentally rely on our particular choice of confgen tool.
- for more flexible storage formats, perhaps n5 or zarr
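Whichever container format ends up being used, the underlying idea is to write fingerprints in bounded-size batches rather than holding everything in memory. Here is a minimal sketch of that pattern using plain `scipy.sparse.save_npz` shards, swapped in for HDF5, n5, or zarr purely to keep the example dependency-light. It does not use e3fp's own database class. `random_fingerprints`, `write_shards`, and `iter_shards` are hypothetical helpers, and the random bit vectors stand in for real fingerprints.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def random_fingerprints(n_rows, n_bits, on_bits, rng):
    """Stand-in for real fingerprinting: random sparse bit vectors as a CSR matrix."""
    cols = np.concatenate(
        [rng.choice(n_bits, size=on_bits, replace=False) for _ in range(n_rows)]
    )
    rows = np.repeat(np.arange(n_rows), on_bits)
    data = np.ones(rows.size, dtype=np.uint8)
    return sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_bits))


def write_shards(batches, out_dir):
    """Write each batch to its own compressed .npz file, one batch in memory at a time."""
    paths = []
    for i, batch in enumerate(batches):
        path = os.path.join(out_dir, f"shard_{i:05d}.npz")
        sparse.save_npz(path, batch, compressed=True)
        paths.append(path)
    return paths


def iter_shards(paths):
    """Lazily load shards back, again one at a time."""
    for path in paths:
        yield sparse.load_npz(path)


rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as out_dir:
    # A generator, so batches are produced and written one by one.
    batches = (random_fingerprints(1000, 1024, 50, rng) for _ in range(4))
    paths = write_shards(batches, out_dir)
    total_rows = sum(shard.shape[0] for shard in iter_shards(paths))
    total_on_bits = sum(shard.nnz for shard in iter_shards(paths))
    print(f"{len(paths)} shards, {total_rows} rows, {total_on_bits} on-bits")
```

A chunked array store would add things this sketch lacks, such as random access to individual rows and appending without creating many small files, so treat it as an illustration of the batching pattern only.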

sethaxen added the question label Feb 25, 2020
