Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Template cache searches are too slow #302

Open
mark-mackey-cresset opened this issue Sep 12, 2023 · 7 comments
Open

Template cache searches are too slow #302

mark-mackey-cresset opened this issue Sep 12, 2023 · 7 comments

Comments

@mark-mackey-cresset
Copy link

Currently, when a molecule is passed to a template generator it's looked for in the cache (if one is provided). The algorithm for doing this involves creating a Molecule object for each SMILES string in the cache, and then comparing its topology to the molecule being parameterised. In our benchmarks, this means that only around 10 or so cache entries are checked per second, as Molecule creation is expensive. If the cache has been in use for a while, it may well have several thousand entries in it, which means that checking the cache can take a lot more time than just doing the parameterisation (even including the AM1-BCC compute time).

Given that this is just an identity check, I can't see any reason why at least the initial check on the cache isn't just a SMILES string comparison, which would speed things up by several orders of magnitude.

@jchodera
Copy link
Member

This is a great idea! Apologies it's taken us so long to address.

@mikemhenry : Are you able to take a stab at this?

@mark-mackey-cresset
Copy link
Author

The only possible fly in the ointment is going to be choosing whose unique SMILES string to use. I imagine most users are either using RDkit or OE exclusively, but if anyone's switching you could end up with cache keys that are a mixture of SMILES from the two, which is unlikely to be useful :). I think you'll have to store which toolkit a given key was generated with, and for entries that don't have a key for the current toolkit fall back to the current slow compare-the-Molecule-object code. Alternatively use InCHI for the keys, but that's might require brining in an extra dependency?

@jchodera
Copy link
Member

jchodera commented Oct 26, 2023

Great point.

We should use the OpenFF Molecule.to_smiles() to generate the SMILES string for lookup, which will produce a canonical isomeric SMILES representation that we can cache internally and reproducibly---at least as long as they are still using the same toolkit backend.

Even if someone ends up using the same cache with different toolkit backends, at worst, you'll get a separate copy of each molecule for each toolkit backend, both both should be correct, which will only slow things down a bit.

@mikemhenry mikemhenry self-assigned this Oct 31, 2023
@mikemhenry
Copy link
Collaborator

Can do, I think using Molecule.to_smiles() does make the most sense for the key. I've ran into issue juggling different chem informatic toolkits so I think sticking to OpenFF makes the most sense.

@mikemhenry
Copy link
Collaborator

Just wanted to add that this hasn't fallen off my radar, I am working on a benchmark script so I can measure the speed-up

@mattwthompson
Copy link
Collaborator

You might want to consider mapped SMILES or even CMILES as the hash, especially if you end up using @functools.lru_cache. We recently ran into a critical-yet-esoteric bug that mapped SMILES would have safeguarded against.

@jchodera
Copy link
Member

Great suggestion, @mattwthompson !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants