
Deleting model fits from cache in a simulation setting (where each model is given a different seed) #361

Open
coenvdm opened this issue Aug 5, 2022 · 6 comments

Comments

@coenvdm

coenvdm commented Aug 5, 2022

Describe the problem with the documentation

I am working with Stan models in a simulation study, using PyStan, where I fit the same model multiple times with different values for random_seed. I noticed that after fitting, the fit is saved to my cache folder under httpstan/4.4.2/models/"model_name"/fits/"fit_name".

The problem I run into is that my cache gets cluttered with these files, which I don't need. I have tried clearing the folders containing these files manually, but since I am using parallelization, I cannot simply delete entire folders while fits are still running.

Is there a way to delete fit files after I retrieve the posterior samples I want, or to keep Stan from saving these files in the first place?
I tried the delete_fit function from httpstan.cache, which requires you to specify an identifier for the model (e.g. model_name), which is easy to obtain, and an identifier for the fit (e.g. fit_name), which I am not sure how to obtain (there is a calculate_fit_name function in httpstan.fits, but I cannot get it to work). The documentation on how to use these functions (calculate_fit_name and delete_fit) is not clear to me.

Suggest a potential alternative/fix

Could you provide a use case on how to delete model fits from cache (in a setting where a new model is fitted within each iteration of a for-loop)?

@ahartikainen
Contributor

Tbh, I don't think we have a way to access the identifier, which means there is no good way to do this.

I don't remember if there is any way to turn off the caching.

Have you tried CmdStanPy or do you need logp / grad?

@coenvdm
Author

coenvdm commented Aug 5, 2022

I haven't tried CmdStanPy. Would you expect it to have a fix for this problem? Sorry, I am not sure what logp/grad is, could you explain?

@riddell-stan
Contributor

There's probably a fix for this. If you have access to a larger (ephemeral) disk, you can set your user cache directory so it uses this disk. I think the environment variable is XDG_CACHE_HOME.
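For example, on Linux this could look like the following minimal sketch (the scratch path here is an assumption; substitute your own mount point):

```shell
# Point the user cache at a larger (ephemeral) disk before running PyStan.
# /tmp/pystan-scratch-cache is an example path, not a real default.
export XDG_CACHE_HOME=/tmp/pystan-scratch-cache
mkdir -p "$XDG_CACHE_HOME"
echo "cache root: $XDG_CACHE_HOME"
```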

You could also create some kind of cron job or run another helper script in the background that deletes things.
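Such a helper might look like the sketch below. It assumes the cache layout reported in this thread (httpstan/&lt;version&gt;/models/&lt;model_name&gt;/fits/&lt;fit_name&gt; under XDG_CACHE_HOME, falling back to ~/.cache); the function name and the age threshold are made up for illustration:

```python
import os
import time
from pathlib import Path

# Cache layout reported in this thread:
#   <cache root>/httpstan/<version>/models/<model_name>/fits/<fit_name>
# On Linux the cache root is $XDG_CACHE_HOME, falling back to ~/.cache.
CACHE_ROOT = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "httpstan"

def delete_old_fits(root: Path = CACHE_ROOT, max_age_seconds: float = 3600.0) -> int:
    """Delete cached fit files older than max_age_seconds; return how many were removed."""
    removed = 0
    now = time.time()
    for fit_file in root.glob("*/models/*/fits/*"):
        if fit_file.is_file() and now - fit_file.stat().st_mtime > max_age_seconds:
            fit_file.unlink()
            removed += 1
    return removed
```

Run periodically (e.g. from cron), this would keep the fits directory from growing without touching fits that a parallel worker may still be writing.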

@eboileau

eboileau commented Nov 3, 2022

Hi, I'm also having this problem, even without changing the seed. To test, I use a simple regression model, built and pickled with random data in one script; in a second script I load the pickled model, build it again with new data, and sample from it, e.g.

```python
import pickle

import numpy as np
import stan

data = {...}  # new data for the regression model

# Load the previously pickled model and rebuild it with the new data.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

posterior = stan.build(model.program_code, data=data, random_seed=101)
fit = posterior.sample(num_chains=4, num_samples=1000, num_warmup=500, num_thin=1)
```

Each time I run this script (assuming pystan finds the cached model; otherwise it has to build it again from scratch), pystan actually writes num_chains files to the cache (under fits, one for each chain)... so you can imagine how quickly hundreds of files accumulate...

Having an option to NOT cache the fits, i.e. keep Stan from saving these files, would be great...
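In the meantime, a possible workaround, sketched under the assumption that the cache lives in the default location described above and that no fit is being written concurrently, is to wipe only the fits subdirectories between runs; this keeps the compiled models cached so they are not rebuilt. The function name here is hypothetical:

```python
import shutil
from pathlib import Path

def clear_cached_fits(cache_root: Path) -> int:
    """Remove the fits/ subdirectory of every cached model under cache_root;
    return how many fits directories were removed. Compiled models are kept."""
    removed = 0
    for fits_dir in cache_root.glob("*/models/*/fits"):
        shutil.rmtree(fits_dir, ignore_errors=True)
        removed += 1
    return removed

# Example call (the path is an assumption -- the Linux default reported above):
# import os
# clear_cached_fits(Path(os.environ.get("XDG_CACHE_HOME",
#                                       Path.home() / ".cache")) / "httpstan")
```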

@riddell-stan
Contributor

There is a change I would welcome here: evict/delete old cached fits if the cache grows beyond a certain limit. In short, intelligently manage the cache.

It's difficult to come up with a robust caching policy. For this reason, we haven't made adding this feature a priority.

@eboileau

eboileau commented Nov 3, 2022

@riddell-stan Thanks for your quick reply.

I don't know how difficult that would be, but an ideal solution would be an option on sample, e.g.

```python
fit = posterior.sample(..., cache=False)
```

but any improvement along the lines you mention is obviously welcome.
