Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pystan3 or CmdStanPy? #128

Closed
eboileau opened this issue Nov 4, 2022 · 3 comments
Closed

Pystan3 or CmdStanPy? #128

eboileau opened this issue Nov 4, 2022 · 3 comments
Assignees

Comments

@eboileau
Copy link
Contributor

eboileau commented Nov 4, 2022

PyStan 2 is not supported, we need to move to PyStan 3, but backwards-incompatible changes were introduced in PyStan 3.

PyStan 3

  • Data and random seed are provided earlier, to the build method. This means that we cannot compile/pickle models during installation, and use them later e.g. to estimate metagene and orf profiles Bayes factors, as we previously did. We could in principle compile them with random data at install (to fill the httpstan cache), but would still need to call build during fitting. Each call using new data doesn't recompile the model (see below re caching), however in practice it's not clear how much overhead this creates for hundreds of call. See e.g. here, here, etc.

  • PyStan 3 has automatic caching of compiled Stan models. Subsequent calls to build rely on cached models, so that pickling/loading doesn't really make sense anymore. Depending on cache management, however, we would need to make sure models still exists before fitting i.e before every run (it wouldn't be sufficient to compile them at install), and if we call build multiple times in parallel, can PyStan reliably find models?

  • PyStan 3 has automatic caching of samples. PyStan currently writes num_chains files to cache... There are some discussion about cache management, see here, but so far no option to avoid caching of samples. This could be problematic.

  • Microsoft Windows is definitely not supported in PyStan 3.

CmdStanPy

Contrary to PyStan (interfaces with the Stan C++ library directly in memory), CmdStanPy is a wrapper around CmdStan that communicates via file system. Installation is more involved: in addition to Python3, it requires CmdStan and a C++ toolchain. I did a conda install into an existing environment, and this did NOT install required dependencies i.e. I still had to install CmdStan. The default install location is a hidden directory under $HOME. However, after creating a fresh environment conda create -n cmdstan -c conda-forge cmdstanpy, this worked out as expected... this needs to be tested again, but in principle Rp-Bp environment installation via conda with CmdStanPy would work. A DIY, GitHub or PyPI installation would obviously require more work.

  • A first call to CmdStanModel allows model instantiation (we could pass a compile=force option), if the executable has a newer timestamp, the model is not recompiled. In all cases, pickling/loading models does not make sense anymore, and we have to call CmdStanModel to instantiate a model. This does not allow, as far as I can see, to write hpp/exe files to another directory of choice, so
    we need to determine how to handle this (previously models were under rp-bp/rpbp_models, and compiled to an operating system-specific location e.g. user_data_dir from the appdirs package).

  • In general, we could think how to compile models as part of the bioconda package build process (see macos ci #126), but model instantiation/compilation appears to be much quicker here than with PyStan 2.

  • We can access the CmdStanMCMC object, and extract sampler outputs. But the sampler output files are also written to a temporary
    directory (deleted upon session exit unless the output_dir argument is specified).

  • There are more parallelization option via multi-threaded processing and cross-chain multi-threading. Previously we used n_jobs=1.

Stan

I just noticed that...

Deprecation of lp__
As of Stan version 2.0, the use of lp__ is deprecated. It will be removed altogether
in a future release, but until that time, programs still work with lp__. A deprecation
warning is now printed every time lp__ is used

If this is really the case, I don't know what we do... we use it to calculate the Bayes factor.


Changing to CmdStanPy would require some changes. Although tests fail (test-all.sh), CmdStanPy works fine using some toy examples, but before making a silly decision, I want to try to have a minimal set-up to run e.g. metagene profile periodicity estimation, etc.
but currently one model fails to compile...

@eboileau eboileau self-assigned this Nov 4, 2022
@eboileau
Copy link
Contributor Author

eboileau commented Nov 9, 2022

Ok, everything is changed to CmdStanPy API. I also took this opportunity to trim down the output, in particular "periodic-offsets.csv.gz" (offset with the best bayes factor mean were not used anymore), and fitted parameters from "bayes-factors.bed.gz".

Besides, I also noticed that the bayes_factor_var was ill-defined for the metagene profile estimates, I don't know what impact this might have had in the past...? It is used to parameterizes the normal used to test the likelihood that BF > min_bf_likelihood.

A few remarks:

  • We compile using cpp_options = {'STAN_THREADS': 'TRUE'} by default, and this can be changed via argument, though this might not be ideal, we can easily change that later. With Conda, what we should probably do is to distribute the compiled models as executables (if we use exe_file instead if stan_file to instantiate models w/o compilation, maybe we can ignore the cpp_options, and I guess this overrides compile=True, but we need to check the docs and test if this works.)
  • The models (Stan files) provided with the package (under rp-bp/rpbp_models) are copied to models_base (user_data_dir from the appdirs package) and compiled there, as before, or under os.environ["CONDA_PREFIX"]/share if the package is installed in a Conda environment. The later is obviously much better, as it allows to separate the compiled models from different installations. Even if we distribute the compiled models, we could keep the same structure here.
  • Do we need to check Stan fit diagnoses? As we still use with suppress_stdout_stderr(), I don't know if this makes sense, and how much overhead this would create?

@CDieterich
Copy link

Dear @eboileau - could you please explain a little better what bayes_factor_var was ill-definded means ?
In terms of consequences, one could comapare an old and a new release of rp-bp on the same data, which I would highly recommend in any case.

@eboileau
Copy link
Contributor Author

The variance should be $\sigma_p + \sigma_n$, but had apparently always been coded as $\sigma_p + \sigma_p$. We don't filter directly on the variance, but it is used to fit a normal distribution for the likelihood of each model, then we filter as usual using $P(\log BF_Y > 5) > 0.5$.

I've already made several test runs, but while results differ, it is difficult to pinpoint what exactly has the greatest effect, since the current or "new" rp-bp relies on latest Stan and CmdStanPy ( vs. older Stan and PyStan 2), latest versions of scipy, numpy, etc. Although deterministic estimates (incl. $\chi^2$) are always equal, variance estimates vary a lot when there is little data to support inference ( e.g. profiles deemed periodic, but in fact with a few reads only ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants