Pystan3 or CmdStanPy? #128
Comments
OK, everything has been changed to the CmdStanPy API. I also took this opportunity to trim down the output, in particular "periodic-offsets.csv.gz" (the offsets with the best Bayes factor mean were no longer used), and the fitted parameters from "bayes-factors.bed.gz". Besides, I also noticed that the A few remarks:
Dear @eboileau - could you please explain a little better what |
The variance should be I've already made several test runs; the results differ, but it is difficult to pinpoint what exactly has the greatest effect, since the current or "new" rp-bp relies on the latest Stan and CmdStanPy (vs. older Stan and PyStan 2), the latest versions of scipy, numpy, etc. Although deterministic estimates (incl. |
PyStan 2 is no longer supported, so we need to move to PyStan 3, but PyStan 3 introduced backwards-incompatible changes.
PyStan 3
Data and random seed are provided earlier, to the `build` method. This means that we can no longer compile/pickle models during installation and use them later, e.g. to estimate metagene and ORF profile Bayes factors, as we previously did. We could in principle compile them with random data at install (to fill the httpstan cache), but we would still need to call `build` during fitting. A call with new data doesn't recompile the model (see below re caching), but in practice it's not clear how much overhead this creates over hundreds of calls. See e.g. here, here, etc.
PyStan 3 has automatic caching of compiled Stan models. Subsequent calls to `build` rely on cached models, so pickling/loading doesn't really make sense anymore. Depending on cache management, however, we would need to make sure the models still exist before fitting, i.e. before every run (it wouldn't be sufficient to compile them at install). And if we call `build` multiple times in parallel, can PyStan reliably find the models?
PyStan 3 has automatic caching of samples. PyStan currently writes `num_chains` files to cache... There is some discussion about cache management, see here, but so far there is no option to avoid caching of samples. This could be problematic.
Microsoft Windows is definitely not supported in PyStan 3.
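To make the point about `build` concrete, here is a minimal sketch of the PyStan 3 call order: data and seed go into `stan.build`, so the build step has to be repeated for every new dataset. The toy model and the `make_data` and `fit_once` helpers are hypothetical and not Rp-Bp code; `stan` is imported lazily so the helper definitions can be loaded without PyStan installed.

```python
# Toy Stan program (hypothetical, not one of the Rp-Bp models).
PROGRAM_CODE = """
data { int<lower=0> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"""


def make_data(y):
    """Build the data dictionary expected by the toy program."""
    values = [float(v) for v in y]
    return {"N": len(values), "y": values}


def fit_once(y, seed=8675309, num_chains=2, num_samples=500):
    """Build and sample in one go; with PyStan 3 both steps need the data."""
    import stan  # PyStan 3, imported lazily (assumed to be installed)

    # Data and seed are passed to build(), not to sample().
    posterior = stan.build(PROGRAM_CODE, data=make_data(y), random_seed=seed)
    fit = posterior.sample(num_chains=num_chains, num_samples=num_samples)
    return fit["mu"]  # draws for the parameter mu
```

Calling `fit_once` in a loop over hundreds of profiles would call `stan.build` each time, which is exactly the overhead in question.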
CmdStanPy
Contrary to PyStan (which interfaces with the Stan C++ library directly in memory), CmdStanPy is a wrapper around CmdStan that communicates via the file system. Installation is more involved: in addition to Python 3, it requires CmdStan and a C++ toolchain. I did a `conda install` into an existing environment, and this did NOT install the required dependencies, i.e. I still had to install CmdStan. The default install location is a hidden directory under `$HOME`. However, after creating a fresh environment with `conda create -n cmdstan -c conda-forge cmdstanpy`, this worked out as expected... This needs to be tested again, but in principle installing the Rp-Bp environment via conda with CmdStanPy would work. A DIY, GitHub or PyPI installation would obviously require more work.
A first call to `CmdStanModel` instantiates the model (we could pass a `compile=force` option); if the executable has a newer timestamp than the Stan file, the model is not recompiled. In all cases, pickling/loading models does not make sense anymore, and we have to call `CmdStanModel` to instantiate a model. As far as I can see, this does not allow writing the hpp/exe files to another directory of choice, so we need to determine how to handle this (previously models were under rp-bp/rpbp_models, and compiled to an operating system-specific location, e.g. `user_data_dir` from the appdirs package).
In general, we could think about how to compile models as part of the bioconda package build process (see macos ci #126), but model instantiation/compilation appears to be much quicker here than with PyStan 2.
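One possible way to handle the output location, sketched under the assumption that CmdStanPy writes the compiled files next to the `.stan` file: copy the `.stan` file to a writable directory first, then instantiate the model from that copy. The helpers `stage_stan_file` and `load_model` are hypothetical names, and `models_dir` stands in for whatever location is chosen (e.g. the one returned by `user_data_dir`).

```python
import shutil
from pathlib import Path


def stage_stan_file(src, models_dir):
    """Copy a .stan file into models_dir if the copy is missing or outdated."""
    src = Path(src)
    dest = Path(models_dir) / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists() or dest.stat().st_mtime < src.stat().st_mtime:
        # copy2 preserves the source timestamp, so an unchanged source
        # does not trigger a new copy (or a recompilation) later on.
        shutil.copy2(src, dest)
    return dest


def load_model(src, models_dir):
    """Instantiate a CmdStanModel from the staged copy of the Stan file."""
    from cmdstanpy import CmdStanModel  # imported lazily (assumed installed)

    staged = stage_stan_file(src, models_dir)
    # Assumption: the executable is compiled alongside the staged file.
    return CmdStanModel(stan_file=str(staged))
```

With this layout the packaged models stay read-only, and only the staged copies and their executables live in the user directory.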
We can access the `CmdStanMCMC` object and extract sampler outputs. But the sampler output files are also written to a temporary directory (deleted upon session exit unless the `output_dir` argument is specified).
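A minimal sketch of keeping the sampler output files, assuming `model` is an already instantiated `CmdStanModel`. The helpers `sample_kwargs` and `run_sampler` and the default values are hypothetical; `chains`, `seed` and `output_dir` are arguments of `CmdStanModel.sample`, and `draws_pd` is a method of the returned `CmdStanMCMC` object.

```python
from pathlib import Path


def sample_kwargs(output_dir, chains=2, seed=8675309):
    """Keyword arguments for sample() that keep the CSV files in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return {"chains": chains, "seed": seed, "output_dir": str(out)}


def run_sampler(model, data, output_dir):
    """Run the sampler and return the draws as a pandas DataFrame."""
    # model is expected to be a cmdstanpy.CmdStanModel instance.
    fit = model.sample(data=data, **sample_kwargs(output_dir))
    return fit.draws_pd()
```

Without `output_dir`, the same call would leave its files in the temporary directory described above.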
There are more parallelization options via multi-threaded processing and cross-chain multi-threading. Previously we used `n_jobs=1`.
Stan
I just noticed that...
If this is really the case, I don't know what we do... we use it to calculate the Bayes factor.
Changing to CmdStanPy would require some changes. Although the tests fail (test-all.sh), CmdStanPy works fine on some toy examples. Before making a silly decision, I want to have a minimal set-up to run e.g. metagene profile periodicity estimation, etc., but currently one model fails to compile...