Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation details docs page, note on propto=True for log_density #192

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

WardBrian
Copy link
Collaborator

This is a bit of a weird thing I've been thinking about recently that it is good to alert users for. It's related to #165 and #180.

Basically propto=True requires we pass vars to Stan. Before #165, we were even calling grad, wasting a lot of computation. Since #165, we no longer call grad, but it still may lead to more work than you'd expect, since the Stan math library assumes (justifiably) that if you're calling a function with vars you will want gradients, and so it can do some pre-computation for you. Because we never call/use grad() in log_density, this is wasted effort.

The big offenders are the higher order functions like reduce_sum, which basically calculate their entire gradients in the "forward pass".

I wrote up a docs page on implementation details and added a section about this. I'm not sure if it should be linked other places or not.

@WardBrian WardBrian added the documentation Improvements or additions to documentation label Dec 5, 2023
@WardBrian
Copy link
Collaborator Author

Here is a demo of what I'm describing, using this model

import bridgestan
import timeit

model = bridgestan.StanModel('sir.stan', 'sir.data.json')

with open('sir.init.json') as f:
    params = model.param_unconstrain_json(f.read())


def one():
    model.log_density(params, propto=True)

def two():
    model.log_density(params, propto=False)

timeit.timeit(one, number=1000) # warms-up caches etc

time_one = timeit.timeit(one, number=1000)
time_two = timeit.timeit(two, number=1000)

print(f"propto=T: {time_one*1000:.1f}ms")
print(f"propto=F: {time_two*1000:.1f}ms")

This prints

propto=T: 329.7ms
propto=F: 81.4ms

So, over 4 times slower for this model!

Copy link
Collaborator

@bob-carpenter bob-carpenter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK as is, but I added some suggestions and am happy to re-review.

``log_density`` with ``propto=true``
------------------------------------

The log density function provided by a Stan model has
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would phrase this as having the ability to drop constants. Then I'd give simple recommendations:

  1. If you're running MCMC that needs gradients and only density up to proportion, then use propto = true. Setting propto=true will be at least as fast.
  2. To evaluate the log density on double values to match, use propto=true. Setting propto=true may be slower or faster, depending on the cost of calculating normalizing constants (propto=false) and the cost of autodiff (required to get the right answer if propto=true).

I don't think we need to say much more than that.

Why the double back ticks?

I couldn't understand what lines 34/35 were doing.

I'd just give simple recommendations:

  • doing gradient-based calculations with autodiff: use propto because it's faster
  • doing log density evals without autodiff: depending

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double back ticks are how ReStructuredText (that sphinx uses) wants code formatted. They're equivalent to single backticks in Markdown. Lines 34/35 are also a RST detail to get a link that is also code formatted. The result is that "|reduce_sum|" above gets rendered as reduce_sum

I agree phrasing it in terms of a suggestion for each case is clearer. I left the explanation in, but under a sub-heading for the curious.

Copy link
Owner

@roualdes roualdes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having things like this in BridgeStan cause it is overall details about Stan that affect our users and I definitely want to keep it here. I just would rather offer the advice appropriate for our users and then point to Stan documentation. But in the end, I don't know where this exists in Stan documentation.

docs/internals/details.rst Outdated Show resolved Hide resolved
``propto=True`` will be at least as fast as setting ``propto=False``
and is generally recommended (and the default value).

However, in the case of the ``log_density`` function (which does not calculate
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to restate the "only needs the log density up to a proportion" bit again in this paragraph.

Copy link
Collaborator Author

@WardBrian WardBrian Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? We're recommending setting it to False in this paragraph, which is safe for all usages

docs/internals/details.rst Show resolved Hide resolved
docs/internals/details.rst Outdated Show resolved Hide resolved
@WardBrian
Copy link
Collaborator Author

I think this is essentially absent from the Stan documentation, since calculating gradients is essentially taken for granted everywhere in Stan

@bob-carpenter
Copy link
Collaborator

I think this is essentially absent from the Stan documentation, since calculating gradients is essentially taken for granted everywhere in Stan

It's discussed in the efficiency section of the User's Guide and at length in the Reference Manual, which covers which things get autodiffed and which ones are just double-based.

@WardBrian
Copy link
Collaborator Author

Is there any place that would obviously lead the reader to the conclusion this discusses? e.g., that for the log_density function, propto can have dramatically different performance implications than it does for log_density_gradient?

@bob-carpenter
Copy link
Collaborator

Good point---we talk about everything you would need to draw this conclusion yourself, but I don't think we ever connect the dots. We probably should. I added an issue for the User's Guide efficiency chapter:

stan-dev/docs#692

@WardBrian WardBrian merged commit 97bcea3 into main Dec 7, 2023
19 checks passed
@WardBrian WardBrian deleted the docs/propto-true-notice branch December 7, 2023 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants