Mini sbibm #1335
Conversation
Alright, on the current examples the output looks like this: runtime increases linearly with the number of training simulations (currently 2k takes ~10 min on my laptop; with 1k it was about 5 min). It might also be nice to print runtimes on the right. Overall runtime, of course, also depends on how many different methods are included. I think some limited control over what is run would be nice, e.g.:
pytest --bm         # all base inference classes on defaults (similar to current behavior)
pytest --bm=NPE     # NPE with e.g. different density estimators
pytest --bm=SNPE    # SNPE_ABC 2-round test
...
Either way, there needs to be a limit on what is run, and every configuration should finish in a reasonable amount of time.
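For illustration, here is a minimal sketch of how such a --bm selector could be wired up in a conftest.py using standard pytest hooks. The option semantics and the "benchmark" marker name are assumptions for this sketch, not necessarily what the PR implements:

# conftest.py (sketch): gate benchmark tests behind a --bm option.
import pytest

def pytest_addoption(parser):
    # "pytest --bm" runs all benchmarks, "pytest --bm=NPE" restricts by name.
    parser.addoption(
        "--bm", action="store", nargs="?", const="all", default=None,
        help="Run benchmark-marked tests, optionally restricted to one method.",
    )

def pytest_collection_modifyitems(config, items):
    bm = config.getoption("--bm")
    for item in items:
        is_benchmark = "benchmark" in item.keywords
        if bm is None and is_benchmark:
            item.add_marker(pytest.mark.skip(reason="benchmarks only run with --bm"))
        elif bm is not None and not is_benchmark:
            item.add_marker(pytest.mark.skip(reason="regular tests skipped in benchmark mode"))
        elif bm not in (None, "all") and is_benchmark and bm.lower() not in item.name.lower():
            item.add_marker(pytest.mark.skip(reason=f"not selected by --bm={bm}"))

The name-based filter is crude (NPE would also match SNPE tests), but something along these lines keeps every configuration small enough to finish in a reasonable time.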
Overall, this is really great to have - thanks a lot for pushing this!
Love the relative coloring of the results 🎉
Added a couple of comments and questions for clarification.
Codecov Report: All modified and coverable lines are covered by tests ✅
@@            Coverage Diff             @@
##             main    #1335       +/-  ##
===========================================
- Coverage   89.38%   78.28%   -11.11%
===========================================
  Files         119      119
  Lines        8905     8905
===========================================
- Hits         7960     6971      -989
- Misses        945     1934      +989
Thanks for the edits!
All looks very good. I just have a couple of suggestions for renaming and removing comments.
The sample files are small, so no need to save them via git-lfs?
Thanks for the review. In total, the sample files are 2.78 MB, so small but not super small. We could reduce this by saving only 1000 of the 10_000 posterior samples (we only use 1k for evaluation at the moment anyway).
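As a rough illustration of that size reduction (the file name here is made up), the stored reference samples could simply be thinned before committing them:

# Sketch: keep only 1_000 of the 10_000 stored reference posterior samples.
import torch

samples = torch.load("gaussian_linear_reference_posterior.pt")  # shape (10_000, dim)
torch.save(samples[:1_000], "gaussian_linear_reference_posterior.pt")

That would bring the total from roughly 2.78 MB down to around 0.3 MB.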
Yes, this will live in
This looks good now, thanks a ton! 🙏
What we would need is one place with a bit of documentation on how to use this. It would be for developers, so no need to make a tutorial. But maybe add a paragraph about this to contribute.md, e.g., where we write about the tests?
Yeah, contribute.md is a good place to do this. And yes, it should not interfere with packaging; it just adds a bit to the git repo.
This should now be ready to be merged. The last version with xdist support unfortunately did not print the results to the console and instead saved them in pickle files. The new version improves the xdist support:
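For context, a sketch of one way to make results survive pytest-xdist and still print to the console at the end; the file layout and helper name are assumptions, not necessarily what this PR does:

# conftest.py (sketch): collect benchmark results across xdist workers.
import json
import os
from pathlib import Path

RESULTS_DIR = Path(".bm_results")

def record_result(method: str, task: str, metric: float) -> None:
    # Called from inside a benchmark test. Each xdist worker writes to its own
    # file (PYTEST_XDIST_WORKER is set by pytest-xdist), so there are no write conflicts.
    RESULTS_DIR.mkdir(exist_ok=True)
    worker = os.environ.get("PYTEST_XDIST_WORKER", "main")
    with open(RESULTS_DIR / f"{worker}.jsonl", "a") as f:
        f.write(json.dumps({"method": method, "task": task, "metric": metric}) + "\n")

def pytest_terminal_summary(terminalreporter, exitstatus, config):
    # Runs on the controller after all workers have finished; prints the aggregated table.
    if not RESULTS_DIR.exists():
        return
    rows = []
    for path in RESULTS_DIR.glob("*.jsonl"):
        rows += [json.loads(line) for line in path.read_text().splitlines()]
    terminalreporter.write_line("")
    terminalreporter.write_line("Benchmark results:")
    for r in sorted(rows, key=lambda r: (r["task"], r["metric"])):
        terminalreporter.write_line(f"  {r['task']:<20} {r['method']:<10} {r['metric']:.3f}")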
Thanks again for pushing this. This is great!
What does this implement/fix? Explain your changes
This is a draft for some "benchmarking" capabilities integrated into sbi.
With pytest, we can roughly check that everything works by passing all tests. Some tests ensure that the overall methodology works "sufficiently" well on simplified Gaussian analytic examples. Certain changes might still pass all tests but nonetheless degrade performance/accuracy. Specifically, when implementing new methods or, e.g., changing default parameters, it is important to check that the implementation not only passes the tests but also works sufficiently well.
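As a rough sketch of the kind of check meant here (assuming the public sbi API with NPE and the c2st metric utility; sizes and the observation are made up for illustration), one could compare NPE posterior samples on a conjugate Gaussian task against its analytic posterior:

import torch
from torch.distributions import MultivariateNormal
from sbi.inference import NPE
from sbi.utils.metrics import c2st

sigma, dim, num_sims = 0.5, 2, 2_000
prior = MultivariateNormal(torch.zeros(dim), torch.eye(dim))

# Gaussian linear simulator: x | theta ~ N(theta, sigma^2 I).
theta = prior.sample((num_sims,))
x = theta + sigma * torch.randn_like(theta)

inference = NPE(prior=prior)
inference.append_simulations(theta, x).train()
posterior = inference.build_posterior()

x_o = torch.zeros(dim)
approx_samples = posterior.sample((1_000,), x=x_o)

# Conjugate analytic posterior: N(x_o / (1 + sigma^2), sigma^2 / (1 + sigma^2) * I).
post_var = sigma**2 / (1 + sigma**2)
true_posterior = MultivariateNormal(x_o / (1 + sigma**2), post_var * torch.eye(dim))
true_samples = true_posterior.sample((1_000,))

# C2ST close to 0.5 means the approximate and true posteriors are indistinguishable.
score = c2st(approx_samples, true_samples)
print(f"NPE on Gaussian linear: C2ST = {score.item():.3f}")

A benchmark run would report this score in the final table rather than asserting on a hard threshold.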
Does this close any currently open issues?
Prototype for #1325
Any relevant code examples, logs, error output, etc?
It should now work such that one simply has to use the custom --bm flag, e.g. pytest --bm. This flag disables regular testing and instead switches to a "benchmark" mode, which only runs tests that are marked as such and always passes them. Instead of asserting, these tests cache a metric for how well an implemented method solves a specific task (currently some examples in "bm_test.py").
Once it finishes, instead of passed/failed, it returns a table with the metric (we can still color methods that perform worse than expected).
Any other comments?